thegenemyers / DASCRUBBER

Alignment-based Scrubbing pipeline
Other
20 stars 7 forks source link

Scrubbing metrics - interpretation help #15

Closed chrisgulvik closed 6 years ago

chrisgulvik commented 6 years ago

The metrics reported when -v is invoked are impressive, and I'd like to make sure I understand them correctly. I'm trying to log changes made from scrubbing based on the categories you mention in your blog that DAStrim and DASpatch perform (chimera breaks, adapter clipping, low quality patches, and low quality discarded) and this post helps with most of the interpretation.

DAStrim -v -c99 -b30 -g23 reads reads.reads.las reports:

    Input:     61,121 (100.0%) reads      644,890,459 (100.0%) bases
    Trimmed:    5,958 (  9.7%) reads       32,998,814 (  5.1%) bases
    5' trim:   32,485 ( 53.1%) reads       44,246,900 (  6.9%) bases
    3' trim:   49,885 ( 81.6%) reads       24,146,502 (  3.7%) bases
    Adapter:   11,721 ( 19.2%) reads       98,326,843 ( 15.2%) bases
    Gaps:      92,030 (150.6%) gaps        49,591,400 (  7.7%) bases
      Low QV:  66,439 (108.7%) gaps        15,178,400 (  2.4%) bases
      Span'd:  23,152 ( 37.9%) gaps        28,154,400 (  4.4%) bases
      Break:    2,439 (  4.0%) gaps         6,258,600 (  1.0%) bases
    Clipped:  102,488 clips               199,719,059 ( 31.9%) bases
    Patched:   89,591 patches              43,332,800 (  6.7%) bases

(1.) Does this mean 11,721 reads with 98,326,843 bp total length had adapters detected, but the full 98 Mbp weren't removed because only a portion of those reads were adaptamers? I'd like to know the total bp discarded due to being adapters.

(2.) How can I identify the total number of bp that are discarded due to low quality? If I understand correctly, some low quality sequences (from DAStrim -b30 above) that lack corresponding reads to be patched by (absent alignments or only aligns to another low QV read segments) would be trimmed off. Would I use 32,998,814 bases and remove the number of masked repeat bases? DASedit -v reads patched_reads reports "5,958 reads and 206,474,159 bases were trimmmed by scrubbing" so maybe 206,474,159 bp - 98,326,843 bp adapters - masked repeat bases?

(3.) Are reads with "patches failed" saved or discarded? DASpatch -v reported "153 out of 89591 total patches failed"

thegenemyers commented 6 years ago

Hi Christopher, find my responses after each question. Gene

On 7/19/18, 11:09 PM, Christopher Gulvik wrote:

The metrics reported when |-v| is invoked are impressive, and I'd like to make sure I understand them correctly. I'm trying to log changes made from scrubbing based on the categories you mention in your blog https://dazzlerblog.wordpress.com/command-guides/dascrubber-command-guide/ that DAStrim and DASpatch perform (chimera breaks, adapter clipping, low quality patches, and low quality discarded) and this post https://dazzlerblog.wordpress.com/2017/04/22/1344/ helps with most of the interpretation.

|DAStrim -v -c99 -b30 -g23 reads reads.reads.las| reports:

Input: 61,121 (100.0%) reads 644,890,459 (100.0%) bases Trimmed: 5,958 ( 9.7%) reads 32,998,814 ( 5.1%) bases 5' trim: 32,485 ( 53.1%) reads 44,246,900 ( 6.9%) bases 3' trim: 49,885 ( 81.6%) reads 24,146,502 ( 3.7%) bases Adapter: 11,721 ( 19.2%) reads 98,326,843 ( 15.2%) bases Gaps: 92,030 (150.6%) gaps 49,591,400 ( 7.7%) bases Low QV: 66,439 (108.7%) gaps 15,178,400 ( 2.4%) bases Span'd: 23,152 ( 37.9%) gaps 28,154,400 ( 4.4%) bases Break: 2,439 ( 4.0%) gaps 6,258,600 ( 1.0%) bases Clipped: 102,488 clips 199,719,059 ( 31.9%) bases Patched: 89,591 patches 43,332,800 ( 6.7%) bases

(1.) Does this mean 11,721 reads with 98,326,843 bp total length had adapters detected, but the full 98 Mbp weren't removed because only a portion of those reads were adaptamers? I'd like to know the total bp discarded due to being adapters.

The number of adaptamers found is unusually high, typically this is 1%, so I'm a bit concerned about the nature of the data set. Assuming however that everything is copacetic, at 11,721 locations, the pattern for unremoved adaptemer was found. In some cases there may be 2 or 3 of these in a single read, so the percentage is perhaps a bit misleading, it is just 11,721/61,121. By way of further illustration the "Gaps" percentage is 150.6% -- which really just says there are on average 1.5(06) gaps found in every read.

Yes, unfortunately 15.2% of all your bases got trimmed away -- suppose a read contains a single unremoved adaptamer. Then the "half" to the left is the complement of the "half" to the right. The scrubber keeps only the longer half. It could keep the second half as another read of the insert in the same well, but that's not how it currently works, as the assumption was the adapter rate is very low (1%). The PacBio softare is suppose to have already detected almost all of these, and my scrubber is just meant to capture the few missed cases. Again I don't know why your rate is so super high.

(2.) How can I identify the total number of bp that are discarded due to low quality? If I understand correctly, some low quality sequences (from |DAStrim -b30| above) that lack corresponding reads to be patched by (absent alignments or only aligns to another low QV read segments) would be trimmed off. Would I use 32,998,814 bases and remove the number of masked repeat bases? |DASedit -v reads patched_reads| reports "5,958 reads and 206,474,159 bases were trimmmed by scrubbing" so maybe 206,474,159 bp - 98,326,843 bp adapters - masked repeat bases?

You lost a total of 199Mbp (31.9%) of your data due to scrubbing. This is unusually high, typically, its 20% or so. That 199Mbp was lost due to either (a) the entire read is deamed junk and thrown away ("Trimmed"), (b) the beginning or ending of the read was pruned back due to low quality ("5' and 3' Trim"), (c) an interior gap of low quality could not be spanned and the gap region is lost when the read is broken into two parts ("Break"), (d) you lost sections of a read due to a missed adapter as described above. So if you add up (a), (b), and (c) I would say that is what you lost to low quality. And then (d), missed adapter, is the other reason for data loss.

(3.) Are reads with "patches failed" saved or discarded? |DASpatch -v| reported "153 out of 89591 total patches failed"

If a patch fails, then the read is split into two parts, each of which is kept (like the "Break" line of "Gaps".

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/thegenemyers/DASCRUBBER/issues/15, or mute the thread https://github.com/notifications/unsubscribe-auth/AGkkNscvG71RneVfJ7hzeMMG8oca20Hqks5uIPWXgaJpZM4VXGSV.

chrisgulvik commented 6 years ago

Thanks for the thorough response! I chose a bad example (high adapters from unrolled reads) but your explanations helped. Sorry, I forgot to respond earlier to close this.