s4hts / HTStream

A high throughput sequence read toolset using a streaming approach facilitated by Linux pipes
https://s4hts.github.io/HTStream/
Apache License 2.0

Flag use #235

Closed emmadoughty closed 3 years ago

emmadoughty commented 4 years ago

Hi there, I'm using SuperDeduper but am a bit stuck on what the flags mean to be able to optimise it. The -q/-a flags require a quality score as the argument with a value 1-10,000. What is this quality score based on? What does it also mean to "have the read written automatically" and "consider a read informative"? Many thanks!

samhunter commented 3 years ago

Hi @emmadoughty sorry to reply so late.

The -q option is part of a speed and memory tradeoff. SuperDeduper takes a heuristic approach to deciding which copy of a PCR duplicate to keep. With unlimited system memory, we would write out the PCR duplicate with the highest average quality (or perhaps use all copies of the PCR duplicate to error correct). Unfortunately, this requires keeping the best representative for every set of PCR duplicates in memory until all reads have been processed. This is impractical for any large data set, especially when PCR duplication rates are low, because most reads then form their own set and have to be held. The alternative is to keep the first read that passes some minimum average quality score, ensuring that the PCR duplicate written out is "pretty good" and probably the best or nearly as good as the best. This approach is fast and uses very little memory.
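To make the tradeoff concrete, here is a minimal Python sketch of the "first read that is good enough" idea. It is not HTStream's actual code: the `Read` type, the function names, the 0-based `start`/`length` key, and the fallback that holds the best low-quality read until the end are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Iterator, NamedTuple, Sequence, Set


class Read(NamedTuple):
    """Hypothetical read record; not an HTStream type."""
    name: str
    seq: str
    quals: Sequence[int]  # per-base quality scores, assumed non-empty


def mean_quality(read: Read) -> float:
    return sum(read.quals) / len(read.quals)


def dedup_first_good(
    reads: Iterable[Read], start: int, length: int, min_avg_qual: float
) -> Iterator[Read]:
    """Yield one representative per key using the streaming heuristic."""
    written: Set[str] = set()      # keys already emitted; only the key is stored
    pending: Dict[str, Read] = {}  # best below-threshold read per key (assumed fallback)

    for read in reads:
        key = read.seq[start:start + length]
        if key in written:
            continue  # duplicate of a read that was already written
        if mean_quality(read) >= min_avg_qual:
            # Good enough: write it now and stop tracking this key's reads.
            written.add(key)
            pending.pop(key, None)
            yield read
        else:
            best = pending.get(key)
            if best is None or mean_quality(read) > mean_quality(best):
                pending[key] = read

    # Keys that never produced a passing read fall back to their best copy.
    yield from pending.values()
```

In this sketch, a lower threshold means more keys are resolved immediately and fewer reads sit in `pending`, while a higher threshold holds more reads in memory in exchange for a better chance of emitting the highest-quality copy.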

The -a option is used to filter reads that have very low quality base calls in the "key" interval defined by -s and -l. If a base within this region is too low in quality, the region isn't considered "informative" and the read is thrown out. The idea is that reads with lots of errors in the key are likely to have unique, random keys by chance, when in reality they might be PCR duplicates. The downside of this approach is that runs with base-calling problems can end up having lots of reads thrown out. If this happens, it's a good idea to check your read qualities and adjust the -s parameter to avoid the low quality region.
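The key filter can be sketched the same way. This is an illustration only, under two assumptions: that the threshold is compared against the mean quality over the key interval (a strict per-base check would use `min(window)` instead), and that `start` is a 0-based offset. The function name `key_is_informative` is hypothetical.

```python
from typing import Sequence


def key_is_informative(
    quals: Sequence[int], start: int, length: int, min_key_qual: float
) -> bool:
    """Return True if the key interval is of high enough quality to trust.

    Assumes length > 0. Reads too short to contain the whole key are
    treated as uninformative.
    """
    window = quals[start:start + length]
    if len(window) < length:
        return False
    # Assumption: mean quality over the key; use min(window) for a per-base check.
    return sum(window) / length >= min_key_qual


# A read whose first four bases were called poorly.
quals = [2, 2, 2, 2] + [35] * 8

print(key_is_informative(quals, start=0, length=4, min_key_qual=20))
print(key_is_informative(quals, start=4, length=4, min_key_qual=20))
```

The two calls show why moving the key start helps: with the key placed on the low quality bases the read is rejected, and with the key shifted past them the same read is kept.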