onecodex / finch-rs

A genomic minhashing implementation in Rust
https://www.onecodex.com
MIT License

What does filtering entail (--no-filter flag)? #20

Closed DoreenM closed 6 years ago

DoreenM commented 6 years ago

What does the FASTQ filtering entail? It's not described in detail in the README. Does it refer to masking low-quality reads or to 5'/3' trimming? What are the filtering defaults?

bovee commented 6 years ago

FASTQ filtering is just the two filters described in the documentation: the count-based filter and the strand filter. The default error rate for the count-based filter is 1% and the default strand filter is 10% (so anything that's seen at more than 9x greater abundance in one orientation than in the other is filtered). It may not be obvious, but you can also see these defaults with finch sketch --help.
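To make the 10% strand threshold concrete, here is a minimal sketch of that kind of check. It is not finch's actual code: the `KmerCount` type and both function names are hypothetical, and it assumes the filter compares the share of the less common orientation against the threshold.

```rust
/// Hypothetical per-k-mer tally of observations by orientation.
/// This is an illustrative type, not one taken from finch.
#[derive(Debug, Clone, PartialEq)]
pub struct KmerCount {
    pub kmer: String,
    pub forward: u64,
    pub reverse: u64,
}

/// Returns true when the less common orientation makes up at least
/// `min_fraction` of all observations of this k-mer (assumed semantics).
/// With `min_fraction = 0.1`, k-mers skewed beyond 9:1 are rejected.
pub fn passes_strand_filter(k: &KmerCount, min_fraction: f64) -> bool {
    let total = k.forward + k.reverse;
    if total == 0 {
        // Nothing was observed, so there is nothing to keep.
        return false;
    }
    let minor = k.forward.min(k.reverse);
    // Equivalent to minor / total >= min_fraction, without the division.
    (minor as f64) >= min_fraction * (total as f64)
}

/// Keeps only the k-mers that pass the strand check.
pub fn apply_strand_filter(counts: Vec<KmerCount>, min_fraction: f64) -> Vec<KmerCount> {
    counts
        .into_iter()
        .filter(|k| passes_strand_filter(k, min_fraction))
        .collect()
}
```

Under these assumptions, a k-mer seen 60 times forward and 40 times reverse is kept, while one seen 95 times forward and 5 times reverse (a 19:1 skew) or 20 times forward and never in reverse is dropped.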

Adding low-quality trimming for FASTQs is an open issue for us; in general this shouldn't be a problem for most data because the count filter works fairly well, but we have seen some really bad sequencing runs where there are enough errors to cause extra distance between similar samples.

Quality filtering should also cover most 5'/3' trimming except for adapters. For those, the "strandedness" filter should work, since library preps generally capture both strand directions while adapters are only ever present in the forward orientation.

Hope this answers your questions, but please let me know if you have others!

boydgreenfield commented 6 years ago

@DoreenM Closing this issue, but feel free to re-open or open a new one if you have more questions!