[Question]; How is % duplication calculated compared to FastQC

smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data

https://falco.readthedocs.io

GNU General Public License v3.0


[Question]; How is % duplication calculated compared to FastQC #54

Closed tamuanand closed 1 year ago

tamuanand commented 1 year ago

I have a question on Falco and how it calculates % duplication.

FastQC tracks only the first 100K distinct sequences when computing duplication levels and overrepresented sequences, as the FastQC author explains here: https://github.com/s-andrews/FastQC/issues/64#issuecomment-727840599

Question: How does Falco calculate this - does it also use the first 100K different sequences?

Thanks in advance.

andrewdavidsmith commented 1 year ago

It is set at 100k also. The relevant line of code is here: https://github.com/smithlabcode/falco/blob/20b2a858ba193bd9850951050ccf6da14f185da4/src/FalcoConfig.hpp#L153

(I know it's tough to grep for. It took me a few tries because the value isn't written out as the literal "100000".)
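To illustrate what a cutoff on distinct sequences means in practice, here is a minimal Python sketch. It is not Falco's or FastQC's actual code: the names `UNIQUE_CUTOFF`, `count_duplicates`, and `percent_duplication` are hypothetical, and the percentage computed here is a naive ratio rather than the formula either tool reports. The assumption is only that the first 100k distinct sequences are tracked, and that later reads are counted only if they match one of those.

```python
from collections import Counter

# Assumed cutoff, matching the 100k figure quoted in this thread.
UNIQUE_CUTOFF = 100_000


def count_duplicates(reads, unique_cutoff=UNIQUE_CUTOFF):
    """Count occurrences of the first `unique_cutoff` distinct sequences.

    Once the cutoff is reached, a read that matches an already-tracked
    sequence still increments its count; a read with a new sequence is
    ignored.
    """
    counts = Counter()
    for seq in reads:
        if seq in counts:
            counts[seq] += 1
        elif len(counts) < unique_cutoff:
            counts[seq] = 1
    return counts


def percent_duplication(counts):
    """Naive share of tracked reads that repeat an earlier tracked read."""
    tracked = sum(counts.values())
    if tracked == 0:
        return 0.0
    return 100.0 * (tracked - len(counts)) / tracked


# Tiny example with a cutoff of 2: "GGG" arrives after the cutoff and is ignored.
reads = ["AAA", "CCC", "AAA", "GGG", "CCC", "AAA"]
counts = count_duplicates(reads, unique_cutoff=2)
print(dict(counts), percent_duplication(counts))
```

The point of the cutoff is to bound memory: the table never holds more than `unique_cutoff` keys, no matter how many reads the file contains.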

The cutoff can be changed by modifying the source. There is of course a chance of a bug on our side, so if you see results that look inconsistent with FastQC, please open an issue, or post the information needed to reproduce the problem in this one.
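For anyone who wants to check consistency between the two tools, a rough sketch follows. It assumes that both tools write a `fastqc_data.txt` containing a tab-separated line that starts with `#Total Deduplicated Percentage`, as FastQC does. Whether Falco's output matches that layout exactly should be verified against a real report. The function names are hypothetical.

```python
MARKER = "#Total Deduplicated Percentage"


def total_deduplicated_percentage(lines):
    """Return the value on the first marker line, or None if it is absent."""
    for line in lines:
        if line.startswith(MARKER):
            fields = line.rstrip("\r\n").split("\t")
            if len(fields) > 1:
                return float(fields[1])
    return None


def duplication_gap(lines_a, lines_b):
    """Absolute difference between two reports, or None if either lacks the value."""
    a = total_deduplicated_percentage(lines_a)
    b = total_deduplicated_percentage(lines_b)
    if a is None or b is None:
        return None
    return abs(a - b)


# Usage with in-memory lines; with real reports, pass open file handles instead.
fastqc_lines = [">>Sequence Duplication Levels\tpass\n",
                "#Total Deduplicated Percentage\t84.5\n"]
falco_lines = [">>Sequence Duplication Levels\tpass\n",
               "#Total Deduplicated Percentage\t84.0\n"]
print(duplication_gap(fastqc_lines, falco_lines))
```

A nonzero gap on the same input file would be the kind of reproducible detail worth including in a bug report.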

tamuanand commented 1 year ago

Thanks. No, I am not seeing any inconsistent results - I was just curious to know how this was done in Falco.