smithlabcode / falco

A C++ drop-in replacement of FastQC to assess the quality of sequence read data
https://falco.readthedocs.io
GNU General Public License v3.0

[Feature request] Add option to subsample reads #35

Closed y9c closed 1 year ago

y9c commented 1 year ago

Most of the time, we run falco (FastQC) to get a rough estimate of data quality, so we do not need to parse every single read in the FASTQ file. I think we could randomly subsample a certain number or fraction of reads to increase speed or save computational power.

Is it possible to add an argument (--subsample/-s) for this?

Thanks!

toddrichmond commented 1 year ago

This would be incredibly useful to me as well.

guilhermesena1 commented 1 year ago

That's definitely a useful feature and an easy thing to implement. I can take care of it in the coming days!

guilhermesena1 commented 1 year ago

Thanks for the suggestion to add this functionality! I gave it a shot at 009c28e. You definitely seem to be onto something: processing only every 20th read, for example, seems to give very similar results in most FASTQ files but is more than twice as fast as processing the full dataset :) It might need more testing to make sure nothing broke, and I would welcome any feedback.
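For readers curious what "processing only every Nth read" can look like, here is a minimal C++ sketch. It is not the code from 009c28e and does not reflect falco's internals: the names `SubsampleStats`, `process_every_nth`, and `process_every_nth_text` are hypothetical, the input is assumed to be plain 4-line FASTQ records, and "processing" is reduced to counting bases as a stand-in for the real per-read analysis.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical summary of what was seen versus what was analyzed.
struct SubsampleStats {
  std::size_t reads_seen = 0;       // every record found in the input
  std::size_t reads_processed = 0;  // records kept by the subsampling step
  std::size_t bases_processed = 0;  // total sequence length of kept records
};

// Reads 4-line FASTQ records from `in` and analyzes every `step`-th one,
// starting with the first record. step == 1 analyzes every read.
// Assumption: records are not wrapped across multiple lines.
SubsampleStats process_every_nth(std::istream &in, std::size_t step) {
  assert(step >= 1);
  SubsampleStats stats;
  std::string name, seq, plus, qual;
  while (std::getline(in, name) && std::getline(in, seq) &&
         std::getline(in, plus) && std::getline(in, qual)) {
    const bool keep = (stats.reads_seen % step == 0);
    ++stats.reads_seen;
    if (!keep) continue;  // skipped reads are parsed but not analyzed
    ++stats.reads_processed;
    stats.bases_processed += seq.size();  // placeholder for real QC modules
  }
  return stats;
}

// Convenience wrapper for FASTQ text that is already in memory.
SubsampleStats process_every_nth_text(const std::string &fastq,
                                      std::size_t step) {
  std::istringstream in(fastq);
  return process_every_nth(in, step);
}
```

In this sketch the skipped records are still read from the stream, so the only saving is the per-read analysis that is bypassed. Taking every Nth read is deterministic, unlike the random subsampling suggested in the original request.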

guilhermesena1 commented 1 year ago

I'm going to close this one (this feature was added in 1.1.0), but if there are further problems with the flag or it doesn't work as intended, please feel free to reopen or create a new issue!