mbhall88 / rasusa

Randomly subsample sequencing reads or alignments
https://doi.org/10.21105/joss.03941
MIT License

Relationship to filtlong #16

Closed tseemann closed 4 years ago

tseemann commented 4 years ago

FYI - a comment from a colleague:

You can still use filtlong with the settings below to focus on quality only and to more or less ignore length in the scoring metric:

--min_length 500
--mean_q_weight 10
--length_weight 1
--target_bases $((DEPTH * GENOMESIZE))

mbhall88 commented 4 years ago

Yes, this is exactly how I have been using filtlong previously (exact same weights and all). There is still some filtering of read length happening here, which is a (subtle) bias. I am very keen to keep this project out of the filtering business as there are already great tools for this.
Removing the --min_length option here obviously makes it much less biased, but there is still a scoring system at work, which is not strictly random. In my experience with these weightings, it does not focus purely on quality; there is definitely still some length-favouring that happens. I guess my aim with rasusa was to provide as few parameters as possible, i.e. users don't need to play with scoring weights etc. Maybe I am being silly and everyone will keep using filtlong, which is also fine.
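
To make the "strictly random" distinction concrete, here is a minimal Python sketch of score-free subsampling: shuffle the reads and keep them until the target number of bases (depth × genome size, the same quantity passed to --target_bases above) is reached. This only illustrates the idea and is not rasusa's actual implementation; the function name `subsample_reads` and its parameters are hypothetical.

```python
import random


def subsample_reads(read_lengths, genome_size, depth, seed=None):
    """Pick reads uniformly at random until the target base count is reached.

    Hypothetical helper for illustration only. No read is scored by
    length or quality; every read has the same chance of being kept.
    Returns the sorted indices of the kept reads and their total bases.
    """
    target_bases = genome_size * depth
    rng = random.Random(seed)  # seeded so a run can be reproduced

    # Shuffle the read indices so the keep order is random.
    indices = list(range(len(read_lengths)))
    rng.shuffle(indices)

    kept = []
    total_bases = 0
    for i in indices:
        if total_bases >= target_bases:
            break
        kept.append(i)
        total_bases += read_lengths[i]

    return sorted(kept), total_bases
```

With 100 reads of 1,000 bp each, a genome size of 1,000 and a depth of 10, this keeps 10 reads (10,000 bases) regardless of which reads the shuffle selects. If the input holds fewer bases than the target, every read is kept.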

I have a section in the motivation where I mention how filtlong can be co-opted to do something similar. Do you think I need to provide better clarification around filtlong?

I don't know if this is also of interest, but in my local benchmarking rasusa was significantly faster than filtlong. But I don't feel comfortable focusing on this as I am not trying to compete with filtlong.

tseemann commented 4 years ago

No worries - all good - thanks for the explanation.