nf-core / eager

A fully reproducible and state-of-the-art ancient DNA analysis pipeline
https://nf-co.re/eager
MIT License

Clarify "very short reads" in helptext of `clip_readlength` #887

Closed by TCLamnidis 2 years ago

TCLamnidis commented 2 years ago

The current helptext reads:

Defines the minimum read length that is required for reads after merging to be considered for downstream analysis after read merging. Default is 30. Note that performing read length filtering at this step is not reliable for correct endogenous DNA calculation, when you have a large percentage of very short reads in your library - such as retrieved in single-stranded library protocols. When you have very few reads passing this length filter, it will artificially inflate your endogenous DNA by creating a very small denominator. In these cases it is recommended to set this to 0, and use --bam_filter_minreadlength instead, to filter out 'un-usable' short reads after mapping.
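To make the "very small denominator" effect in the helptext concrete, here is a minimal sketch with made-up numbers. It assumes endogenous DNA is computed as mapped reads divided by the reads that enter mapping; the function name `endogenous_pct` and all read counts are hypothetical and are not taken from the pipeline.

```python
def endogenous_pct(mapped_reads, reads_into_mapping):
    """Endogenous DNA (%) under the assumed definition: mapped / reads entering mapping."""
    if reads_into_mapping <= 0:
        raise ValueError("reads_into_mapping must be positive")
    return 100.0 * mapped_reads / reads_into_mapping


# Hypothetical library: 1,000,000 merged reads, 900,000 of them shorter than 30 bp.
total_merged = 1_000_000
short_reads = 900_000
# Assume 1,000 reads of at least 30 bp map to the reference in both scenarios.
mapped_long = 1_000

# Scenario A: length filter of 30 applied before mapping.
# The short reads never reach the mapper, so the denominator shrinks to 100,000.
pre_mapping_filter = endogenous_pct(mapped_long, total_merged - short_reads)

# Scenario B: no length filter before mapping, short mapped reads removed afterwards.
# All merged reads stay in the denominator.
post_mapping_filter = endogenous_pct(mapped_long, total_merged)

print(f"filter before mapping: {pre_mapping_filter:.2f}%")
print(f"filter after mapping:  {post_mapping_filter:.2f}%")
```

With these invented counts, the same 1,000 mapped reads give a tenfold higher percentage when the short reads are discarded before mapping, which is the inflation the helptext warns about.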

We should clarify what "very short reads in your library" means. To my understanding, that would be a read length distribution with its peak below 20 bp. Mapping all sequenced fragments adds considerable computational work, and this can be avoided when the peak of the length distribution is still within the 20-25 bp range. In such cases I think users could lower `clip_readlength` without actually setting it to 0, avoiding all the extra computation while still getting an endogenous DNA percentage comparable to the one obtained with default settings.
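The rule of thumb proposed in this comment could be sketched roughly as follows. This is only an illustration: the helper names `length_peak` and `suggest_strategy` are hypothetical, the 20 and 25 bp bounds are the ones mentioned in the comment, and treating a peak above 25 bp as "keep the default" is an added assumption that the comment does not state.

```python
from collections import Counter


def length_peak(lengths):
    """Return the most common read length (ties resolved towards the shorter length)."""
    if not lengths:
        raise ValueError("no read lengths given")
    counts = Counter(lengths)
    top = max(counts.values())
    return min(length for length, n in counts.items() if n == top)


def suggest_strategy(lengths, low=20, high=25):
    """Map the length distribution peak to one of three coarse suggestions."""
    peak = length_peak(lengths)
    if peak < low:
        # "Very short reads": peak below 20 bp, as interpreted in the comment.
        return "set clip_readlength to 0 and filter after mapping"
    if peak <= high:
        # Peak still within 20-25 bp: a lowered, non-zero threshold may suffice.
        return "lower clip_readlength below the default"
    # Assumption (not from the comment): longer peaks work with the default.
    return "keep the default clip_readlength"


# Made-up merged read lengths for three hypothetical libraries.
very_short = [15] * 5 + [18] * 10 + [35] * 3
short = [22] * 10 + [30] * 4
typical = [40] * 5 + [22] * 2

for name, lengths in [("very_short", very_short), ("short", short), ("typical", typical)]:
    print(name, length_peak(lengths), suggest_strategy(lengths))
```

The sketch deliberately returns a coarse suggestion rather than a specific threshold, since the comment does not say how far the value should be lowered.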