open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License

built-in adaptor trimming #146

Open gspracklin opened 4 years ago

gspracklin commented 4 years ago

This feature would make life easier (i.e., filtering out low-quality reads and reads containing adapters). @golobor

This has worked for me so far (using fastp).

[Image: Screenshot 2019-12-31 12 49 49]

Phlya commented 4 years ago

Does it actually change the results significantly? And how much extra time does it need?

gspracklin commented 4 years ago

My guess is that it doesn't significantly change the results right now. However, as Illumina read lengths increase, it might become more of a problem. More specifically, I don't think it's possible to increase the insert size without disrupting bridge amplification (though perhaps that doesn't apply to patterned flow cells), so as read lengths increase, the number of reads containing adapter sequence could increase. Also, isn't trimming just generally recommended as good practice?

I'll try to get around to timing the differences at some point.
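
To make the read-length argument above concrete, here is a minimal sketch of the arithmetic. It assumes a simplified model in which a read runs into the adapter as soon as it is longer than the insert; the function names and the insert sizes are made up for illustration and do not come from any real library or dataset.

```python
from typing import Sequence


def adapter_bases(read_length: int, insert_size: int) -> int:
    """Bases at the 3' end of a read that come from the adapter (simplified model)."""
    return max(0, read_length - insert_size)


def fraction_with_adapter(insert_sizes: Sequence[int], read_length: int) -> float:
    """Fraction of fragments whose reads would contain some adapter sequence."""
    hits = sum(1 for size in insert_sizes if adapter_bases(read_length, size) > 0)
    return hits / len(insert_sizes)


# Hypothetical insert sizes (bp), for illustration only.
inserts = [80, 120, 160, 200, 300, 400]

for read_length in (50, 100, 150, 250):
    frac = fraction_with_adapter(inserts, read_length)
    print(f"read length {read_length}: {frac:.2f} of fragments read into the adapter")
```

With the insert-size distribution held fixed, the fraction can only grow as the read length grows, which is the concern raised above.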

golobor commented 4 years ago

I like the implementation - fastp seems like a good package/reliable dependency and we can specify the exact trimming sequence in the config file.

Though, I do have some doubts about whether it's entirely necessary. On one hand, a recent report on bioRxiv <https://www.biorxiv.org/content/10.1101/833962v1> showed that trimming is not needed for RNA-seq data if local aligners are used. This happens because the adapter part of a read would form a separate alignment (or, most likely, a null/non-unique alignment), which won't affect counting. On the other hand, pairtools parse can be too smart for its own good - we take into account the number and relative order of alignments in a read, such that an adapter at the 3' end can effectively convert a "pair" into a "walk". For this reason, trimming can actually improve results in extreme cases. On the third hand, trimming modifies sequences, so the final bams won't contain the raw sequences anymore. This may screw over people who store sequencing data in bams as opposed to fastq. Tbh, I know exactly zero labs who do that (DCIC rely on their own pipeline).

So, all in all, it's slightly complicated, but I think making it optional is not a bad idea. George, let me know if you're interested in making a PR
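
The "pair" versus "walk" point above can be sketched with a toy model. This is not the pairtools implementation: the segment representation, the `max_gap` threshold of 20 bp, and both function names are assumptions made up for illustration. The only idea it borrows from the discussion is that a long unaligned stretch (such as an adapter) counts as an extra, null segment.

```python
from typing import List, Tuple

# (start, end) of an aligned segment within a read, 0-based, end-exclusive.
Segment = Tuple[int, int]


def count_segments(read_len: int, aligned: List[Segment], max_gap: int = 20) -> int:
    """Count aligned segments plus uncovered stretches longer than max_gap.

    max_gap is a hypothetical threshold: shorter uncovered stretches are ignored,
    longer ones are counted as "null" segments.
    """
    n = 0
    pos = 0  # end of the region covered so far
    for start, end in sorted(aligned):
        if start - pos > max_gap:
            n += 1  # uncovered stretch before this alignment
        n += 1  # the alignment itself
        pos = max(pos, end)
    if read_len - pos > max_gap:
        n += 1  # uncovered 3' tail, e.g. an untrimmed adapter
    return n


def classify(side1_segments: int, side2_segments: int) -> str:
    """Toy rule: exactly one segment on each side is a pair, anything else a walk."""
    return "pair" if side1_segments == 1 and side2_segments == 1 else "walk"


clean_side = count_segments(150, [(0, 150)])      # fully aligned 150 bp read
untrimmed_side = count_segments(150, [(0, 110)])  # 40 bp adapter left at the 3' end
trimmed_side = count_segments(110, [(0, 110)])    # same read after trimming

print(classify(clean_side, clean_side))      # both sides clean
print(classify(untrimmed_side, clean_side))  # adapter adds a null segment
print(classify(trimmed_side, clean_side))    # trimming removes it again
```

In this toy model the untrimmed read is classified as a walk only because of the adapter tail, and trimming restores it to a pair, which is the extreme case described above.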

agalitsyna commented 4 years ago

I cannot agree completely with the statement that "the adapter part of a read would form a separate alignment (or, most likely, null/non-unique alignment) which won't affect counting."

First, in the case of a multi-mapping read, the presence of the adapter might easily force the mapping to one particular genomic site, although the real location is unknown.

Second, the paper from Liao & Shi relies on only ~1000 genes quantified by RT-PCR, which might not include cases with multiple mappings. This result cannot be easily transferred to the mapping of whole-genome data in Hi-C-related methods, where we certainly have many more locations than ~1000 genes.

Third, Hi-C-related methods with complex ligation procedures are emerging, and they sometimes require adapter trimming, e.g. Hi-CO: https://doi.org/10.1016/j.cell.2018.12.014 and MARGI: https://dx.doi.org/10.1016%2Fj.cub.2017.01.011. It would be great to account for "pair-oriented" methods like these.