nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

Option to keep chimeric reads after UMI deduplication #1373

Open siddharthab opened 1 week ago

siddharthab commented 1 week ago

Description of feature

Following up on https://github.com/nf-core/rnaseq/pull/1369#discussion_r1744079079.

@MatthiasZepper Please take over this issue.

MatthiasZepper commented 1 week ago

While reviewing #1369, I noticed that we have set the parameter --chimeric-pairs=discard for umi-tools and wondered if that is actually a good default choice. I planned to briefly discuss that in the #rnaseq_dev Slack channel, but since it is now an official issue, we can also track it here :-)

From a purely biological point of view, the transcriptome alignments in particular may contain a significant number of chimeric read pairs, simply because of an unannotated splice variant or an antisense long non-coding RNA spanning several annotated transcripts. Also, many users run the pipeline on cancer data, where fusion genes or chromosomal rearrangements are to be expected.

However, I have since read in the UMI-tools FAQ that disabling the option significantly increases the memory demands. The computational cost therefore clearly argues for disregarding this biological complexity by default and leaving it to the users of the pipeline to look at chimeric transcripts specifically, if they are of interest.
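
To make the trade-off concrete, here is a minimal sketch of how an opt-in switch could map onto the `--chimeric-pairs` argument while keeping `discard` as the default. The helper `umi_dedup_args` and its `keep_chimeric` flag are hypothetical names used only for illustration; they are not existing pipeline parameters. Only `--paired` and the `discard`/`use` values of `--chimeric-pairs` are taken from umi_tools itself.

```python
from typing import List


def umi_dedup_args(keep_chimeric: bool = False, paired: bool = True) -> List[str]:
    """Build extra arguments for `umi_tools dedup` (hypothetical helper).

    keep_chimeric is an assumed opt-in flag, not an existing pipeline option.
    """
    args: List[str] = []
    if paired:
        args.append("--paired")
        # `discard` drops chimeric pairs (the current pipeline setting);
        # `use` keeps them in the deduplicated output.
        mode = "use" if keep_chimeric else "discard"
        args.append(f"--chimeric-pairs={mode}")
    return args


# Default: chimeric pairs are discarded, keeping memory demands low.
print(" ".join(["umi_tools", "dedup", *umi_dedup_args()]))
# Opt-in: chimeric pairs are kept, e.g. for fusion-gene analyses.
print(" ".join(["umi_tools", "dedup", *umi_dedup_args(keep_chimeric=True)]))
```

With a design like this, the memory-friendly behaviour stays the default, and only users who are explicitly interested in chimeric transcripts pay the additional memory cost.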