Analysing UMI processed reads through Arriba

saty89 commented 2 weeks ago

Hi,

We currently use the QIAseq RNAFusion XP Assay and the corresponding software to call RNA fusions. Arriba, as a second caller, helps confirm calls or identify missed fusions. It performs well with any RNA assay (thanks for a great tool).

I am building an Arriba workflow that takes UMI-processed consensus reads as input (reads grouped by the same or similar UMI and sequence). Since UMI processing already takes care of PCR duplicates, I would appreciate your thoughts and some clarification on the following:

1. Turning off -u (dedup) in Arriba and no external dedup as well, since the reads are already UMI-processed. Will there be a problem doing so?
2. I noticed that when -u was turned on with UMI-processed reads as input, Arriba was still filtering out reads as duplicates. Does Arriba's internal duplicate detection use both start and stop coordinates for dedup marking? I believe these might be biological rather than technical duplicates, but I want to understand the mechanism better.

EDITED: to clarify, point 1 refers to running with -u and point 2 to running without -u.

Thanks for your help in advance!

suhrig commented 2 weeks ago

> Turning off -u (dedup) in Arriba and no external dedup as well, since the reads are already UMI processed. Will there be a problem doing so?

That is no problem. In fact, this is precisely what this parameter was developed for. When you use -u, Arriba expects duplicates in the input to have either been marked with the BAM_FDUP flag (1024) or been merged.
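For readers unfamiliar with the flag mentioned above: BAM_FDUP is a single bit (0x400, decimal 1024) in the SAM FLAG field, so "marked as duplicate" comes down to a bit test. A minimal sketch follows; the function name is made up for illustration and is not part of Arriba or samtools.

```python
# BAM_FDUP is the SAM FLAG bit for "PCR or optical duplicate".
BAM_FDUP = 0x400  # decimal 1024


def is_marked_duplicate(flag: int) -> bool:
    """Return True if the duplicate bit is set in a SAM FLAG value.

    Hypothetical helper for illustration only.
    """
    return bool(flag & BAM_FDUP)


# 99 = paired (1) + proper pair (2) + mate reverse strand (32) + first in pair (64)
print(is_marked_duplicate(99))         # duplicate bit not set
print(is_marked_duplicate(99 + 1024))  # same read, additionally marked as duplicate
```

A FLAG value can carry many bits at once, which is why the test is a bitwise AND rather than a comparison with 1024.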

> Noticed that when -u was turned on with UMI-processed reads as input, Arriba was still filtering out reads as duplicates. Does Arriba internal duplication use both start and stop coordinates for dedup marking? I believe these might be biological rather than technical, but want to understand the working better.

If you use the flag -u, Arriba will remove alignments which have the flag BAM_FDUP (1024) set. If you don't use -u, then Arriba will remove alignments with identical start/end coordinates to another read. In your case, it sounds like there shouldn't be any reads with BAM_FDUP set, since duplicates were merged according to the UMIs. So Arriba should not filter any reads using the duplicates filter. Can you run samtools view -f1024 -c /path/to/your/file.bam? The output should be 0. If you can confirm this, then Arriba shouldn't remove any duplicates when using -u. If it still does, that would be a bug.
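The two modes described above can be contrasted on a toy example. This is only a simplified model of the behavior as stated in this thread, not Arriba's actual code: the record type, the function names, and the choice of (contig, start, end) as the coordinate key are all assumptions made for illustration.

```python
from typing import List, NamedTuple

BAM_FDUP = 0x400  # SAM FLAG bit for "PCR or optical duplicate" (1024)


class Alignment(NamedTuple):
    """Toy alignment record (hypothetical, not an Arriba or htslib type)."""
    name: str
    contig: str
    start: int
    end: int
    flag: int


def filter_flag_based(alignments: List[Alignment]) -> List[Alignment]:
    """Model of running with -u: trust upstream marking, drop only flagged reads."""
    return [a for a in alignments if not a.flag & BAM_FDUP]


def filter_coordinate_based(alignments: List[Alignment]) -> List[Alignment]:
    """Model of running without -u: keep the first read per (contig, start, end)."""
    seen = set()
    kept = []
    for a in alignments:
        key = (a.contig, a.start, a.end)
        if key in seen:
            continue  # identical coordinates to an earlier read: treated as duplicate
        seen.add(key)
        kept.append(a)
    return kept


# Three UMI consensus reads, none flagged; two happen to share coordinates.
reads = [
    Alignment("consensus_1", "chr1", 100, 250, 0),
    Alignment("consensus_2", "chr1", 100, 250, 0),  # different UMI, same coordinates
    Alignment("consensus_3", "chr1", 120, 260, 0),
]

print(len(filter_flag_based(reads)))        # 3: nothing is flagged, nothing removed
print(len(filter_coordinate_based(reads)))  # 2: consensus_2 is removed
```

In this model, UMI-merged input with no BAM_FDUP flags passes through the flag-based filter untouched, while the coordinate-based filter can still drop reads that merely share coordinates.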

saty89 commented 2 weeks ago

Hi suhrig,

Thank you for your prompt response.

> If you use the flag -u, Arriba will remove alignments which have the flag BAM_FDUP (1024) set. If you don't use -u, then Arriba will remove alignments with identical start/end coordinates to another read. In your case, it sounds like there shouldn't be any reads with BAM_FDUP set, since duplicates were merged according to the UMIs. So Arriba should not filter any reads using the duplicates filter. Can you run samtools view -f1024 -c /path/to/your/file.bam. The output should be 0. If you can confirm this, then Arriba shouldn't remove any duplicates (when using -u). This would be a bug.

Just to clarify the second point: I was a little surprised when I ran tests with and without the -u option on input FASTQ files that were already UMI-processed (UMI reads output from a CLC workflow). The BAM input to Arriba is therefore the STAR alignment rather than a BAM file of my own.

When running the general workflow without the -u flag, Arriba still filtered out reads with the duplicates filter. That got me thinking: if Arriba uses both start and end coordinates to identify identical reads, then these duplicates must be biological (due to transcript expression abundance) rather than technical, if that makes sense. I wanted to ask for your expert opinion on whether such cases have been seen before with UMIs, or whether I am missing something else.

I ran the check on one STAR BAM file:

samtools view -f1024 -c Aligned.sortedByCoord.out.bam
0

When using the -u flag, I can confirm that Arriba did not filter out any reads with the duplicates filter. Thanks!

suhrig commented 2 weeks ago

> the duplicates must be biological (due to transcript expression abundance) rather than technical if that makes sense. Wanted to ask for your expert opinion if such cases have been seen before with UMIs or if I am missing something else.

Your assumption is correct. These are biological duplicates. This happens even when UMIs are used, typically in genes with high expression.
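To give a rough feel for why high expression makes coincidental coordinate matches likely, here is a back-of-the-envelope sketch based on the birthday problem. It assumes, purely for illustration, that every distinct molecule draws its (start, end) pair uniformly at random from a fixed number of possibilities. Real fragmentation is not uniform, and the value 10,000 is an arbitrary placeholder rather than a measured quantity.

```python
def prob_shared_coordinates(n_molecules: int, n_positions: int) -> float:
    """Probability that at least two of n_molecules share a (start, end) pair.

    Assumption (illustrative only): each molecule picks its coordinate pair
    uniformly and independently from n_positions possibilities.
    """
    if n_molecules > n_positions:
        return 1.0  # pigeonhole: a shared pair is unavoidable
    p_all_distinct = 1.0
    for i in range(n_molecules):
        p_all_distinct *= (n_positions - i) / n_positions
    return 1.0 - p_all_distinct


# More distinct molecules from the same transcript -> higher chance of a match.
for n in (10, 100, 1000):
    print(n, round(prob_shared_coordinates(n, 10_000), 3))
```

Under this simplified model the probability climbs quickly with the number of molecules, which is consistent with biological duplicates showing up mainly in highly expressed genes.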

saty89 commented 2 weeks ago

Thanks for confirming the observation. Regards!

suhrig / arriba

Analysing UMI processed reads through Arriba #252