suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
227 stars 49 forks source link

Frequence of the split reads #247

Closed litun-fkby closed 3 months ago

litun-fkby commented 4 months ago

Hi,i use the arriba to detect the fusion and arriba is useful , and i also have some question about the arriba . The option -U MAX_READS :Subsample fusions with more than the given number of supporting reads,default 300,and i notice the max value is about 30000,so the value of split_read1 and split_read2 is changed by this option,but the value of coverage1 and coverage2, which coverage calculation counts all fragments (even duplicates) and no subsample , so the frequence of the split reads changed by this option . So , can arriba have a option to calculate coverage use the same conditions as the split_reads , such as both subsample reads and filter the duplicates.

suhrig commented 4 months ago

Arriba does not provide an option for what you are asking, unfortunately. However, you can achieve it using the following manual steps:

  1. Subsampling really is just capping. If you cap the values in coverage1/2 to 300, then you do the same as Arriba does for the supporting reads.

  2. Perform duplicate detection prior to passing the BAM file to Arriba, for example using sambamba markdup or Picard MarkDuplicates and remove them from the BAM file. This way, Arriba will never see the duplicates and won't count them anywhere.

litun-fkby commented 4 months ago

the fastq data produce by the amplicon,so if i using "sambamba markdup" the most reads will be filtered . If i use the option -u with no markdup bam, the Arriba will use all reads to detect fusion ? The coverage1/2 include split1/2 or just the reads near breakpoint with no split reads which span the breakpoint?

suhrig commented 4 months ago

coverage1/2 always counts reads in the BAM file - regardless of whether they were marked as duplicates. If you don't want coverage1/2 to count them, they must be removed from the BAM (not just marked).

split_reads1/2 ignores duplicates unless the duplicates filter was disabled using -f duplicates. Which reads Arriba considers as duplicates depends on the parameter -u. Without the parameter, Arriba performs duplicate detection internally. With the parameter, Arriba relies on duplicates being marked as such inside the BAM file by a preceding tool.

If i use the option -u with no markdup bam, the Arriba will use all reads to detect fusion?

Yes. That's in fact the same as disabling duplicate filtering via -f duplicates as described above.

The coverage1/2 include split1/2 or just the reads near breakpoint with no split reads which span the breakpoint?

Clipped reads are ignored by coverage1/2. Admittedly, the name may be a bit misleading here, since it doesn't capture the full coverage, but only "high-quality" alignments. Reads with clipped segments are ignored, because they are usually poor/incorrect alignments. In hindsight, I probably should have implemented this differently, because it confuses some users. It's similar with the split_reads1/2, which also only count high-quality chimeric reads and put others in the filters column. The complete read count is actually the sum of split_reads1/2 and filters. There are some technical reasons for this. In practice, this doesn't matter so much.

If you have amplicon data where the start and end coordinates are the same for all reads and you have UMIs, it is best to perform duplicate marking prior to running Arriba and removing all duplicates as determined by the UMIs. This way, split_reads1/2 and coverage1/2 are most comparable and fusion calling should be most reliable.

litun-fkby commented 3 months ago

Thank you for reply!