suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
214 stars 50 forks source link

Recommendation for samtools bam>fastq conversion for Arriba #207

Closed min-codes closed 9 months ago

min-codes commented 11 months ago

Hi @suhrig ,

Thank you for developing Arriba with such comprehensive documentation.

I am planning to run Arriba using TCGA RNAseq data, and currently, only the genomic BAMs are available as they have stopped maintaining the GDC legacy archive. I'm looking into using samtools fastq to convert BAM to FASTQ, but unsure if it's best for fusion detection to retain all reads, or only the properly mapped/ paired ones. If I understand fusion detection correctly I'd think that the unmapped ones are probably more important. How about unpaired reads and non-primary alignments?

ie. How important are these reads for fusion detection? Would it be best to keep all of them (assuming that the recommended STAR parameters would already handle these), or is any redundant/ irrelevant?

(I ran samtools flagstat on one of the genomic BAM, for your reference. It seems that the duplicates were already removed.)

160199987 + 0 in total (QC-passed reads + QC-failed reads)
144929586 + 0 primary
15161002 + 0 secondary
109399 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
158767693 + 0 mapped (99.11% : N/A)
143497292 + 0 primary mapped (99.01% : N/A)
144929586 + 0 paired in sequencing
72464793 + 0 read1
72464793 + 0 read2
141687912 + 0 properly paired (97.76% : N/A)
142711464 + 0 with itself and mate mapped
785828 + 0 singletons (0.54% : N/A)
665464 + 0 with mate mapped to a different chr
665464 + 0 with mate mapped to a different chr (mapQ>=5)

Thanks in advance, Min

suhrig commented 11 months ago

Hi Min,

For fusion detection, you need all mapped and unmapped paired-end reads, but only the primary alignments, i.e., no secondary hits or supplementary alignments. The latter are not required, because they will be regenerated during realignment. They are only additional alignments of the same read.

The stats from the file you showed have some singletons. Those could potentially be useful for fusion detection, but would be lost during conversion to FastQ, because STAR cannot deal with a mixture of paired-end and single-end reads. They make up only a small fraction of the reads, so it should be fine.

I typically use one of the following two commands to convert BAM files back to FastQ for reprocessing:

bamtofastq filename=/path/to/bam_file F=read1.fastq.gz F2=read2.fastq.gz gz=1 disablevalidation=1 S=singletons.fastq.gz > other.fastq.gz
samtools collate -f -O -u -r 1000000 /path/to/bam_file |
samtools fastq -0 other.fastq.gz -1 read1.fastq.gz -2 read2.fastq.gz -s singletons.fastq.gz

You can then pass on the files read1.fastq.gz and read2.fastq.gz to run_arriba.sh.

In case you haven't noticed, there is a script to run Arriba on BAMs that is more efficient than converting to FastQ and realigning everything. However, it can only be used if you want to align to an assembly with compatible coordinates, e.g. from hg19 to GRCh37. It can't be used if the files are in hg19 and you want to use hg38.

Lastly, depending on how the TCGA BAM file was created, you may be able to run Arriba directly on it. Can you paste the output of the following command?

samtools view -H /path/to/bam_file | grep '^@PG'

It seems that the duplicates were already removed.

Maybe they weren't removed but simply not marked. Either way, they are irrelevant to fusion detection and would have been removed by Arriba anyway.

Regards, Sebastian