Recommendation for samtools bam>fastq conversion for Arriba

suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

Other

214 stars 50 forks source link

160199987 + 0 in total (QC-passed reads + QC-failed reads) 144929586 + 0 primary 15161002 + 0 secondary 109399 + 0 supplementary 0 + 0 duplicates 0 + 0 primary duplicates 158767693 + 0 mapped (99.11% : N/A) 143497292 + 0 primary mapped (99.01% : N/A) 144929586 + 0 paired in sequencing 72464793 + 0 read1 72464793 + 0 read2 141687912 + 0 properly paired (97.76% : N/A) 142711464 + 0 with itself and mate mapped 785828 + 0 singletons (0.54% : N/A) 665464 + 0 with mate mapped to a different chr 665464 + 0 with mate mapped to a different chr (mapQ>=5)

Hi Min,

For fusion detection, you need all mapped and unmapped paired-end reads, but only the primary alignments, i.e., no secondary hits or supplementary alignments. The latter are not required, because they will be regenerated during realignment. They are only additional alignments of the same read.

The stats from the file you showed have some singletons. Those could potentially be useful for fusion detection, but would be lost during conversion to FastQ, because STAR cannot deal with a mixture of paired-end and single-end reads. They make up only a small fraction of the reads, so it should be fine.

I typically use one of the following two commands to convert BAM files back to FastQ for reprocessing:

bamtofastq filename=/path/to/bam_file F=read1.fastq.gz F2=read2.fastq.gz gz=1 disablevalidation=1 S=singletons.fastq.gz > other.fastq.gz

samtools collate -f -O -u -r 1000000 /path/to/bam_file |
samtools fastq -0 other.fastq.gz -1 read1.fastq.gz -2 read2.fastq.gz -s singletons.fastq.gz

You can then pass on the files read1.fastq.gz and read2.fastq.gz to run_arriba.sh.

In case you haven't noticed, there is a script to run Arriba on BAMs that is more efficient than converting to FastQ and realigning everything. However, it can only be used if you want to align to an assembly with compatible coordinates, e.g. from hg19 to GRCh37. It can't be used if the files are in hg19 and you want to use hg38.

Lastly, depending on how the TCGA BAM file was created, you may be able to run Arriba directly on it. Can you paste the output of the following command?

samtools view -H /path/to/bam_file | grep '^@PG'

It seems that the duplicates were already removed.

Maybe they weren't removed but simply not marked. Either way, they are irrelevant to fusion detection and would have been removed by Arriba anyway.

Regards, Sebastian

suhrig / arriba

Recommendation for samtools bam>fastq conversion for Arriba #207