suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data
Other
214 stars 50 forks source link

Arriba and STARfuison align #239

Closed xiucz closed 3 weeks ago

xiucz commented 1 month ago

Hi, The picture below illustrates the differences between the alignment results of Arriba (v2.4.0, STAR2.7.8a) and Starfusion (v1.10.0, STAR2.7.8a). In Starfusion's BAM file, there are numerous softclip reads that support the TIMM23--LINC00843 fusion events (chr10:51606988--chr10:51732772). It appears that the parameter settings used by Arriba's STAR alignment discarded many of these reads. Could you provide me with some suggestions?

arriba code(use the raw run_arriba.sh script)

arriba_v2.4.0//run_arriba.sh ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/STAR_index_hg19_GENCODE37lift37/ ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/GENCODE37lift37.gtf ~/arriba_v2.4.0//database//hg19_GENCODE37lift37/hg19.fa ~/arriba_v2.4.0//database//blacklist_hg19_hs37d5_GRCh37_v2.4.0.tsv ~/database//known_fusions_hg19_hs37d5_GRCh37_v2.4.0.tsv ~/arriba_v2.4.0//database//protein_domains_hg19_hs37d5_GRCh37_v2.4.0.gff3 6 ~/trim/A2_1.trim.fastq.gz ~/trim/A2_2.trim.fastq.gz

STARfusion code

~/STAR-Fusion-v1.10.0/STAR-Fusion --left_fq ~/trim/A2_1.trim.fastq.gz --right_fq ~/trim/A2_2.trim.fastq.gz --genome_lib_dir ~/hg19_GENCODE37lift37/STARFusion/GRCh37_gencode_v19_CTAT_lib_Mar012021.source/ctat_genome_lib_build_dir --CPU 6 --STAR_PATH ~/STAR-2.7.8a/bin/Linux_x86_64_static/STAR --examine_coding_effect --tmpdir  ~/tmp/ --output_dir STARfusion

image

Best, xiucz

suhrig commented 1 month ago

Hi, could you kindly pick one read which is aligned by STAR-Fusion but not by Arriba and extract the paired-end reads from the Arriba BAM file using for example samtools view arriba_alignments.bam | grep -w READ_NAME? I'm curious if the reads are aligned at all. Moreover, if I have the read sequence, then I can try to figure out which parameters are responsible. Thanks!

xiucz commented 1 month ago

Thanks for your reply. I grep one read here. arriba

$ samtools view Aligned.sortedByCoord.out.bam|grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367 99  chr10   51381825    255 22M2452N59M3295N69M =   51387704    351037  GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG  FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF  NH:i:1  HI:i:1  AS:i:295    nM:i:2  NM:i:2
V350121616L4C005R0660558367 147 chr10   51387704    255 60M345008N90M   =   51381825    -351037 AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG  9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF  NH:i:1  HI:i:1  AS:i:295    nM:i:2  NM:i:0

STARfusion

 $ samtools view Aligned.out.sort.bam |grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367 355 chr10   51381825    3   22M2452N59M3295N69M =   51387704    5939    GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG  FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF  NH:i:2  HI:i:2  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:+
V350121616L4C005R0660558367 403 chr10   51387704    3   60M90S  =   51381825    -5939   AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG  9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF  NH:i:2  HI:i:2  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:+
V350121616L4C005R0660558367 163 chr10   51606988    3   90S60M  =   51607030    5938    CACCAGTTGGGATTGACACACACCTCTACAGCTGGGTTCAGGAGCCAGACAGCAGCACACAGAACGGGGAATCCAAAGCCATCTCTGACACTGTACATTTATACAACATGCCTGTCATGGTTCCAGCTGCTACTGTGTTAAGGTCATCTT  FGEFGGFFGDFFFFFAGGGFFEGGGGGBGGGFGCFGFFGGGFFGFFFGEGFFGFGGEEGGFF>GFFFG<FGGFGEFDGFFDGGCFGF7GGFGEFEGGBGGCGGGFFFFFEFGFFFCGGCGFCGGGGFFEDFAEFFGFFFABFFGGFGG@9  NH:i:2  HI:i:1  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:-
V350121616L4C005R0660558367 83  chr10   51607030    3   69M3294N59M2452N22M =   51606988    -5938   CTGTGTTAAGGTCATCTTCTGCACCTCGTGTTTTCTCAATGATGACACCAAATGCACTATAGAGCAACGCCAGAGAACCTAGAGTATTAGCCCAAAGTGCCCCTTGCCTAGTCACCATATTCAAAATCTGTACATTTCCTGGTTTGGACC  FGFFFFGGGF?FF=AFFFFGFFFFGEFFFFFFFFFFFFFDFFCFFGFFFGFGFFFFGFGAFGFFEFFEFFGG<GFGFFFCFGFFFGFGFFFFGFGFFFDFDDFFGFFFGFFGFFFF??GGFFFFFEFFDFBFFGFFF<FFF>FFGGFFFF  NH:i:2  HI:i:1  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:-

And here is a zoom-in screenshot, image Best, xiucz

xiucz commented 1 month ago

Thanks for your reply, I grep one soft-clip read in STARfusion. arriba

$samtools view Aligned.sortedByCoord.out.bam|grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367 99  chr10   51381825    255 22M2452N59M3295N69M =   51387704    351037  GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG  FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF  NH:i:1  HI:i:1  AS:i:295    nM:i:2  NM:i:2
V350121616L4C005R0660558367 147 chr10   51387704    255 60M345008N90M   =   51381825    -351037 AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG  9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF  NH:i:1  HI:i:1  AS:i:295    nM:i:2  NM:i:0

STARfusion

$ samtools view Aligned.out.sort.bam |grep  V350121616L4C005R0660558367
V350121616L4C005R0660558367 355 chr10   51381825    3   22M2452N59M3295N69M =   51387704    5939    GGTCCAAACCAGGAAATGTACAGATTTTGAATATGGTGACTAGGCAAGGGGCACTTTGGGCTAATACTCTAGGTTCTCTGGCGTTGCTCTATAGTGCATTTGGTGTCATCATTGAGAAAACACGAGGTGCAGAAGATGACCTTAACACAG  FFFFGGFF>FFF<FFFGFFBFDFFEFFFFFGG??FFFFGFFGFFFGFFDDFDFFFGFGFFFFGFGFFFGFCFFFGFG<GGFFEFFEFFGFAGFGFFFFGFGFFFGFFCFFDFFFFFFFFFFFFFEGFFFFGFFFFA=FF?FGGGFFFFGF  NH:i:2  HI:i:2  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:+
V350121616L4C005R0660558367 403 chr10   51387704    3   60M90S  =   51381825    -5939   AAGATGACCTTAACACAGTAGCAGCTGGAACCATGACAGGCATGTTGTATAAATGTACAGTGTCAGAGATGGCTTTGGATTCCCCGTTCTGTGTGCTGCTGTCTGGCTCCTGAACCCAGCTGTAGAGGTGTGTGTCAATCCCAACTGGTG  9@GGFGGFFBAFFFGFFEAFDEFFGGGGCFGCGGCFFFGFEFFFFFGGGCGGBGGEFEGFGG7FGFCGGDFFGDFEGFGGF<GFFFG>FFGGEEGGFGFFGEGFFFGFFGGGFFGFCGFGGGBGGGGGEFFGGGAFFFFFDGFFGGFEGF  NH:i:2  HI:i:2  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:+
V350121616L4C005R0660558367 163 chr10   51606988    3   90S60M  =   51607030    5938    CACCAGTTGGGATTGACACACACCTCTACAGCTGGGTTCAGGAGCCAGACAGCAGCACACAGAACGGGGAATCCAAAGCCATCTCTGACACTGTACATTTATACAACATGCCTGTCATGGTTCCAGCTGCTACTGTGTTAAGGTCATCTT  FGEFGGFFGDFFFFFAGGGFFEGGGGGBGGGFGCFGFFGGGFFGFFFGEGFFGFGGEEGGFF>GFFFG<FGGFGEFDGFFDGGCFGF7GGFGEFEGGBGGCGGGFFFFFEFGFFFCGGCGFCGGGGFFEDFAEFFGFFFABFFGGFGG@9  NH:i:2  HI:i:1  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:-
V350121616L4C005R0660558367 83  chr10   51607030    3   69M3294N59M2452N22M =   51606988    -5938   CTGTGTTAAGGTCATCTTCTGCACCTCGTGTTTTCTCAATGATGACACCAAATGCACTATAGAGCAACGCCAGAGAACCTAGAGTATTAGCCCAAAGTGCCCCTTGCCTAGTCACCATATTCAAAATCTGTACATTTCCTGGTTTGGACC  FGFFFFGGGF?FF=AFFFFGFFFFGEFFFFFFFFFFFFFDFFCFFGFFFGFGFFFFGFGAFGFFEFFEFFGG<GFGFFFCFGFFFGFGFFFFGFGFFFDFDDFFGFFFGFFGFFFF??GGFFFFFEFFDFBFFGFFF<FFF>FFGGFFFF  NH:i:2  HI:i:1  AS:i:207    nM:i:2  RG:Z:GRPundef   XS:A:-

Alignment stat: image And here is zoom-in screenshot for the select read: image Best, xiucz

suhrig commented 1 month ago

Thanks for providing the details.

Without any further evidence, I would argue that Arriba is right here. It finds a better alignment than STAR-Fusion. The alignment proposed by Arriba is linear and even matches an annotated transcript of a gene perfectly (namely, NR_158654.1 of gene TIMM23B-AGAP6). The alignment proposed by STAR-Fusion has the same number of mismatches, but it is chimeric and therefore has a poorer alignment score (which is why Arriba discards it).

The reason why STAR-Fusion prefers the chimeric alignment over the linear one is simply because it uses a very low value for --alignIntronMax (if I remember correctly 100kb). This value is not biologically meaningful, since many splicing events are larger than this. STAR-Fusion has to use such a low value, because otherwise it would not be able to detect fusions from focal deletions. Arriba does not have this limitation and can use larger values for --alignIntronMax. For this reason, it finds the linear alignment, which is most likely an effect of normal splicing.

Do you have any reason to believe that this fusion is real? For example, do you have structural variants from whole-genome sequencing data to confirm this fusion? If not, then I find it likely that the fusion prediction made by STAR-Fusion is wrong in this case.

xiucz commented 1 month ago

Thank you, and I will provide further information if I have any.

xiucz commented 3 weeks ago

In the end, I could not further validate this sample. However, the same fusion event was detected in the WBC of the patient, leading me to believe that STARfusion's fusion event is erroneous. Thank you for your assistance.

suhrig commented 3 weeks ago

Thanks for your feedback!