Missing Transcript IDs in ARRIBA Output Files

suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

Other

214 stars 50 forks source link

Missing Transcript IDs in ARRIBA Output Files #211

Closed bioPG closed 8 months ago

bioPG commented 10 months ago

Title: Missing Transcript IDs in ARIBA Output Files

Description: While using ARIBA, I have noticed an issue. In the two output files generated by ARIBA, namely <sample>.arriba.fusions.tsv and <sample>.arriba.fusions.discarded.tsv, only the transcript_id1 and transcript_id2 fields in <sample>.arriba.fusions.tsv have corresponding transcript IDs, whereas the transcript_id1 and transcript_id2 fields in the <sample>.arriba.fusions.discarded.tsv file do not have associated values.

Issue Details:

The transcript_id1 and transcript_id2 fields have associated transcript IDs in <sample>.arriba.fusions.tsv.
The transcript_id1 and transcript_id2 fields do not have associated transcript IDs in <sample>.arriba.fusions.discarded.tsv.
Is there a parameter that can be set to also output the corresponding transcript IDs in the <sample>.arriba.fusions.discarded.tsv file?

Environment and Version Information:

ARIBA Version: 2.4.0
Operating System: Linux

Expected Behavior: I would like to understand if there is a way, either through parameter settings or another method, to output the associated transcript IDs with the transcript_id1 and transcript_id2 fields in the <sample>.arriba.fusions.discarded.tsv file.

Additional Information: A screenshot of the content of the <sample>.arriba.fusions.tsv file is shown below:

A screenshot of the content of the <sample>.arriba.fusions.discarded.tsv file is shown below:

Thank you for your assistance!

suhrig commented 10 months ago

Thank you for the detailed issue report.

Some columns which are computationally demanding and/or lengthy are not populated in the discarded file, because the fusion candidates in this file are mostly irrelevant anyway. You can run Arriba with the switch -X to populate them. Beware that it may take a while to generate the output file and it will be quite big, then.

On an unrelated note: May I ask which tool you use to display tables on the command-line with aligned columns?

bioPG commented 10 months ago

Thank you very much for your prompt response to my questions. I may need to recover some specific events from the <sample>.arriba.fusions.discarded.tsv file, along with their corresponding transcript IDs, so I will try the -X parameter to retrieve them. By the way, I use the csvtk pretty command to view CSV and TSV files. For more details, you can refer to https://github.com/shenwei356/csvtk. You can think of it as the equivalent of R tidyverse in Linux.

bioPG commented 9 months ago

Hello, following your suggestion, I used the -X parameter to process data from two sequencing batches. Currently, all samples have been successfully analyzed except for one, which encountered an analysis failure. The tail error message for the failed sample is as follows:

WARNING: encountered early stop codon in transcript NM_001301020 at amino acid 67 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_201563 at amino acid 57 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_201563 at amino acid 57 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_201563 at amino acid 57 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_201563 at amino acid 57 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_145728 at amino acid 272 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_145728 at amino acid 272 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_145728 at amino acid 272 (error in GTF file?) => predicted peptide sequence may be wrong
WARNING: encountered early stop codon in transcript NM_145728 at amino acid 272 (error in GTF file?) => predicted peptide sequence may be wrong
.command.sh: line 17: 20965 Segmentation fault      arriba -x NL25RF0125.Aligned.out.bam -a hs37d5.fa -g RefSeq.gtf -o NL25RF0125.arriba.fusions.tsv -O NL25RF0125.arriba.fusions.discarded.tsv -b blacklist_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz -k known_fusions_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz -X -p protein_domains_hg19_hs37d5_GRCh37_v2.4.0.gff3

Running script:

#!/bin/bash -euo pipefail
if [ $(samtools view -f 2048 NL25RF0125.Aligned.out.bam | head | wc -l) -gt 0 ];then
    arriba \
        -x NL25RF0125.Aligned.out.bam \
        -a hs37d5.fa \
        -g RefSeq.gtf \
        -o NL25RF0125.arriba.fusions.tsv \
        -O NL25RF0125.arriba.fusions.discarded.tsv \
        -b blacklist_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz \
        -k known_fusions_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz \
        -X \
         \
        -p protein_domains_hg19_hs37d5_GRCh37_v2.4.0.gff3 \

else
    touch NL25RF0125.arriba.fusions.tsv NL25RF0125.arriba.fusions.discarded.tsv
fi

cat <<-END_VERSIONS > versions.yml
"NFCORE_RNAFUSION:RNAFUSION:ARRIBA_WORKFLOW:ARRIBA":
    arriba: $(arriba -h | grep 'Version:' 2>&1 |  sed 's/Version: //')
END_VERSIONS

suhrig commented 9 months ago

Is it possible to share the BAM file with me? This would make debugging easier. If not, I can send you instructions for collecting debug info.

Thank you for showing me csvtk. This tool looks awesome!

bioPG commented 9 months ago

I'm truly sorry, but providing clinical sequencing data for a patient without their explicit consent exceeds my authority. However, if you could offer guidance or methods for collecting debug information, I will do my utmost to follow your instructions and provide feedback. Thank you for your understanding.

suhrig commented 9 months ago

Here are the commands to compile Arriba with debug information and to obtain a stack trace.

$ git clone https://github.com/suhrig/arriba
$ cd arriba
$ make 'CXXFLAGS=-g -Wall -Wno-parentheses -pthread -std=c++0x -O2'
$ gdb ./arriba
(gdb) run -x rna.bam ...
Segmentation Fault
(gdb) bt

Explanation:

Download the arriba source code.
Compile with debug info (-g).
Launch the GNU debugger (gdb) with the newly compiled arriba as an argument.
Inside gdb, type run followed by all the arguments that you would pass to arriba such as -x rna.bam.
Wait for the segfault to happen.
Type bt to print a stack trace.

Please send me the output of the bt command.

bioPG commented 9 months ago

This is my script:

(gdb) run -x /data-turbo/RNA-SEQ_result/230913_A00932_0919_AHGK2NDRX3/work/e4/50a8dd250c91c986eb97229c1d8a6c/NL25RF0125.Aligned.out.bam -a /data2/dev_projects/dujun/database/rnafusion/hs37d5/hs37d5.fa -g /data2/dev_projects/dujun/database/rnafusion/gtf/RefSeq/RefSeq_hg19.gtf -o /data-turbo/RNA-SEQ_result/230913_A00932_0919_AHGK2NDRX3/work/9c/66e814261997d92cf505dd8451a2d8/NL25RF0125.arriba.fusions.tsv -O /data-turbo/RNA-SEQ_result/230913_A00932_0919_AHGK2NDRX3/work/9c/66e814261997d92cf505dd8451a2d8/NL25RF0125.arriba.fusions.discarded.tsv -b /data2/dev_projects/dujun/database/rnafusion/arriba/blacklist_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz -k /data2/dev_projects/dujun/database/rnafusion/arriba/known_fusions_hg19_hs37d5_GRCh37_v2.4.0.tsv.gz -p /data2/dev_projects/dujun/database/rnafusion/arriba/protein_domains_hg19_hs37d5_GRCh37_v2.4.0.gff3 -X

The output of the bt command is:

(gdb) bt
#0  0x0000000000483a6f in assign (__c=<error reading variable: Cannot access memory at address 0x7ffff6e31dfe>, __n=1, 
    this=0x7fffffffc1a0) at /usr/include/c++/4.8.2/bits/basic_string.h:1145
#1  operator= (__c=<error reading variable: Cannot access memory at address 0x7ffff6e31dfe>, this=0x7fffffffc1a0)
    at /usr/include/c++/4.8.2/bits/basic_string.h:567
Python Exception <type 'exceptions.ValueError'> Cannot find type const pileup_t::_Rep_type: 
#2  get_sequence_from_pileup (pileup=std::map with 16 elements, breakpoint=98962917, direction=false, gene=0x4d27ae0, 
    assembly=std::unordered_map with 25 elements = {...}, sequence="", positions=std::vector of length 0, capacity 0, 
    clipped_sequence="") at source/output_fusions.cpp:156
#3  0x0000000000487c6c in get_fusion_transcript_sequence (fusion=..., 
    assembly=std::unordered_map with 25 elements = {...}, sequence=".", positions=std::vector of length 0, capacity 0)
    at source/output_fusions.cpp:288
#4  0x000000000048cff9 in write_fusions_to_file (fusions=std::unordered_map with 3917198 elements = {...}, 
    output_file="/data-turbo/RNA-SEQ_result/230913_A00932_0919_AHGK2NDRX3/work/9c/66e814261997d92cf505dd8451a2d8/NL25RF0125.arriba.fusions.discarded.tsv", coverage=..., assembly=std::unordered_map with 25 elements = {...}, 
    gene_annotation_index=..., exon_annotation_index=..., 
    original_contig_names=std::vector of length 115, capacity 115 = {...}, tags=std::unordered_map with 0 elements, 
    protein_domain_annotation_index=..., max_mate_gap=max_mate_gap@entry=50, max_itd_length=max_itd_length@entry=100, 
    print_extra_info=print_extra_info@entry=true, fill_sequence_gaps=fill_sequence_gaps@entry=false, 
    write_discarded_fusions=write_discarded_fusions@entry=true) at source/output_fusions.cpp:1133
#5  0x000000000040b023 in main (argc=<optimized out>, argv=<optimized out>) at source/arriba.cpp:609

suhrig commented 9 months ago

Here is a patch, which should prevent the segfault I hope. May I kindly ask you to run the above instructions again, but this time insert the following command before compilation (i.e., before running make)?

sed -i.bak '155s/)$/ \&\& (unsigned int) position->first < contig_sequence->second.size())/' source/output_fusions.cpp

The issue seems to be that the code tries to access a region beyond the end of a chromosome (on line 156 of output_fusions.cpp). The patch simply adds a check which avoids accesses beyond chromosome boundaries. This does not really fix the underlying issue. Without the sequencing data, it's hard to find the root cause. But this workaround should get the job done. The side-effects are minimal: it simply cannot determine if a base in the fusion sequence is a reference mismatch, so some bases of the fusion sequence will be shown as lowercase letters even if they match the reference genome assembly.

With the above patch, hopefully there won't be a segfault. If there still is one, please run bt again and send me the output. It could be that the underlying issue causes another problem further down in the code. Again, without the data this is hard to foresee.

If you confirm that the above patch fixes your crash, I will merge it into the code base. Thanks for your help!

bioPG commented 9 months ago

I'm delighted to inform you that my issue has been resolved. When I used the latest batch processing, Arriba completed the analysis successfully without any interruptions or error reports. Below is the output of the bt command, and I believe everything is fine now. I truly appreciate your dedicated assistance in resolving this problem.

(gdb) bt
No stack.

By the way, when can I use conda to install this latest version after you merge this batch into the codebase?

suhrig commented 9 months ago

I'm glad to hear this fixed your problem.

Before this will become available via conda, I need to bring out the next release. I don't have a timeline for this yet. I haven't had lots of time to work on Arriba this year and people keep asking me about the next release. Maybe I should abandon the idea of a the next release being a feature release and should just release a bug fix version.

bioPG commented 9 months ago

You could also consider releasing a version that includes only bug fixes, such as 2.4.1. A similar approach has already been applied in software like STAR and GATK, and I believe it is feasible. Of course, this is just my personal suggestion, and I highly respect your decision. Regardless, I am very grateful for your work because your software has greatly benefited me.