suhrig / arriba

Fast and accurate gene fusion detection from RNA-Seq data

adding transcript IDs to the output #53

Closed MariusGheorghe closed 4 years ago

MariusGheorghe commented 4 years ago

Hi Sebastian,

I was wondering: before the new major release is out, would it be easy/straightforward to "just" add the transcript IDs to the output of Arriba? This is an essential piece of information for our downstream analyses, and it is currently quite cumbersome and resource-intensive to reverse-engineer the transcript IDs from the breakpoints and gene names in the input data.
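To illustrate the kind of reverse-engineering involved, here is a rough sketch using a single toy GTF line; the gene name, coordinates, and transcript ID are all made up for the example:

```shell
# Toy GTF line (tab-separated) standing in for the real annotation;
# gene name, transcript ID, and coordinates are all made up.
printf '21\thavana\ttranscript\t100\t500\t.\t+\t.\tgene_name "TMPRSS2"; transcript_id "ENST00000332149";\n' > toy.gtf

# Given a breakpoint (chromosome 21, position 300) and a gene name,
# report the IDs of all transcripts overlapping the breakpoint.
awk -v c=21 -v p=300 -v g=TMPRSS2 -F'\t' '
    $1 == c && $3 == "transcript" && $4 + 0 <= p + 0 && p + 0 <= $5 + 0 && index($9, g) {
        n = split($9, parts, "\"")
        for (i = 1; i < n; i++)
            if (parts[i] ~ /transcript_id /)
                print parts[i + 1]
    }' toy.gtf
```

Doing this for every reported fusion against the full annotation is what makes the workaround so cumbersome.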

Thank you very much in advance!

Marius

suhrig commented 4 years ago

Hi Marius,

I am already working on this, since it was requested just recently in issue #51. I'll push the code to the development branch next week (hopefully). I'll keep you posted.

Regards, Sebastian

MariusGheorghe commented 4 years ago

Hi Sebastian,

Thank you for the quick reply. I was not aware it had already been asked in a different issue; I skimmed the issue titles and did not find one specifically addressing this point. Sorry for the potential duplicate.

That is really good news :] As a side note, would it be possible/easy to also display the position in the original protein at which the reported peptide sequence starts? Or, if easier, the entire protein sequence? That would also be a very helpful piece of information. I am sorry if I seem too demanding :D

Thank you very much in advance and looking forward to your update.

Marius

suhrig commented 4 years ago

Was not aware it was already asked in a different issue.

No worries. I did not mean to criticize. In fact, I consider it a positive sign if I get the same request from different people, because it means that the issue is important.

As a side note, is it possible/easy enough to display also at which position in the original protein, the reported peptide sequence starts?

If anything, I would instruct Arriba to print the full sequence, because its interpretation is intuitive. This could be tricky, though; I will need to think about it after another look at the code, so please give it some time. May I ask how you would use this information? Most users are only interested in the junction peptide sequence, because it is the most relevant to fusion-derived neoantigen prediction.

MariusGheorghe commented 4 years ago

Hi again,

Well, we wanted to include some protein-level features in our model; that's why knowing where the reported peptide sequence starts would be useful. But if outputting the entire protein sequence (as well) is easier, that is useful too. Thank you again for your time.

Marius

suhrig commented 4 years ago

I just pushed an enhanced version of Arriba to the develop branch. It includes two new columns, transcript_id1 and transcript_id2, which should contain the information you need. You can obtain the new version with these commands:

git clone https://github.com/suhrig/arriba.git
cd arriba
git checkout develop
make

The parameter -P must be specified for the transcript ID columns to be filled.

Regarding your second request: I have not yet found the time to report the full protein sequence. I will keep you updated.

Please be aware that this is a development version of Arriba. Although it should run fine in practice, it has not been tested as thoroughly as official releases. Also, there will definitely be further changes to the output format in the next official release (new columns added, some rearranged), so please be flexible with regard to such changes.

MariusGheorghe commented 4 years ago

Hi Sebastian,

Cool, that's great news :] I will give it a try and let you know if everything is OK. It might take some days, as the data is pretty big.

Of course, I will keep in mind that it's a dev version, and yes, we are planning to write a flexible parser for the Arriba output, so changing column order/names shouldn't be that big of a deal.
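Such a flexible parser can select columns by header name rather than by fixed position. A minimal sketch; the fusions.tsv content below is a toy stand-in, and the transcript ID values are made up:

```shell
# Toy fusions.tsv with only the columns relevant here (made-up values)
printf 'gene1\tgene2\ttranscript_id1\ttranscript_id2\nTMPRSS2\tERG\tENST00000332149\tENST00000288319\n' > toy_fusions.tsv

# Select columns by header name, so the parser survives column reordering
awk -F'\t' 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }
            NR > 1  { print $col["transcript_id1"], $col["transcript_id2"] }' toy_fusions.tsv
```

Because the header row is mapped to column indices at runtime, adding or rearranging columns in a future release would not break the extraction.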

Looking forward to the next big release with new features and eventually the full protein sequence :stuck_out_tongue_closed_eyes:

Thank you once again for your time and support. Have a good corona-free weekend.

Marius

MariusGheorghe commented 4 years ago

Hi Sebastian,

As promised, I'm getting back to you after trying the development branch. For one of our samples, the run finished successfully and the transcript IDs were indeed present in the output. This run was performed on paired-end FASTQ files.

I've launched a second run, this time on one huge FASTQ file (~165 GB), and all looked OK until:

.....
[2020-03-24T12:24:49] Filtering end-to-end fusions with low support (remaining=248)
[2020-03-24T12:24:49] Filtering fusions with no coverage around the breakpoints (remaining=232)
[2020-03-24T12:24:50] Indexing gene sequences
..... 
<stalled for like 2.5h and reached RAM usage >95% (86.7 GB / 88.5 GB) >
....
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
run_star_arriba.sh: line 57:   272 Done                    STAR --runMode alignReads --runThreadN 24 --genomeDir /ref/STAR_index_v2.5.3a --genomeLoad NoSharedMemory --readFilesIn "${rna1}" --readFilesCommand zcat --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --outStd BAM_Unsorted
       273 Aborted                 (core dumped) | $ARRIBA -x /dev/stdin -o /output/fusions.tsv -O /output/fusions.discarded.tsv -a $REFGENOME -g $REFGTF -b ${BLACKLIST} -T -P

but it didn't output anything:

root@6e92d9d11deb:/app/src# ll /output
total 8
drwxr-xr-x 2 root root 4096 Mar 24 11:37 ./
drwxr-xr-x 1 root root 4096 Mar 24 11:37 ../

Note that I was running it within a Docker container, but that was also the case for the first run, with the same container configuration.

Maybe this is unrelated and thus a different issue, but I wanted to let you know in case you have an idea.

In any case, I would say this issue is closed and once again I thank you for your time and support.

Marius

suhrig commented 4 years ago

[2020-03-24T12:24:50] Indexing gene sequences ..... <stalled for like 2.5h and reached RAM usage >95% (86.7 GB / 88.5 GB) >

This is unrelated to the transcript IDs, but nevertheless interesting from a stability point of view. 165 GB is quite something. This is RNA-Seq data, right? The biggest RNA sample I have tested Arriba on so far was about a third of that size, which I would already consider massive.

Can I ask you for a favor or two?

  1. Can you send me the complete output of Arriba? I want to see the total number of chimeric reads, the number of fusion candidates, and the time spent in each step.
  2. It's odd that the indexing step would consume so much memory when there are only 232 candidates left to filter. Finding out the root cause would require some debugging. If I sent you a debug version of Arriba that generates additional output during the filtering step, would you be willing to compile and run it for me on this sample? I have a suspicion about what might be going wrong here, but it would require me to take a look at the genomic coordinates of the candidates in this step. I can completely understand if you say no because you don't have the time for that.

MariusGheorghe commented 4 years ago

Sure, will do. I just deleted the file to make space on the VM for other analyses, but I can regenerate it :] I will try to launch it this evening if my other run is finished by then; if not, tomorrow.

To give you some details about the input file:

  1. Yes, it's RNA-seq.
  2. Initially this file was also created from paired-end reads in FASTQ format. We ran HiSAT2 on it (after quality trimming), and I've now converted it back from BAM to FASTQ using bamtools convert -in rna_hisat2_sort.bam -format fastq > rna_hisat2.fastq
  3. We fed it as a gzipped FASTQ into the STAR aligner.

BTW, do you think it is possible to use the output from HiSAT2 in Arriba instead of STAR? STAR is a lot more memory-intensive, as it holds a lot in memory instead of writing to temporary files from time to time, which is a bit of a pain for us.

As I mentioned, I launched the analysis in a Docker container that we normally use in production. I mounted the location of the new (development) Arriba version to use it for the run, but of course I had to recompile it inside the Docker image, as I got some libm.so line 6 error that seems to be gcc-related.

Now, as you said, I also found it weird that it spent 2.5 h on 232 hits, but I wanted to let it finish. There was no file whatsoever in the output folder, so I'll try to log everything that STAR/Arriba emits.

Will get back to you when it's done.

suhrig commented 4 years ago

Great, thanks! Can you replace the file source/filter_mismappers.cpp with this one before compiling? It prints some progress messages while Arriba is constructing the index. The progress messages contain the genomic ranges of the candidate fusion genes in the format <chromosome>:<start>-<end>. If you cannot paste this information here because it is confidential, it would suffice to tell me whether there are some large regions among them; by "large" I mean regions in the range of millions of bases. My suspicion is that there is one really big region (most likely the last one to be printed). If that's not the case, I won't bother you any longer, because without access to the raw data myself, the issue would be difficult to track down.

BTW, do you think it is possible to use the output from HiSAT2 in Arriba instead of STAR?

Unfortunately, no. Another user had the same request (#46) and wanted to open an issue with the developers of HiSat2; I am not sure if that has happened yet. Arriba was built for interoperability, since it supports the SAM format. Now the developers of HiSat2 have to do their part in making their tool standards-compliant.

MariusGheorghe commented 4 years ago

Hi Sebastian,

Ran again on the same file and here's the log:

[2020-03-25T10:30:28] Launching Arriba 1.2.0
[2020-03-25T10:30:28] Loading annotation from '/ref/Homo_sapiens.GRCh38.95.sort.gtf' 
[2020-03-25T10:30:36] Loading assembly from '/ref/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna' 
[2020-03-25T10:31:28] Reading chimeric alignments from '/dev/stdin' (total=WARNING: 408849 SAM records were malformed and ignored
2757407)
[2020-03-25T11:20:55] Detecting strandedness (reverse)
[2020-03-25T11:20:55] Assigning strands to alignments 
[2020-03-25T11:20:56] Annotating alignments 
[2020-03-25T11:21:31] Filtering duplicates (remaining=2481790)
[2020-03-25T11:21:36] Filtering mates which do not map to interesting contigs (1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y) (remaining=2400629)
[2020-03-25T11:21:37] Estimating fragment length WARNING: not enough chimeric reads to estimate mate gap distribution, using default values
[2020-03-25T11:21:38] Filtering read-through fragments with a distance <=10000bp (remaining=2297329)
[2020-03-25T11:21:39] Filtering inconsistently clipped mates (remaining=2297329)
[2020-03-25T11:21:40] Filtering breakpoints adjacent to homopolymers >=6nt (remaining=1747603)
[2020-03-25T11:21:45] Filtering fragments with small insert size (remaining=1747603)
[2020-03-25T11:21:46] Filtering alignments with long gaps (remaining=1747603)
[2020-03-25T11:21:47] Filtering fragments with both mates in the same gene (remaining=1746903)
[2020-03-25T11:21:48] Filtering fusions arising from hairpin structures (remaining=1689770)
[2020-03-25T11:21:49] Filtering reads with a mismatch p-value <=0.01 (remaining=838132)
[2020-03-25T11:21:59] Filtering reads with low entropy (k-mer content >=60%) (remaining=685497)
[2020-03-25T11:22:07] Finding fusions and counting supporting reads (total=WARNING: some fusions were subsampled, because they have more than 300 supporting reads
707601)
[2020-03-25T11:22:18] Merging adjacent fusion breakpoints (remaining=703560)
[2020-03-25T11:22:20] Filtering multi-mapping fusions by alignment score and read support (remaining=703560)
[2020-03-25T11:22:24] Estimating expected number of fusions by random chance (e-value) 
[2020-03-25T11:22:32] Filtering fusions with both breakpoints in adjacent non-coding/intergenic regions (remaining=698569)
[2020-03-25T11:22:33] Filtering intragenic fusions with both breakpoints in exonic regions (remaining=657566)
[2020-03-25T11:22:34] Filtering fusions with <2 supporting reads (remaining=36008)
[2020-03-25T11:22:34] Filtering fusions with an e-value >=0.3 (remaining=2460)
[2020-03-25T11:22:35] Filtering fusions with both breakpoints in intronic/intergenic regions (remaining=1628)
[2020-03-25T11:22:36] Filtering PCR/RT fusions between genes with an expression above the 99.8% quantile (remaining=1628)
[2020-03-25T11:22:41] Searching for fusions with spliced split reads (remaining=1688)
[2020-03-25T11:22:44] Selecting best breakpoints from genes with multiple breakpoints (remaining=1544)
[2020-03-25T11:22:46] Searching for fusions with >=4 spliced events (remaining=1554)
[2020-03-25T11:22:47] Filtering blacklisted fusions in '/new_arriba/arriba/database/blacklist_hg38_GRCh38_2018-11-04.tsv.gz' (remaining=458)
[2020-03-25T11:23:00] Filtering fusions with anchors <=23nt (remaining=289)
[2020-03-25T11:23:00] Filtering end-to-end fusions with low support (remaining=248)
[2020-03-25T11:23:01] Filtering fusions with no coverage around the breakpoints (remaining=232)
[2020-03-25T11:23:01] Indexing gene sequences 

<.... regions were here ....>

terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
run_star_arriba.sh: line 57:   461 Done                    STAR --runMode alignReads --runThreadN 24 --genomeDir /ref/STAR_index_v2.5.3a --genomeLoad NoSharedMemory --readFilesIn "${rna1}" --readFilesCommand zcat --outSAMtype BAM Unsorted --outSAMunmapped Within --outBAMcompression 0 --outFilterMultimapNmax 1 --outFilterMismatchNmax 3 --chimSegmentMin 10 --chimOutType WithinBAM SoftClip --chimJunctionOverhangMin 10 --chimScoreMin 1 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentReadGapMax 3 --outStd BAM_Unsorted
       462 Aborted                 (core dumped) | $ARRIBA -x /dev/stdin -o /output/fusions.tsv -O /output/fusions.discarded.tsv -a $REFGENOME -g $REFGTF -b ${BLACKLIST} -T -P

I've removed the regions from the log (data sensitivity, as you suspected :] ), but I've extracted the absolute difference between start and end. Here is a summary of the vector of region sizes:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    224   14440   34786   67154   77979  503030 

and the vector itself (104/232 entries):

7106
115878
80105
50276
49534
55187
204084
59285
106530
503030
18904
3522
59435
70013
49392
282017
22828
15623
2179
59345
33569
46272
29119
21625
55357
13145
30417
31386
35513
115348
254
4982
20912
65664
253
224
16372
35161
280865
261516
28750
158770
28309
49349
131900
21098
49666
36945
29726
13805
20256
35698
13577
9734
7479
51598
77270
31790
4789
14652
24308
173831
90615
52746
494278
197711
15694
13128
2461
7792
13726
917
3926
2353
22766
42974
15147
29545
87621
93202
87887
24223
43840
34750
5799
23390
8025
65590
260608
7839
38599
44442
34821
116118
2027
94224
878
1632
16496
210097
95667
100187
394500
160219
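
For reference, the sizes above were extracted from the (redacted) debug lines with a small shell pipeline along these lines; the two region strings below are toy stand-ins for the confidential ones, in the <chromosome>:<start>-<end> format described earlier:

```shell
# Toy region lines standing in for the confidential debug output
printf '1:100-224\n2:5000-19440\n' |
# split on ':' and '-', emit the absolute region size
awk -F'[:-]' '{ d = $3 - $2; print (d < 0 ? -d : d) }' |
sort -n |
# summarize the sorted sizes
awk '{ a[NR] = $1; s += $1 }
     END { printf "min=%d median=%d mean=%.1f max=%d\n",
                  a[1], a[int((NR + 1) / 2)], s / NR, a[NR] }'
```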

Looks like there's no very large region. Is the debug line printed before processing a region or after? Just asking to make sure.

Hope this helps with the debugging.

Marius

suhrig commented 4 years ago

Thanks, that was very helpful! It was indeed a bug, and I pushed a fix to the develop branch just now. The root cause wasn't a large region, as I initially suspected; rather, the read length was not estimated correctly (it was undefined, really). The read length is added as padding around the regions and therefore caused the crash in the indexing step.

Anyhow, now that I see the full log, there is definitely something wrong with your data, too, and you certainly want to fix it, because it impairs the quality of the fusion calling. It seems that when you convert the BAM file back to FastQ, a single-end FastQ file is created. You lose a lot of sensitivity when the paired-end information is discarded. I usually use bamtofastq from the biobambam package to convert BAM back to FastQ:

bamtofastq F=read1.fastq.gz F2=read2.fastq.gz collate=1 gz=1 filename=/path/to/your/bam_file.bam

It generates two FastQ files for paired-end data. The important bit is collate=1, which preserves the order of the mates.
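As a quick sanity check after the conversion, both mate files should contain the same number of reads. A minimal sketch with toy placeholder files (real paths will differ):

```shell
# Create two tiny gzipped FastQ files standing in for read1/read2, then
# verify that both contain the same number of records (4 lines per read).
printf '@r1\nACGT\n+\nIIII\n' | gzip > read1.fastq.gz
printf '@r1\nTGCA\n+\nIIII\n' | gzip > read2.fastq.gz

n1=$(gzip -dc read1.fastq.gz | wc -l)
n2=$(gzip -dc read2.fastq.gz | wc -l)
[ "$n1" -eq "$n2" ] && echo "mate counts match: $((n1 / 4)) read pair(s)"
```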

MariusGheorghe commented 4 years ago

You mean this line?

[2020-03-25T11:21:37] Estimating fragment length WARNING: not enough chimeric reads to estimate mate gap distribution, using default values

Yeah, I figured it probably wasn't the best BAM-to-FASTQ conversion, as the file pretty much exploded in size. I just wanted to try it on that sample because the initial data file was pretty large. I will take your advice on the conversion tool :]

suhrig commented 4 years ago

Yes, that's the line. When this message is shown, either you are using a reeeeeally small dataset or something is wrong with the data.

MariusGheorghe commented 4 years ago

Good to know. Thanks, man! Please keep me updated if you add the full protein sequence in the next release.

suhrig commented 4 years ago

Will do.

One last remark: if a large memory footprint is an issue for you, you can generate a STAR index with --genomeSAsparseD 2, which cuts the index size (and thus the memory footprint) in half at the expense of longer runtimes. Higher values reduce the footprint even further. Still, the total runtime should be shorter than running both HiSat2 and STAR.
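For illustration, generating such a sparse index might look like this (a sketch only; the output directory and thread count are placeholders, and the reference/annotation file names are taken from the log above):

```shell
# Sketch: build a sparse STAR index. --genomeSAsparseD 2 stores only every
# 2nd suffix array entry, roughly halving the index size and, with it,
# the memory footprint during alignment.
STAR --runMode genomeGenerate \
     --genomeDir /ref/STAR_index_sparse \
     --genomeFastaFiles GCA_000001405.15_GRCh38_no_alt_analysis_set.fna \
     --sjdbGTFfile Homo_sapiens.GRCh38.95.sort.gtf \
     --genomeSAsparseD 2 \
     --runThreadN 8
```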

MariusGheorghe commented 4 years ago

Thanks for the tip. My concern was more about the memory footprint of the alignReads part of STAR, but it's good to know about that parameter for indexing.

suhrig commented 4 years ago

The parameter reduces the memory footprint during alignment as well. STAR's memory footprint during alignment is only so big because the index is so big; if the index size is cut in half, the alignment memory footprint is cut in half, too.

MariusGheorghe commented 4 years ago

Oh, okay. I will give it a try. Thanks! :metal: