Segmentation Fault - Githubissues

14zac2 commented 4 years ago

Hello!

I am trying to use Arriba on a bit of an obscure genome (the woodchuck genome) to detect virus integration from a hybrid genome generated by cellranger mkref. I installed Arriba via Bioconda and ran STAR first to make sure STAR ran successfully before running Arriba. My issue is that I am running into a segmentation fault when running Arriba. My code to run Arriba is as follows:

arriba \ -x Aligned.out.bam \ -f blacklist \ -f intronic \ -g genes.gtf \ -a genome.fa \ -o fusions.tsv \ -O fusions.discarded.tsv \ -I -T \ -i WHV____NC_004107.1

and it gives me this output:

arriba \ -x Aligned.out.bam \ -f blacklist \ -f intronic \ -g genes.gtf \ WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 WARNING: CDS record has unknown transcript ID: MONAX5_unknown_transcript_1 [2020-07-21T08:20:43] Loading assembly from 'genome.fa' [2020-07-21T08:20:48] Reading chimeric alignments from 'Aligned.out.bam'arriba.sh: line 28: 88217 Segmentation fault arriba -x Aligned.out.bam -f blacklist -f intronic -g genes.gtf -a genome.fa -o fusions.tsv -O fusions.discarded.tsv -I -T -i WHV____NC_004107.1

I already know that there are a few potential issues here: I'm not using a blacklist (because of the woodchuck genome), and I'm only specifying the viral chromosome (the woodchuck genome assembly is pretty poor quality so it has tens of thousands of contigs), but I'm not sure why any of this would be giving me a segmentation fault, and there are some unknown transcript IDs (that's fine, I think). But the STAR output in both Bam and Junction format looks look, and all I'm really trying to do is convert contig name and basepair to gene ID. Do you have any advice?

Thanks! Zoe

suhrig commented 4 years ago

Hi Zoe,

I am not sure where the segmentation fault comes from. Arriba should never segfault, but rather throw an error message - even when the provided data is obscure. Without a test sample this is hard to debug. Could you point me to a RNA-Seq sample and the woodchuck / viral genomes, so I can some tests myself?

Some more comments:

-f blacklist \ -f intronic

The two parameters should be merged to one: -f "blacklist intronic". Otherwise the second one will override the first one. I know that this is counter-intuitive, sorry. I have already implemented that this kind of usage causes an error to be thrown in the next release, but in the current release, Arriba silently ignores is.

I'm only specifying the viral chromosome

If you are interested in integration of viral DNA into the woodchuck genome, then you should specify all contigs. Otherwise, only rearrangements within the viral contig are reported, but none across different contigs. Don't worry about the genome being scattered. Arriba should be able to handle it. I have run it with tens of thousands of contigs. On the contrary, I could even imagine that this is the cause of the segfault. So please give it a try specifying all contigs or simply add uninteresting_contigs to the parameter -f, which will cause Arriba to treat all contigs as "interesting".

there are some unknown transcript IDs (that's fine, I think)

Yes, should be harmless.

Regards, Sebastian

14zac2 commented 4 years ago

Hi Sebastian,

Thanks for getting back to me so soon! I can certainly make those adjustments and send you some files to experiment with.

The woodchuck/virus genome I was referring to can be found here: https://www.dropbox.com/sh/oc0d7ic3reqpcsz/AABayr6_51fAVzLWrIqq1TMXa?dl=0

And I can give you one of the fastq files (I had concatenated a bunch) here: https://www.dropbox.com/s/e0dwy7mggqhbuvk/L207_TLH_S7_L001_R2_001.fastq.gz?dl=0

Please let me know if you need anything else!

Thanks again, Zoe

suhrig commented 4 years ago

Thanks for the helpful test data.

Welp, I was wrong. It's indeed the massive number of contigs that's causing the issue (~48000). Arriba can "only" handle ~32000. It should be fairly easy to double this number, though. I will do some testing and - if successful - push a patched version to the develop branch in the upcoming days.

BTW, listing all contigs as an argument to parameter -i probably does not work either, because most shells don't allow such long command lines. Also, disabling the filter uninteresting_contig does not alleviate the problem like I thought. Instead, another error is thrown. So please ignore my suggested workaround above. The only solution is to increase the maximum number of contigs. Stay tuned.

14zac2 commented 4 years ago

Thanks so much!! Yes, I understand that it is a ridiculous number of contigs! I look forward to a potential fix, and really appreciate your time addressing this issue.

suhrig commented 4 years ago

It was indeed really easy to fix. I also enhanced Arriba to report a proper error message, when the maximum number of contigs is exceeded (now ~65000). I just pushed a patch to the develop branch. I tested it successfully on your data. You can obtain it as follows:

git clone https://github.com/suhrig/arriba.git
cd arriba
git checkout develop
make

This version of Arriba also includes the possibility to use wildcards to specify interesting contigs. So in your case, you should simply write -i "*".

14zac2 commented 4 years ago

That's fantastic! I've tried it out, and it works great!! I really appreciate you making this update. Will you be incorporating it into your next official version of Arriba?

I am also wondering your opinion on a frequent result I am seeing: because my reads are short (98bp single end), short sequences are mapped to multiple genes. I see these listed twice in the output file although they are supported by the same reads. I can give you an example:

MONAX5_MONAX_5E004945(116),MONAX5_MONAX_5NCB058415(2095) WHV____WhBvgp1 ./- +/- MONAX5_CABDUW010000203.1:118270 WHV____NC_004107.1:883 intergenic CDS translocation upstream downstream 4 1 0 1061 2518 medium . . . CCACAATCATTCCTGCCGCACAGCCCAGAGTGGCAGTTTTTCATGTTCTTTCCAGTTCTTTCCAGAAGTGTGGGCCTCCA|AAACCAAATCATCCATATAAGCAAAAACCACGCAATGAGGGAAATTCCTC . WHV____WhBvgp1 . . . . D00165:219:CCR13ANXX:1:2101:10037:22158,D00165:219:CCR13ANXX:1:2312:7367:7259,D00165:219:CCR13ANXX:3:1315:7940:3030,D00165:219:CCR13ANXX:3:2314:20863:97672,D00165:219:CCR13ANXX:2:1316:1764:72727 MONAX5_MONAX_5E004945(116),MONAX5_MONAX_5NCB058415(2095) WHV____WhBvgp2 ./- +/- MONAX5_CABDUW010000203.1:118270 WHV____NC_004107.1:883 intergenic CDS translocation upstream downstream 4 1 0 1061 2518 medium . . . CCACAATCATTCCTGCCGCACAGCCCAGAGTGGCAGTTTTTCATGTTCTTTCCAGTTCTTTCCAGAAGTGTGGGCCTCCA|AAACCAAATCATCCATATAAGCAAAAACCACGCAATGAGGGAAATTCCTC . WHV____WhBvgp2 . . . . D00165:219:CCR13ANXX:1:2101:10037:22158,D00165:219:CCR13ANXX:1:2312:7367:7259,D00165:219:CCR13ANXX:3:1315:7940:3030,D00165:219:CCR13ANXX:3:2314:20863:97672,D00165:219:CCR13ANXX:2:1316:1764:72727

You can see that this is the same integration site, but the virus mapped to two spots on the viral genome. Is this supposed to happen? Or perhaps is there a way to list both virus genes as an option in a single line? I know it's probably because my reads are short, and it's not a big deal if I just have to be aware of this and treat the output accordingly.

Thanks again!!

suhrig commented 4 years ago

All of the commits in the develop branch will be part of the next release.

The fusion prediction being listed twice has nothing to do with short reads mapping to multiple loci. It's because the same breakpoint affects two overlapping genes. This is intended behavior, because only the reader knows which of the two genes is of interest. Imagine you are interested in viral integration of gene X. But there is a second gene Y, which overlaps with gene X. Arriba reports both, because it cannot know that you are only interested in X. If you want to count such occurrences as one event you have to ignore lines with identical values in the columns breakpoint1/2.

suhrig / arriba

Segmentation Fault #70