GVec error

RAWWiberg commented 3 years ago

Hi, I'm having some trouble running Stringtie2 (v2.1.4) to assemble mapped isoseq subreads.

I first trim adapters.

lima subreads.bam adapters.fasta subreads.bam

Convert .bam to .fastq

samtools fastq -@ 30 -0 .subreads.fastq subreads.bam

The input .fastq file has 41,609,568 subreads with a mean length of 2,518.6 (range = 51 - 246,426). I have then mapped the reads to the reference genome with minimap2.

minimap2 -ax splice -t 30 -uf --secondary=no -C5 genome.fasta subreads.fastq | samtools view -b > subreads_mapped2genome.bam
samtools sort -@ 50 -o subreads_mapped2genome.srt.bam subreads_mapped2genome.bam

Then I run Stringtie2 with the -L option:

stringtie subreads_mapped2genome.srt.bam -p 50 -L -v -l stringtie-isoseq-GG -o isoseq_stringtie.gtf

It runs fine and seems to produce the temporary output .gtf file, but eventually it stops and returns the message:

GVec error: invalid count: -2006062490

I found the lines in the GVec.hh file where this error is printed but I'm afraid I can't really make sense of what this actually is trying to tell me.

Can you help me figure out what is going on?

Happy to provide more information if needed.

Also, it would be really great if some additional documentation/recommendations for running stringtie with long reads were available. At the moment there is hardly anything. E.g. do you recommend trimming adapters and poly-A tails from subreads? If so, what would you use? I have now trimmed adapters with lima but left the poly-A tails in.

gpertea commented 3 years ago

That GVec error suggests an integer overflow (that value should never be negative). It is possible that a specific, very large bundle is causing this issue. If you run it without the -p option (but keeping the -v option) and capture the stderr log, we might get a better idea of what is happening there. The error might happen during the formation of the bundle, but by looking at the last bundle processed we'll still have an idea of the genomic region where it happens.

If this is indeed a bundle processing issue and if you can share the alignments for the bundle causing this, could you please follow the instructions here: https://github.com/gpertea/stringtie/wiki/Extracting-bundle-data-for-debugging in order to prepare the bundle BAM file and share it with me?

RAWWiberg commented 3 years ago

Hi, Thanks for the quick reply. OK I'll run it again as you suggest to see where things are going wrong and get back to you. Cheers, Axel

gpertea commented 3 years ago

As it crashes please look at the last line of the log - if it starts with ^bundle and has the word "done" following the genomic location it means the problem is during the formation of the next bundle and we won't really know where that huge bundle starts and ends. If that is the case I'll provide another way to determine the actual location/extent of the bundle.
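A minimal sketch of that check in Python, assuming the -v log lines look like `[date time]>bundle contig:start-end ...` and that a finished bundle carries the word "done" after its location. The function name and the regular expression are illustrative, not part of StringTie.

```python
import re

# Assumed shape of a -v log line: "...>bundle <contig>:<start>-<end><rest>"
BUNDLE_RE = re.compile(r">bundle (\S+):(\d+)-(\d+)(.*)")

def last_bundle_status(log_text):
    """Return (contig, start, end, status) for the last bundle line, or None."""
    last = None
    for line in log_text.splitlines():
        match = BUNDLE_RE.search(line)
        if match:
            last = match
    if last is None:
        return None
    contig, start, end, rest = last.groups()
    # Assumption: a completed bundle has the word "done" after its location.
    status = "done" if re.search(r"\bdone\b", rest) else "in progress"
    return contig, int(start), int(end), status
```

If the status comes back as "done", the crash happened while the next bundle was being formed, so the reported coordinates are not those of the problem bundle.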

RAWWiberg commented 3 years ago

Hi again,

I can report that I get the same error, below is the error and the line of the log immediately before the crash:

[02/17 17:21:27]>bundle tig00053199:293-35345953 [7779819 alignments (3993315 distinct), 740658 junctions, 0 guides] begins processing...
GVec error: invalid count: -2006062490

So I followed the instructions at the link you sent and tried it again for the subset.

Extracting bundles for debugging

samtools index subreads_mapped2genome.bam

Extract the alignments from that bundle

samtools view -b subreads_mapped2genome.bam tig00053199:293-35345953 > bundle_tig00053199.bam

Try with this subset

stringtie bundle_tig00053199.bam -L -v -l bundle_stringtie-isoseq-GG -o ./bundle_isoseq_stringtie.gtf

It takes a while but I get the output:

Default stack size for threads: 8388608 (increased to 8388608)
[02/17 19:23:42]>bundle tig00053199:293-35345953 [7779819 alignments (3993315 distinct), 740658 junctions, 0 guides] begins processing...
GVec error: invalid count: -2006062490

It looks like it's failing on a super long contig (35Mb), and it looks like stringtie is trying to process most of it at the same time (7.8M alignments, >700k junctions). This subset of the .bam file is quite large (7.8Gb). Is there a preferred way to get such a file to you if you think you need it?

Cheers, Axel

gpertea commented 3 years ago

That's a monster bundle with over 7 million alignments (4 million distinct) and 740k junctions. Processing such a bundle, with such a large number of junctions, can also take a lot of memory, but I did not expect the GVec limit to be hit first :). I'll send you an email with the option to upload this bundle to an FTP server. Meanwhile, if it's easier for you to upload to a cloud storage service and share the download link with me, please go ahead.

I think it would be good to increase the stringency of the aligner, so perhaps the alignments in that BAM file can be filtered to eliminate some of the spurious junctions. If you can also upload the FASTA sequence for that contig, it may be helpful to filter/validate the splice sites.
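One possible way to do that kind of filtering, sketched in Python: drop alignments whose CIGAR string contains a skipped region (an N operation) longer than some cutoff. The function names and the cutoff are illustrative assumptions, not something StringTie or the aligner prescribes, and a suitable cutoff depends on the organism.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def max_intron(cigar):
    """Length of the longest N (skipped region) operation in a CIGAR string."""
    return max((int(n) for n, op in CIGAR_OP.findall(cigar) if op == "N"),
               default=0)

def filter_sam(lines, max_len):
    """Yield header lines and alignments whose longest intron is <= max_len."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        # Column 6 of a SAM record is the CIGAR string.
        if len(fields) > 5 and max_intron(fields[5]) <= max_len:
            yield line
```

The generator could be fed SAM text produced by `samtools view -h` and its output piped back into `samtools view -b` to get a filtered BAM file.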

pynie1 commented 3 years ago

Hi, I have the same problem when running Stringtie2 (v2.1.4) on Illumina data. Here is my command:

hisat2 --dta -p 6 --max-intronlen 5000000 --rna-strandness RF -x Homo_sapiens.GRCh38.genome -1 Input.1.fq -2 Input.2.fq -S Output.HISAT_aln.sam >hisat2_running.log
samtools view -F 4 -Su Output.HISAT_aln.sam | samtools sort - Output.accepted_hits
samtools index Output.accepted_hits.bam
stringtie Output.accepted_hits.bam -G Homo_sapiens.GRCh38.gtf -l Test --rf -o Test.transcripts.gtf

And then the error:

[02/18 09:33:40]>bundle 17:508668-50898713 [4642422 alignments (520171 distinct), 9425 junctions, 4354 guides] begins processing...
GVec error: invalid count: -1175242126

Could you please give me some advice on how to solve it? Looking forward to your reply.

beaferbl commented 3 years ago

Hi, I get a similar error with some BAM files:

GVec error: invalid index: 7232
GVec error: invalid index: 1411

This occurs at the beginning of the processing of certain bundles: a 5Mb bundle with 4M alignments (400k distinct) in the first case, and a 4.8Mb bundle with 530k alignments (300k distinct) in the second one. Is the size of the bundles causing the process to stop? They are larger than the other bundles that complete. I am working in a virtual machine. Does the number reported in the error (i.e. 7232, 1411) give a hint about the error itself?

melop commented 8 months ago

I can confirm a similar error when using the --mix option in v2.2.0 when the BAM files are large. Is there a way to filter out some of these spurious junctions, especially those spanning many Mb?

pratarora commented 8 months ago

Hello! We are also having a similar error. We also tried with -j 3, but it then gets stuck at another bundle somewhere else. If you want, we can send you our BAM file. We are using v2.2.1.

Any help is appreciated!

Thanks a lot!

gpertea commented 8 months ago

Can you clarify what the error message says exactly? If the GVec error message shows a negative value, that is a possible sign of integer overflow (bundles may be too large). But if the numbers are positive, it could be a different issue that should be investigated more closely.
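As a rough sketch of that distinction in Python, based only on the two message shapes quoted in this thread ("invalid count" and "invalid index"). The function name and the returned labels are illustrative.

```python
import re

# Matches the two GVec message shapes seen in this thread.
GVEC_RE = re.compile(r"GVec error: invalid (count|index): (-?\d+)")

def classify_gvec_error(message):
    """Return (kind, value, verdict) for a GVec error line, or None."""
    match = GVEC_RE.search(message)
    if match is None:
        return None
    kind, value = match.group(1), int(match.group(2))
    # A negative value is a possible sign of 32-bit integer overflow;
    # a positive one may be a different problem altogether.
    if value < 0:
        verdict = "possible integer overflow"
    else:
        verdict = "needs closer investigation"
    return kind, value, verdict
```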

@melop can you try v2.2.1 to see if you get the same error? (and tell me what the exact error message was)

@pratarora it would be great to make the BAM file (and any other input files) available to me. It would be even better if you could isolate and share only a specific bundle that triggers the error by following the suggestions here: https://github.com/gpertea/stringtie/wiki/Extracting-bundle-data-for-debugging

pratarora commented 8 months ago

@gpertea Thanks for the prompt reply! Our GVec value is negative:

GVec error: invalid count: -2071514734

It does seem to move forward if we increase the gap (-g) to 200. We will work to get the bundle and email it to you soon. Thanks a lot for your help!

Best regards, Prateek

gpertea commented 7 months ago

@pratarora, my apologies for the delay. I got the bundle data you shared (thank you!), and thanks to it we were able to find the problem. We'll fix it in the next few days.

Leaving a technical note here for me and @mpertea :

Problem happens in the re-allocation of the overlap array at this line in print_predcluster(). npred is 47154 which makes the product overflow the max signed int value.
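A quick arithmetic check in Python, assuming the overflowing product is npred * (npred - 1). That form is an inference and is not spelled out in the note, but on a typical platform where a 32-bit signed int wraps around, it reproduces exactly the value in the reported error message.

```python
INT32_MAX = 2**31 - 1

def wrap_int32(value):
    """Reduce an arbitrary integer to what a wrapping 32-bit signed int holds."""
    return (value + 2**31) % 2**32 - 2**31

npred = 47154
product = npred * (npred - 1)   # assumed form of the product, not taken from the source

print(product)                  # 2223452562
print(product > INT32_MAX)      # True
print(wrap_int32(product))      # -2071514734
```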

gpertea commented 7 months ago

Fixed by commit 5f21416. The fix is included in release v2.2.2. @pratarora let me know if that fully solves the problem on your data, which seem to have very large bundles.

pratarora commented 7 months ago

Thanks a lot @gpertea. We will try it out and get back to you!

pratarora commented 6 months ago

@gpertea We tried the new version and it seems to be working for us. Thanks a lot!

gpertea commented 6 months ago

Thank you so much for the confirmation. This still ties into a larger issue that I have to address, namely that some internal data structures need a 64-bit conversion, but I'll close this for now as the fix seems to address this particular limitation in the case of complex bundles with long reads.

gpertea / stringtie

GVec error #320