pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Index building for test data crashes #81

Closed iwohlers closed 8 years ago

iwohlers commented 8 years ago

After installing either the latest version from git or source version v0.42.3 from http://pachterlab.github.io/kallisto/download.html on my 32-bit Linux (Debian), index building crashes:

~/Software/kallisto-0.42.3/test$ ~/Software/kallisto/build/src/kallisto index -i transcripts.idx transcripts.fasta.gz

[build] loading fasta file transcripts.fasta.gz
[build] k-mer length: 31
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... done
[build] creating equivalence classes ... kallisto: /home/inken/Software/tmp/kallisto/src/KmerIndex.cpp:588: void KmerIndex::FixSplitContigs(const ProgramOptions&, std::vectorstd::vector&): Assertion `search != kmap.end()' failed.
Aborted

Since I am on a 32-bit Linux, I unfortunately cannot use the binaries. I guess eventually I want to do this (?), which also crashes:

~/Software/kallisto/build/src/kallisto index -i Homo_sapiens.GRCh38.rel79.cdna.all.fa.idx Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz

[build] loading fasta file Homo_sapiens.GRCh38.rel79.cdna.all.fa.gz
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10) from 1372 target sequences
[build] warning: replaced 85 non-ACGUT characters in the input sequence with pseudorandom nucleotides
[build] counting k-mers ... terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

Is the second perhaps an issue with the 32-bit architecture?

Thanks for your help! Inken

pmelsted commented 8 years ago

Duplicate of #42.

Indexing is known to not work on 32-bit systems.

maximilianh commented 8 years ago

You could use #if __SIZEOF_POINTER__ == 8 (gcc only) to check whether the user is trying to compile on 32-bit, and output a warning message if so.
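
A minimal sketch of the compile-time check suggested above, not taken from the kallisto sources. It assumes GCC or Clang for the predefined `__SIZEOF_POINTER__` macro and the `#warning` directive; the helper names `pointer_bits` and `build_target_note` are hypothetical and exist only for illustration.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// GCC/Clang predefine __SIZEOF_POINTER__ (pointer size in bytes).
// Emit a compile-time warning when pointers are narrower than 8 bytes.
#if defined(__SIZEOF_POINTER__) && __SIZEOF_POINTER__ < 8
#warning "Compiling for a 32-bit target: index building is known not to work here."
#endif

// Portable alternative that turns the warning into a hard error.
// Left commented out so that 32-bit builds still compile:
// static_assert(sizeof(void*) >= 8, "a 64-bit build is required");

// Hypothetical helper: pointer width of the current build, in bits.
constexpr std::size_t pointer_bits() {
    return sizeof(void*) * CHAR_BIT;
}

// Hypothetical helper: a message the program could print at startup.
constexpr const char* build_target_note() {
    return pointer_bits() >= 64
        ? "64-bit build"
        : "32-bit build: indexing is known not to work";
}
```

Printing `build_target_note()` at startup would give users on 32-bit systems a clear message before the assertion failure or `std::bad_alloc` appears.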

ashishdamania commented 8 years ago

I am getting this error on a 64-bit computer with 64 GB of RAM. I have tried k-mer lengths 31, 25, and 21, and all of them give the same error.

deepthoughts@cvl-microbiome:/mnt/microbiome/kallisto_linux-v0.42.4$ ./kallisto index -i metagenome_database -k 21 ../Martin_etal_TextS3_13Dec2011.fasta

[build] loading fasta file ../Martin_etal_TextS3_13Dec2011.fasta
[build] k-mer length: 21
[build] warning: clipped off poly-A tail (longer than 10)
        from 146 target sequences
[build] warning: replaced 15388190 non-ACGUT characters in the input sequence
        with pseudorandom nucleotides
[build] counting k-mers ... terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted (core dumped)

Here is my computer info:

deepthoughts@cvl-microbiome:/mnt/microbiome/kallisto_linux-v0.42.4$ uname -a
Linux cvl-microbiome 3.13.0-32-generic #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

ashishdamania commented 8 years ago

Just read this post and comment: https://liorpachter.wordpress.com/2015/10/27/straining-metagenomics/


November 25, 2015 at 6:44 pm Lorian Schaeffer

Thanks for your questions! You’re completely right, we left out a few important steps in FASTA handling; they’ll be in the next revision of the paper. In short, we dropped viral and eukaryote genomes, then dropped 6 genomes that didn’t have sequence GI numbers (because both Kraken and CLARK require them), and dropped one header that was empty of actual sequence (gi|308222630). This resulted in 1,858 different strains, spread over 182,505 fasta entries. If your post-processing counts are significantly different from this, feel free to email me at my full name at gmail.com and we can figure out what’s missing.

For your second question, the index was built at the default of k=31. Right now, large indexes require a very large amount of RAM — I built the RGD index on a server with 430GB of RAM — but that’s going to change very soon; Páll Melsted is testing an indexer that’s significantly lighter weight, and should greatly reduce the RAM requirements of metagenomic indexes.


I guess it will require a lot of RAM.

AntonioGPS commented 3 years ago

Using the latest version, 0.46.1, with 32 GB of RAM, I got the kallisto index std::bad_alloc error while processing a 346 MB fasta.gz file.

Is there currently a solution to this that does not require installing more RAM?