mcveanlab / mccortex

De novo genome assembly and multisample variant calling
https://github.com/mcveanlab/mccortex/wiki
MIT License

mccortex unitigs: "Fatal Error: Not enough kmers in hash" or "Fatal Error: Hash table is full" #90

Open karel-brinda opened 3 years ago

karel-brinda commented 3 years ago

Preparation:

$ wget http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx/ERR189/ERR189737/cleaned/ERR189737.ctx.bz2

$ bzip2 -d -k ERR189737.ctx.bz2

Failure mode 1:

$ mccortex31 unitigs ERR189737.ctx
[10 Sep 2020 23:21:08-DAF][cmd] mccortex31 unitigs ERR189737.ctx
[10 Sep 2020 23:21:08-DAF][cwd] /private/tmp/~20200910223252
[10 Sep 2020 23:21:08-DAF][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[10 Sep 2020 23:21:08-DAF][memory] 73 bits per kmer
[10 Sep 2020 23:21:08-DAF][cmd_mem.c:98] Fatal Error: Not enough kmers in hash: require at least 70,540,096 kmers (min memory: 624.5MB)
$ mccortex31 unitigs ERR189737.ctx
[10 Sep 2020 23:21:18-fOD][cmd] mccortex31 unitigs ERR189737.ctx
[10 Sep 2020 23:21:18-fOD][cwd] /private/tmp/~20200910223252
[10 Sep 2020 23:21:18-fOD][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[10 Sep 2020 23:21:18-fOD][memory] 73 bits per kmer
[10 Sep 2020 23:21:18-fOD][cmd_mem.c:98] Fatal Error: Not enough kmers in hash: require at least 70,540,096 kmers (min memory: 624.5MB)

Failure mode 2:

$ bzcat -f ERR189737.ctx.bz2 | mccortex31 unitigs -
[11 Sep 2020 12:28:09-fIt][cmd] mccortex31 unitigs -
[11 Sep 2020 12:28:09-fIt][cwd] /private/tmp/~20200910223252
[11 Sep 2020 12:28:09-fIt][version] mccortex=v0.0.3-610-g400c0e3 zlib=1.2.11 htslib=1.8-17-g699ed53 ASSERTS=ON hash=Lookup3 CHECKS=ON k=3..31
[11 Sep 2020 12:28:09-fIt][memory] 73 bits per kmer
[11 Sep 2020 12:28:09-fIt][memory] graph: 496.8MB
[11 Sep 2020 12:28:09-fIt][memory] total: 496.8MB of 40GB RAM
[11 Sep 2020 12:28:09-fIt] Output in FASTA format to STDOUT
[11 Sep 2020 12:28:09-fIt][hasht] Allocating table with 56,623,104 entries, using 436MB
[11 Sep 2020 12:28:09-fIt][hasht]  number of buckets: 2,097,152, bucket size: 27
[11 Sep 2020 12:28:09-fIt][graph] kmer-size: 31; colours: 1; capacity: 56,623,104
[11 Sep 2020 12:28:09-fIt][FileFilter] Reading file - [1 src colour]
[11 Sep 2020 12:28:09-fIt][GReader] 18,446,744,073,709,551,615 kmers, 16EB filesize
[11 Sep 2020 12:28:50-fIt][hasht] buckets: 2,097,152 [2^21]; bucket size: 27;
[11 Sep 2020 12:28:50-fIt][hasht] memory: 436MB; filled: 51,626,922 / 56,623,104 (91.18%)
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  0: 49009867
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  1: 1927184
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  2: 462390
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  3: 144851
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  4: 50724
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  5: 19183
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  6: 7551
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  7: 2960
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  8: 1266
[11 Sep 2020 12:28:50-fIt][hasht]  collisions  9: 497
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 10: 276
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 11: 102
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 12: 38
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 13: 21
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 14: 9
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 15: 2
[11 Sep 2020 12:28:50-fIt][hasht]  collisions 16: 1
[11 Sep 2020 12:28:50-fIt][hash_table.c:247] Fatal Error: Hash table is full
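
The figures in the two logs can be cross-checked with a little arithmetic. The sketch below uses only numbers copied from the output above; the variable names are illustrative. It suggests that the default table (56,623,104 entries) is smaller than the 70,540,096 kmers that failure mode 1 reports as required, and that the kmer count printed by [GReader] equals 2^64 - 1. Reading that value as a placeholder for "size unknown when streaming from STDIN" is an assumption, not something the log states.

```python
# Figures copied verbatim from the logs above; names are illustrative.
BUCKETS = 2_097_152            # [hasht] number of buckets
BUCKET_SIZE = 27               # [hasht] bucket size
FILLED = 51_626_922            # [hasht] filled entries at the time of the crash
REQUIRED_KMERS = 70_540_096    # "require at least ..." from failure mode 1
REPORTED_KMERS = 18_446_744_073_709_551_615  # [GReader] kmer count from STDIN

# Table capacity is buckets times bucket size.
capacity = BUCKETS * BUCKET_SIZE
fill_percent = 100 * FILLED / capacity
shortfall = REQUIRED_KMERS - capacity

print(f"capacity: {capacity:,}")
print(f"filled: {fill_percent:.2f}%")
print(f"entries missing for the required kmers: {shortfall:,}")
print(f"reported kmer count is 2**64 - 1: {REPORTED_KMERS == 2**64 - 1}")
```

If this reading is right, the streamed run could not size the table from the file header, fell back to a default capacity, and ran out of room while loading.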

karel-brinda commented 3 years ago

It might be related to https://github.com/mcveanlab/mccortex/issues/89.

karel-brinda commented 3 years ago

Further experiments showed that adding -m 20G helps; I had not realized that this parameter is also needed for the unitigs subcommand.
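
To get a feel for how much memory is actually needed, here is a rough sketch that multiplies a kmer count by the "73 bits per kmer" figure from the logs. The function name is hypothetical, and the formula is only a lower-bound estimate: it ignores whatever extra per-bucket overhead mccortex allocates, and it assumes the "MB" in the logs means 2^20 bytes.

```python
BITS_PER_KMER = 73  # from the "[memory] 73 bits per kmer" log line


def min_graph_bytes(n_kmers: int, bits_per_kmer: int = BITS_PER_KMER) -> int:
    """Lower-bound estimate of graph memory in bytes (hypothetical helper)."""
    # Round up to a whole number of bytes.
    return (n_kmers * bits_per_kmer + 7) // 8


MIB = 2**20  # assumption: the log's "MB" are binary megabytes

# Kmers required according to failure mode 1 (log reports min memory 624.5MB).
required = min_graph_bytes(70_540_096)
# Default table capacity from failure mode 2 (log reports graph: 496.8MB).
default = min_graph_bytes(56_623_104)

print(f"required kmers: at least {required / MIB:.2f} MiB")
print(f"default table:  at least {default / MIB:.2f} MiB")
print(f"20G is {20 * 2**30 / required:.1f}x the estimate for the required kmers")
```

Both estimates come out slightly below the figures mccortex itself prints, which is consistent with some additional overhead. Either way, 20G is far more than this file seems to need; a much smaller -m value above the reported minimum would probably suffice, though that was not tested here.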

Maybe changing the error message "Fatal Error: Hash table is full" to something more informative would help?