skovaka / UNCALLED

Raw nanopore signal mapper that enables real-time targeted sequencing
MIT License
520 stars 44 forks source link

MemoryError: std::bad_alloc #37

Closed hasindu2008 closed 2 years ago

hasindu2008 commented 2 years ago

Thank you for writing UNCALLED. It was a seamless install and worked quite well with small genomes.

Not sure if I am doing something I am not supposed to do :D - I was trying to run uncalled index on chr22, however, I ran into a MemoryError: std::bad_alloc. Seems like it runs out of memory even on a 380GB server. Could it be due to a memory leak or something?

(uncalled) hasindu@genometech-gpgpu:/data/hasindu$ /usr/bin/time -v uncalled index chr22.fa
Using previously built BWA index.
Note: to fully re-build the index delete files with the "chr22.fa.*" prefix.
Initializing parameter search
Traceback (most recent call last):
  File "/home/hasindu/uncalled/bin/uncalled", line 339, in <module>
    index_cmd(args)
  File "/home/hasindu/uncalled/bin/uncalled", line 56, in index_cmd
    p = unc.index.IndexParameterizer(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 62, in __init__
    self.calc_map_stats(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 82, in calc_map_stats
    fmlens = unc.self_align(args.bwa_prefix, sample_dist)
MemoryError: std::bad_alloc
Command exited with non-zero status 1
        Command being timed: "uncalled index chr22.fa"
        User time (seconds): 6625.89
        System time (seconds): 770.15
        Percent of CPU this job got: 99%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 2:03:20
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 380383136
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 481
        Minor (reclaiming a frame) page faults: 254323991
        Voluntary context switches: 1470
        Involuntary context switches: 408568
        Swaps: 0
        File system inputs: 286888
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 1
skovaka commented 2 years ago

Sorry for the trouble! It looks like it failed when trying to determine the repetitiveness of the reference via self-alignment, which could fail if the genome is too repetitive. Did you run any repeat masking, namely the "internal masking" described here? You could also try reducing the number of alignments checked via the "--max-samples" option. Try adding --max-samples 100000 to the index command (the default is 1000000).

hasindu2008 commented 2 years ago

I please check if I did it right. I am still getting the same error:

$ ./mask_internal.sh chr22.fa 10 30 chr22_internal_masked.fa
Iteration 0: masked 51812 occurences of TTTTTTTTTT
Iteration 1: masked 51562 occurences of AAAAAAAAAA
Iteration 2: masked 9296 occurences of TGTAATCCCA
Iteration 3: masked 9198 occurences of TGGGATTACA
Iteration 4: masked 8816 occurences of GGAGGCTGAG
Iteration 5: masked 8752 occurences of CTCAGCCTCC
Iteration 6: masked 7462 occurences of CACACACACA
Iteration 7: masked 7046 occurences of CCCAGGCTGG
Iteration 8: masked 6902 occurences of CCAGCCTGGG
Iteration 9: masked 6797 occurences of TGTGTGTGTG
Iteration 10: masked 6211 occurences of TGCAGTGAGC
Iteration 11: masked 6135 occurences of GCTCACTGCA
Iteration 12: masked 6018 occurences of AAAATACAAA
Iteration 13: masked 5924 occurences of TTTGTATTTT
Iteration 14: masked 4984 occurences of TCCCAAAGTG
Iteration 15: masked 4982 occurences of AGTGCAGTGG
Iteration 16: masked 4922 occurences of CCACTGCACT
Iteration 17: masked 4912 occurences of CACTTTGGGA
Iteration 18: masked 4707 occurences of GCAGGAGAAT
Iteration 19: masked 4690 occurences of AGGTCAGGAG
Iteration 20: masked 4621 occurences of ATTCTCCTGC
Iteration 21: masked 4536 occurences of CTCCTGACCT
Iteration 22: masked 4095 occurences of TATATATATA
Iteration 23: masked 4022 occurences of AGACCAGCCT
Iteration 24: masked 4013 occurences of AGGCTGGTCT
Iteration 25: masked 3850 occurences of CCCAGCTACT
Iteration 26: masked 3757 occurences of GGTGAAACCC
Iteration 27: masked 3713 occurences of AGTAGCTGGG
Iteration 28: masked 3654 occurences of GGGTTTCACC
Iteration 29: masked 3339 occurences of GTCTCTACTA
$uncalled index chr22_internal_masked.famask30.fa --max-samples 100000
[bwa_index] Pack FASTA... 0.53 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=101636936, availableWord=19151484
[BWTIncConstructFromPacked] 10 iterations done. 31590664 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 58359704 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 82148056 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 101636936 characters processed.
[bwt_gen] Finished constructing BWT in 40 iterations.
[bwa_index] 37.85 seconds elapse.
[bwa_index] Update BWT... 0.30 sec
[bwa_index] Pack forward-only FASTA... 0.39 sec
[bwa_index] Construct SA from BWT and Occ... 11.30 sec
Initializing parameter search
Traceback (most recent call last):
  File "/home/hasindu/uncalled/bin/uncalled", line 339, in <module>
    index_cmd(args)
  File "/home/hasindu/uncalled/bin/uncalled", line 56, in index_cmd
    p = unc.index.IndexParameterizer(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 62, in __init__
    self.calc_map_stats(args)
  File "/home/hasindu/uncalled/lib/python3.6/site-packages/uncalled/index.py", line 82, in calc_map_stats
    fmlens = unc.self_align(args.bwa_prefix, sample_dist)
MemoryError: std::bad_alloc
Command exited with non-zero status 1
skovaka commented 2 years ago

It looks like the issue is the high frequency of Ns in the reference (chromosome 22 is 22% Ns!). BWA indexes are supposed to replace Ns with random bases, but it appears to repeat a pseudo-random sequence when that many Ns occur, which blows up the repeat finding process. The reference I have contains 50 bases per line, and I simply removed all lines which are entirely Ns, which fixed the problem. I will work on a less hacky way of fixing this soon.

hasindu2008 commented 2 years ago

OK I see. But if those are removed, then the mapping coordinates will change, right? And that will give bad results when I call uncalled pafstats?

I am also having a few queries about pafstats.

  1. Does it handle supplementary and secondary alignments in the truthset?
  2. What minimap2 parameters will you recommend to generate the truthset? I am thinking of doing minimap2 -c -x map-ont ref.fa reads.fastq --secondary=no > ref.paf
skovaka commented 2 years ago

Yes, unfortunately that will cause incorrect pafstats results, unless you realign the basecalled reads to the modified reference. Sorry, I know that's not ideal. I will work on a better solution.

We do consider supplementary alignments. We count an alignment as "true" if it matches any alignment in the truth set. We only used default parameters besides "-x map-ont". "-c" isn't required since we only consider the endpoints, and neither is "--secondary=no".