kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

MemoryError #367

Closed: mvelinder closed this issue 4 years ago

mvelinder commented 5 years ago

Trying to run kevlar... here's what's happening

$ kevlar novel --case HG002.GRCh38.2x250.bam.fq.gz --control HG003.GRCh38.2x250.bam.fq.gz --control HG004.GRCh38.2x250.bam.fq.gz -M 240G -t 48 -o HG002.GRCh38.2x250.bam.fq.gz.kevlar.novel
[kevlar] running version 0.7
[kevlar::novel] Loading control samples
[kevlar::count] - processing "HG003.GRCh38.2x250.bam.fq.gz"
[kevlar::count] Done loading k-mers;
    775080564 reads processed, 16288238472 distinct k-mers stored;
    estimated false positive rate is 0.003
[kevlar::count] - processing "HG004.GRCh38.2x250.bam.fq.gz"
[kevlar::count] Done loading k-mers;
    868593056 reads processed, 16656828271 distinct k-mers stored;
    estimated false positive rate is 0.003
[kevlar::novel] Control samples loaded in 44955.19 sec
[kevlar::novel] Loading case samples
Traceback (most recent call last):
  File "~/bin/miniconda3/envs/kevlar-env/bin/kevlar", line 10, in <module>
    sys.exit(main())
  File "~/bin/miniconda3/envs/kevlar-env/lib/python3.7/site-packages/kevlar/__main__.py", line 30, in main
    mainmethod(args)
  File "~/bin/miniconda3/envs/kevlar-env/lib/python3.7/site-packages/kevlar/novel.py", line 203, in main
    args.save_case_counts,
  File "~/bin/miniconda3/envs/kevlar-env/lib/python3.7/site-packages/kevlar/novel.py", line 72, in load_samples
    band=band, numthreads=numthreads,
  File "~/bin/miniconda3/envs/kevlar-env/lib/python3.7/site-packages/kevlar/count.py", line 35, in load_sample_seqfile
    smallcount=smallcount)
  File "~/bin/miniconda3/envs/kevlar-env/lib/python3.7/site-packages/kevlar/sketch.py", line 118, in allocate
    sketch = createfunc(ksize, target_tablesize, num_tables)
  File "khmer/_oxli/graphs.pyx", line 404, in khmer._oxli.graphs.Counttable.__cinit__
MemoryError: std::bad_alloc

Any help would be great. Thanks!

standage commented 5 years ago

Sorry this fell through the cracks!

How much memory does your machine have? One thing that may not be intuitive for a first-time user is that the -M flag specifies the amount of memory to use for each sample. So if your machine doesn't have 3 × 240 GB = 720 GB of free memory, you would expect to run into problems, and it appears that is what has happened here.

Judging from your false positive rates, you should be able to reduce the memory quite a bit and still be fine. I usually aim for an FPR of 0.1-0.2 for the control/parent samples and 0.2-0.4 for the case/proband sample (a high FPR in the proband can be compensated for at the kevlar filter step). For 30-40x coverage data, this usually requires 64-72 GB per sample. Doing error correction on the reads beforehand can reduce this drastically (the majority of unique k-mers span a sequencing error), but you might miss a few SNPs this way.
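For example, a reduced-memory version of your original command might look like this (just a sketch: the 72G value is an assumption based on the per-sample guidance above, and it presumes your machine has roughly 3 × 72 GB = 216 GB free; adjust -M until the reported FPRs land in the ranges I mentioned):

$ kevlar novel --case HG002.GRCh38.2x250.bam.fq.gz \
    --control HG003.GRCh38.2x250.bam.fq.gz --control HG004.GRCh38.2x250.bam.fq.gz \
    -M 72G -t 48 -o HG002.GRCh38.2x250.bam.fq.gz.kevlar.novel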

Hope this helps!

standage commented 5 years ago

Also, I'd recommend running kevlar count before kevlar novel so you don't have to re-count k-mers every time you run the novel k-mer discovery step.
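Roughly, that means counting each control sample once with kevlar count and then pointing kevlar novel at the saved counttables. A sketch only (the --control-counts option and the kevlar count argument order are from memory and may differ in your kevlar version, and the *.counttable names are placeholders; please check kevlar count --help and kevlar novel --help):

$ kevlar count -M 72G -t 48 HG003.counttable HG003.GRCh38.2x250.bam.fq.gz
$ kevlar count -M 72G -t 48 HG004.counttable HG004.GRCh38.2x250.bam.fq.gz
$ kevlar novel --case HG002.GRCh38.2x250.bam.fq.gz \
    --control-counts HG003.counttable HG004.counttable \
    -M 72G -t 48 -o HG002.GRCh38.2x250.bam.fq.gz.kevlar.novel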

In fact, I'd recommend this Snakemake workflow for executing the entire pipeline. Please feel free to ask if anything is unclear.

standage commented 4 years ago

Feel free to reopen this thread if you have any additional questions.