kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License

Execution failed due to high FPR in case #368

Closed: pkothiyal closed this issue 4 years ago

pkothiyal commented 4 years ago

Hello,

We are testing kevlar on a single human trio for de novo mutation calling. We originally had Illumina BAM files, which we converted to FASTQ to run kevlar. This is high-coverage data (~100x), and I don't believe any error correction was performed. I ran the test using the example Snakemake workflow and kept all parameters the same, except that I assigned 12G of memory to each trio member. I used 64 cores for the job.

It looks like the job is failing due to a high FPR in the case sample (~0.95). Before I play around with the max_fpr setting for the case, I wanted to check whether I'm missing something more fundamental here. Looking at the config file, I just noticed that I should also change the mean coverage from 30 to 100!

Would appreciate any feedback on how I can complete this test run. Thank you for your help. Prachi

Error log:

[Mon Oct 21 13:28:42 2019]
Job 3: Count k-mers in the case sample

kevlar --tee --logfile Logs/casecount.log count --ksize 31 --memory 12G --max-fpr 0.3 --threads 32 Sketches/case-counts.counttable --mask Mask/mask.nodetable Reads/case.inseq.0.fastq
[kevlar] running version 0.7
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "Reads/ctrl0.inseq.0.fastq"
[kevlar::count] Done loading k-mers;
    2392573742 reads processed, 6087102444 distinct k-mers stored;
    estimated false positive rate is 0.947 (FPR too high, bailing out!!!)
[Mon Oct 21 19:12:14 2019]
Error in rule count_control:
    jobid: 0
    output: Sketches/ctrl0-counts.counttable, Logs/ctrl0count.log

RuleException:
CalledProcessError in line 243 of /DNM/src/kevlar/Snakefile:
Command 'set -euo pipefail;  kevlar --tee --logfile Logs/ctrl0count.log count --ksize 31 --memory 12G --max-fpr 0.05 --mask Mask/mask.nodetable --threads 32 Sketches/ctrl0-counts.counttable Reads/ctrl0.inseq.0.fastq' returned non-zero exit status 1.
  File "/DNM/src/kevlar/Snakefile", line 243, in __rule_count_control
  File "/Resources/Tools/python3/3.7.0-shared/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Exiting because a job execution failed. Look above for error message
[kevlar] running version 0.7
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "Reads/case.inseq.0.fastq"
[kevlar::count] Done loading k-mers;
    2095006280 reads processed, 6095343019 distinct k-mers stored;
    estimated false positive rate is 0.950 (FPR too high, bailing out!!!)
[Mon Oct 21 20:07:54 2019]
Error in rule count_case:
    jobid: 3
    output: Sketches/case-counts.counttable, Logs/casecount.log

RuleException:
CalledProcessError in line 228 of /DNM/src/kevlar/Snakefile:
Command 'set -euo pipefail;  kevlar --tee --logfile Logs/casecount.log count --ksize 31 --memory 12G --max-fpr 0.3 --threads 32 Sketches/case-counts.counttable --mask Mask/mask.nodetable Reads/case.inseq.0.fastq' returned non-zero exit status 1.
  File "/DNM/src/kevlar/Snakefile", line 228, in __rule_count_case
  File "//Resources/Tools/python3/3.7.0-shared/lib/python3.7/concurrent/futures/thread.py", line 57, in run
Removing output files of failed job count_case since they might be corrupted:
Logs/casecount.log
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message 
standage commented 4 years ago

Greetings @pkothiyal.

A 100x coverage dataset is going to require much more than 12G of memory per sample. If you don't have a machine with hundreds of gigabytes of RAM, you'll need to do some error correction on your reads, and perhaps even consider downsampling to 40x-50x coverage. The kevlar documentation has additional recommendations.
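If you go the downsampling route, the idea is simply to keep a random fraction of reads. The sketch below is a minimal plain-Python illustration; `subsample_fastq` is a hypothetical helper written for this comment, not part of kevlar, and dedicated tools such as `seqtk sample` do the same thing far more efficiently on real data.

```python
import random

def subsample_fastq(lines, fraction, seed=42):
    """Keep roughly `fraction` of the reads in FASTQ text supplied as a
    list of lines (4 lines per record). Hypothetical helper for
    illustration only; not part of kevlar itself."""
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    kept = []
    for i in range(0, len(lines), 4):
        if rng.random() < fraction:
            kept.extend(lines[i:i + 4])  # keep the whole 4-line record
    return kept

# Going from ~100x down to ~50x coverage means keeping about half the reads.
demo = [line for i in range(1000)
        for line in (f"@read{i}", "ACGT", "+", "IIII")]
half = subsample_fastq(demo, 50 / 100)
print(len(half) // 4, "of 1000 reads kept")
```

For paired-end data, use the same seed on both files so mates stay matched (`seqtk sample -s` follows the same convention).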

I hope this helps!

pkothiyal commented 4 years ago

Thanks @standage

I'll try it on a down-sampled trio first.

However, the error message I saw earlier does not mention low memory; it lists a high FPR as the bottleneck. Can I run the workflow on 100x coverage with 64G per sample and expect not to see the high-FPR error, or could the error be related to something other than allocated memory? Thanks.

standage commented 4 years ago

> However, the error message I saw earlier does not mention low memory and lists high FPR as the bottleneck.

I think you're missing the relationship between memory and accuracy (as measured by FPR) described in the documentation I linked in my previous comment. Insufficient memory is the reason the FPR is too high. There are two approaches to fixing this problem: 1) reduce the information content of the reads, or 2) increase memory. Most of the distinct k-mers in a high-coverage data set are the product of sequencing errors. Correcting those errors drastically reduces the number of k-mers that have to be tracked and queried, and thus the memory required for high accuracy (low FPR). On the other hand, if your machine has lots of RAM, you can simply allocate a larger chunk to each sample to increase k-mer count accuracy (reduce the FPR).
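To make the memory/FPR trade-off concrete, here is a rough back-of-the-envelope model of a Count-Min sketch with one-byte counters. This is a simplified textbook approximation, not kevlar's actual estimator, so the numbers will not match the 0.95 in the log exactly; it only illustrates how the collision rate falls as memory grows while the number of distinct k-mers stays fixed. The four-table split is an assumption for this sketch.

```python
import math

def approx_fpr(distinct_kmers, memory_bytes, num_tables=4):
    """Rough collision-rate estimate for a Count-Min sketch with one-byte
    counters: memory is split across `num_tables` tables, each k-mer marks
    one bin per table, and a query collides only if it hits an occupied
    bin in every table. Simplified model, not kevlar's exact estimator."""
    bins_per_table = memory_bytes / num_tables   # one byte per counter
    occupancy = 1.0 - math.exp(-distinct_kmers / bins_per_table)
    return occupancy ** num_tables

n = 6_095_343_019  # distinct k-mers reported for the case sample above
for gigs in (12, 36, 64):
    print(f"{gigs}G -> approx FPR {approx_fpr(n, gigs * 1e9):.3f}")
```

Plugging in the ~6.1 billion distinct k-mers from the log shows why 12G per sample is far too little at 100x; error correction attacks the same term from the other side by shrinking `distinct_kmers` itself.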

pkothiyal commented 4 years ago

Thank you - that makes sense. I'll run it on our cluster with more memory per sample and will also test it on down-sampled data.