kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License
40 stars 9 forks source link

count_control error: estimated false positive rate is 0.385 (FPR too high, bailing out!!! #385

Open moldach opened 3 years ago

moldach commented 3 years ago

After running successfully through the example dataset I've ran kevlar on my own data but am getting an error at the count_control step, that the FPR is too high.

I'm trying to figure out

Error log

Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 8
Rules claiming more threads will be scaled down.
Job counts:
    count   jobs
    16  assemble
    16  call
    1   calls
    1   count_case
    2   count_control
    1   count_reference
    1   create_mask
    1   filter_novel
    1   like_scores
    1   link_input_seqs
    1   link_mask
    1   link_reference
    1   localize
    1   novel
    1   partition
    1   split
    47
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")

[Thu Oct 22 10:46:50 2020]
Job 24: Create internal links for sample sequence data.

[Thu Oct 22 10:46:50 2020]
Job 42: Create internal links for mask sequence data.

[Thu Oct 22 10:46:50 2020]
Job 22: Create internal links for reference genome, and index if needed.

Job counts:
    count   jobs
    1   link_input_seqs
    1
Job counts:
    count   jobs
    1   link_mask
    1
Job counts:
    count   jobs
    1   link_reference
    1
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
[Thu Oct 22 10:46:51 2020]
Finished job 42.
1 of 47 steps (2%) done
[Thu Oct 22 10:46:51 2020]
Finished job 24.
2 of 47 steps (4%) done
[Thu Oct 22 10:46:51 2020]
Finished job 22.
3 of 47 steps (6%) done

[Thu Oct 22 10:46:51 2020]
Job 2: Count k-mers in the reference genome.

kevlar --tee --logfile Logs/refrcount.log count --ksize 31 --counter-size 4 --memory 12G --max-fpr 0.025 --threads 8 Reference/refr-counts.smallcounttable Reference/Homo_sapiens.GRCh37.dna.toplevel.fa
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a small count table, a CountMin sketch with a counter size of 4 bits, for k-mer abundance queries (max abundance 15)
[kevlar::count] - processing "Reference/Homo_sapiens.GRCh37.dna.toplevel.fa"
[kevlar::count] Done loading k-mers;
    297 reads processed, 2486108683 distinct k-mers stored;
    estimated false positive rate is 0.013;
    saved to "Reference/refr-counts.smallcounttable"
[kevlar::count] Total time: 17203.38 seconds
[Thu Oct 22 15:33:50 2020]
Finished job 2.
4 of 47 steps (9%) done

[Thu Oct 22 15:33:50 2020]
Job 23: Generate a mask of sequences to ignore while k-mer counting.

kevlar --tee --logfile Logs/mask.log count --ksize 31 --counter-size 1 --memory 6G --max-fpr 0.005 --threads 8 Mask/mask.nodetable Mask/Homo_sapiens.GRCh37.dna.toplevel.fa
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a node table (Bloom filter) for k-mer presence/absence queries
[kevlar::count] - processing "Mask/Homo_sapiens.GRCh37.dna.toplevel.fa"
[kevlar::count] Done loading k-mers;
    297 reads processed, 2493095857 distinct k-mers stored;
    estimated false positive rate is 0.001;
    saved to "Mask/mask.nodetable"
[kevlar::count] Total time: 5911.40 seconds
[Thu Oct 22 17:12:24 2020]
Finished job 23.
5 of 47 steps (11%) done

[Thu Oct 22 17:12:24 2020]
Job 5: Count k-mers in a control sample

Job counts:
    count   jobs
    1   count_control
    1
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
kevlar --tee --logfile Logs/ctrl1count.log count --ksize 31 --memory 16G --max-fpr 0.05 --mask Mask/mask.nodetable --threads 8 Sketches/ctrl1-counts.counttable Reads/ctrl1.inseq.0.fastq.gz Reads/ctrl1.inseq.1.fastq.gz
[kevlar] running version 0.7+15.gebabd62
[kevlar::count] Storing k-mers in a count table, a CountMin sketch with a counter size of 8 bits, for k-mer abundance queries (max abundance 255)
[kevlar::count] - processing "Reads/ctrl1.inseq.0.fastq.gz"
[kevlar::count] - processing "Reads/ctrl1.inseq.1.fastq.gz"
[kevlar::count] Done loading k-mers;
    851849754 reads processed, 5429363124 distinct k-mers stored;
    estimated false positive rate is 0.385 (FPR too high, bailing out!!!)
[Thu Oct 22 19:34:15 2020]
Error in rule count_control:
    jobid: 0
    output: Sketches/ctrl1-counts.counttable, Logs/ctrl1count.log

RuleException:
CalledProcessError in line 243 of /gpfs/home/moldach/projects/CG00018/Snakefile:
Command 'set -euo pipefail;  kevlar --tee --logfile Logs/ctrl1count.log count --ksize 31 --memory 16G --max-fpr 0.05 --mask Mask/mask.nodetable --threads 8 Sketches/ctrl1-counts.counttable Reads/ctrl1.inseq.0.fastq.gz Reads/ctrl1.inseq.1.fastq.gz' returned non-zero exit status 1.
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/gpfs/home/moldach/projects/CG00018/Snakefile", line 243, in __rule_count_control
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/miniconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/home/moldach/projects/CG00018/.snakemake/log/2020-10-22T104648.700597.snakemake.log

config.json

{
    "ksize": 31,
    "recountmem": "1G",
    "numsplit": 16,
    "samples": {
        "casemin": 6,
        "ctrlmax": 1,
        "case": {
            "fastx": [
                "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R1.fastq.gz",
                "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018P_R2.fastq.gz"
            ],
            "memory": "16G",
            "label": "Proband",
            "max_fpr": 0.3
        },
    "controls": [
            {
                "fastx": [
                    "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R1.fastq.gz",
                    "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018M_R2.fastq.gz"
                ],
                "memory": "16G",
                "label": "Mother",
                "max_fpr": 0.05
            },
            {
                "fastx": [
                    "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R1.fastq.gz",
                    "/tiered/mtgraovac/CombGene/CG00018/FASTQ/CG00018F_R2.fastq.gz"
                ],
                "memory": "16G",
                "label": "Father",
                "max_fpr": 0.05
            }
    ],
    "coverage": {
            "mean": 30.0,
            "stdev": 10.0
        }
    },
    "mask": {
    "fastx": [
            "/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa"
        ],
    "memory": "6G",
        "max_fpr": 0.005
    },
    "reference": {
        "fasta": "/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa",
        "memory": "12G",
        "max_fpr": 0.025
    },
    "localize": {
        "seedsize": 51,
        "delta": 50,
        "seqpattern": ".",
        "maxdiff": 10000
    },
    "varfilter": null
}

Submission

#!/bin/bash
#BSUB -q normal
#BSUB -J kevlar
#BSUB -R "rusage[mem=16G]"
#BSUB -n 8
#BSUB -M 16000
#BSUB -W 600:00
#BSUB -u moldach@ucalgary.ca
#BSUB -R "select[hname!=node013]"
#BSUB -B
#BSUB -N
#BSUB -o kevlar_CG00018.out
#BSUB -e kevlar_CG00018.err

source ~/kavlar-test/kevlar-env/bin/activate
snakemake --snakefile Snakefile --configfile config.json --cores 8 --directory ./ -p calls
standage commented 3 years ago

Hi @moldach! The workflow is failing at a k-mer counting step due to insufficient memory. There are a few ways to address this.

  1. If memory is abundant on your machine, you could simply increase the amount of memory allocated to counting k-mers in each case/proband and control/parent sample.
  2. Alternatively, you could use a tool like Lighter to do error correction* on the reads before running Kevlar. The amount of memory required for counting k-mers accurately depends on the number of distinct k-mers in a data set: sequencing errors often account for the majority of k-mers in a sequencing run, so eliminating those errors will bring the false positive rate down significantly.
  3. Another alternative is to increase the tolerance for error (max_fpr) in some samples. I'd recommend limiting the parents' FPRs to the default 0.05, but I've had decent success while relaxing max_fpr to >0.3 for case/proband samples.

None of these solutions is mutually exclusive: you could increase memory AND do error correction AND increase the max_fpr for the controls. 1) and 3) would be the quickest to try, but only if you have access to a machine with sufficient memory. Note that at some steps of the workflow, all case + control + reference k-mer counts are loaded into memory simultaneously. With your current setup, that looks like 16 + (16 + 16) + 12 GB of memory.


*Error correction for low-coverage reads is challenging, and there were a few instances in which Lighter erroneously "corrected" reads that contained an actual (low coverage) variant rather than a sequencing error. But depending on the constraints of the system to which one has access, missing 1 or 2 variants out of 90-100 is worth the reduction in memory required for k-mer counting.

moldach commented 3 years ago

Hi @standage thank you for getting back to me.

I do have abundant memory so I increased everywhere it said "memory" in the config.json file to 80G - as this seems to be how rules are given memory in the Snakefile.

This time the pipeline ran through 45 of 47 steps before failing with the following error:

[Sun Nov  1 17:19:16 2020]
Finished job 11.
45 of 47 steps (96%) done

[Sun Nov  1 17:19:16 2020]
Job 1: Filter calls, compute likelihood scores, and sort calls by score.

Job counts:
    count   jobs
    1   like_scores
    1
/home/moldach/miniconda3/lib/python3.7/site-packages/pulp/pulp.py:1195: UserWarning: Spaces are not permitted in the name. Converted to '_'
  warnings.warn("Spaces are not permitted in the name. Converted to '_'")
kevlar --tee --logfile Logs/simlike.log simlike --mu 30.0 --sigma 10.0 --epsilon 0.001 --case-min 6 --refr Reference/refr-counts.smallcounttable --sample-labels Proband Mother Father --out calls.scored.sorted.vcf.gz --controls Sketches/ctrl0-counts.counttable Sketches/ctrl1-counts.counttable --case Sketches/case-counts.counttable calls.0.prelim.vcf.gz calls.1.prelim.vcf.gz calls.2.prelim.vcf.gz calls.3.prelim.vcf.gz calls.4.prelim.vcf.gz calls.5.prelim.vcf.gz calls.6.prelim.vcf.gz calls.7.prelim.vcf.gz calls.8.prelim.vcf.gz calls.9.prelim.vcf.gz calls.10.prelim.vcf.gz calls.11.prelim.vcf.gz calls.12.prelim.vcf.gz calls.13.prelim.vcf.gz calls.14.prelim.vcf.gz calls.15.prelim.vcf.gz
[kevlar] running version 0.7+15.gebabd62
[kevlar::simlike] Loading k-mer counts for each sample
Traceback (most recent call last):
  File "/export/home/moldach/kavlar-test/kevlar-env/bin/kevlar", line 33, in <module>
    sys.exit(load_entry_point('biokevlar==0.7+15.gebabd62', 'console_scripts', 'kevlar')())
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/__main__.py", line 30, in main
    mainmethod(args)
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/simlike.py", line 363, in main
    refr = kevlar.sketch.load(args.refr)
  File "/export/home/moldach/kavlar-test/kevlar-env/lib/python3.8/site-packages/kevlar/sketch.py", line 92, in load
    return loadfunc(filename)
  File "khmer/_oxli/graphs.pyx", line 306, in khmer._oxli.graphs.Hashtable.load
OSError: Error reading from k-mer count file: Reference/refr-counts.smallcounttable Cannot allocate memory
[Sun Nov  1 17:23:40 2020]
Error in rule like_scores:
    jobid: 0
    output: calls.scored.sorted.vcf.gz, Logs/simlike.log

RuleException:
CalledProcessError in line 375 of /gpfs/home/moldach/projects/CG00018/Snakefile:
Command 'set -euo pipefail;  kevlar --tee --logfile Logs/simlike.log simlike --mu 30.0 --sigma 10.0 --epsilon 0.001 --case-min 6 --refr Reference/refr-counts.smallcounttable --sample-labels Proband Mother Father --out calls.scored.sorted.vcf.gz --controls Sketches/ctrl0-counts.counttable Sketches/ctrl1-counts.counttable --case Sketches/case-counts.counttable calls.0.prelim.vcf.gz calls.1.prelim.vcf.gz calls.2.prelim.vcf.gz calls.3.prelim.vcf.gz calls.4.prelim.vcf.gz calls.5.prelim.vcf.gz calls.6.prelim.vcf.gz calls.7.prelim.vcf.gz calls.8.prelim.vcf.gz calls.9.prelim.vcf.gz calls.10.prelim.vcf.gz calls.11.prelim.vcf.gz calls.12.prelim.vcf.gz calls.13.prelim.vcf.gz calls.14.prelim.vcf.gz calls.15.prelim.vcf.gz' returned non-zero exit status 1.
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2189, in run_wrapper
  File "/gpfs/home/moldach/projects/CG00018/Snakefile", line 375, in __rule_like_scores
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/home/moldach/miniconda3/lib/python3.7/concurrent/futures/thread.py", line 57, in run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/home/moldach/miniconda3/lib/python3.7/site-packages/snakemake/executors/__init__.py", line 2201, in run_wrapper
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /gpfs/home/moldach/projects/CG00018/.snakemake/log/2020-10-29T164206.431603.snakemake.log

I'd like to be able to re-submit the job with more memory to finish these last two steps but I'm not sure which part of the config.json I should be adjusting the memory for.

Also, when I tried to re-submit again I got the following error:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
('/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa', link_reference)
('/tiered/mtgraovac/indexes/GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa', link_mask)
standage commented 3 years ago

OSError: Error reading from k-mer count file: Reference/refr-counts.smallcounttable Cannot allocate memory

This means you ran out of memory on the machine: it cannot hold all the k-mer count tables in memory. Using 80GB for the reference and the mask file is overkill. Those can (and according to this error, probably should) be kept at their original values. If you delete the mask and reference counttables/nodetables, you should be able to rebuild them and continue with the workflow without the need to start over again from scratch.

moldach commented 3 years ago

Sorry I should have asked earlier.

Was it only "recountmem" which needed to be increased?

Seems like the run completed successfully with your suggestion - so, to confirm, I'll be looking at the result in calls.scored.sorted.vcf.gz?

Thanks

standage commented 3 years ago

Was it only recountmem which needed to be increased?

Actually, not much memory is required for recounting. Increasing the memory for each case and control sample would have been recommended.

so, to confirm, I'll be looking at the result in calls.scored.sorted.vcf.gz?

Yep, that's the one!