kevlar-dev / kevlar

Reference-free variant discovery in large eukaryotic genomes
https://kevlar.readthedocs.io
MIT License
40 stars 9 forks source link

Clarification - how to avoid high FPR #377

Open exeter-matthew-wakeling opened 4 years ago

exeter-matthew-wakeling commented 4 years ago

I'm trying to run Kevlar on a human trio, sequenced on a BGI-Seq 500, so the data has a fairly low error rate, but the reads are not error-corrected. For some inexplicable reason we have been given about 50X-worth of data when we expected 30X. I am getting some errors for too-high FPR. I ran:

kevlar count --memory 60000M b37_mask.ct Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz UniVec.fasta human_g1k_v37.fasta
kevlar count --memory 60000M proband.ct proband_R1.fastq.gz proband_R2.fastq.gz
kevlar count --memory 60000M father.ct father_R1.fastq.gz father_R2.fastq.gz
kevlar count --memory 60000M mother.ct mother_R1.fastq.gz mother_R2.fastq.gz
kevlar novel --case proband_R1.fastq.gz proband_R2.fastq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o proband_novel.augfastq
kevlar filter --mask b37_mask.ct -o proband_filtered.augfastq proband_novel.augfastq

The kevlar filter stage fails with a high FPR. The documentation states that a FPR of <0.5 for control samples and <0.05 for case samples are required, and that this can be achieved for uncorrected reads with 36-72GB of space with a suitable mask. I have tried the above commands for three trios so far, and the FPR rates are as follows:

  1. proband: 0.004, father: 0.053, mother: 0.005, filter: 0.123
  2. proband: 0.046, father: 0.007, mother: 0.047, filter: 0.971
  3. proband: 0.006, father: 0.019, mother: 0.005, filter: 0.629

So, the FPR rates for the samples are under the thresholds stated in the documentation, but the FPR rate is reported much higher by the filter command, and the filter command produces an error message.

I have applied the --mask option to the filter command. I just noticed that the count command also has a --mask option - should I be giving the same mask to the count command as well? The documentation doesn't really make this clear. Should I allocate even more memory to the count process?

I would greatly appreciate being set in the right direction on this one, but I think the documentation could also do with clarification. Many thanks.

standage commented 4 years ago

With respect to using the mask in the kevlar count command with or instead of the kevlar filter command, I agree that the docs could use some clarification. But yes, using the mask at the counting stage will decrease your FPR.

Another option—one that is mentioned in the docs, by the way—is to perform error correction. In my experience, this leads to a dramatic reduction in memory requirements/FPR, and the only side effect is a small loss of sensitivity (1-2 low-coverage SNPs are erroneously treated as sequencing errors and "corrected"). In kevlar's recommended BAM preprocessing workflow we use the Lighter error corrector.

exeter-matthew-wakeling commented 4 years ago

Ok, I have tried running the software with a mask at the count stage. The commands have been:

kevlar count --memory 60000M b37_mask.ct Escherichia_coli_str_k_12_substr_mg1655.ASM584v2.dna.toplevel.fa.gz UniVec.fasta human_g1k_v37.fasta
kevlar count --memory 70000M --mask b37_mask.ct proband.ct proband_R1.fastq.gz proband_R2.fastq.gz
kevlar count --memory 70000M --mask b37_mask.ct father.ct father_R1.fastq.gz father_R2.fastq.gz
kevlar count --memory 70000M --mask b37_mask.ct mother.ct mother_R1.fastq.gz mother_R2.fastq.gz
kevlar novel --case proband_R1.fastq.gz proband_R2.fastq.gz --case-counts proband.ct --control-counts father.ct mother.ct -o proband_novel.augfastq
kevlar filter --mask b37_mask.ct -o proband_filtered.augfastq proband_novel.augfastq

I have run 5 trios. The FPR rates have been as follows:

  1. Proband: 0.000, father: 0.002, mother: 0.006, filter: 0.647
  2. Proband: 0.000, father: 0.003, mother: 0.000, filter: 0.621
  3. Proband: 0.000, father: 0.014, mother: 0.000, filter: 0.125
  4. Proband: 0.011, father: 0.000, mother: 0.011, filter: 0.957
  5. Proband: 0.000, father: 0.001, mother: 0.000, filter: 0.135

I'm going to try performing error correction on the fastq files, and see if that makes any difference, however, it seems that the kevlar count outputs already have an extremely low FPR. Is it usual for the kevlar filter command to report a much higher FPR than the kevlar count data it uses? The documentation states that a much higher error rate of up to 0.05 would be acceptable, but that doesn't seem to be working for me.

standage commented 4 years ago

Is it usual for the kevlar filter command to report a much higher FPR than the kevlar count data it uses?

kevlar filter re-computes k-mer counts for k-mers marked as "novel" by kevlar novel. This step requires much less memory than the initial step, since we're looking at many fewer candidate k-mers. However, the default memory for kevlar filter is 1 Mb—try running it with --memory 1G, which should be more than enough.