iqbal-lab / Mykrobe-predictor

Antibiotic resistance predictions in minutes on a laptop
Other
50 stars 19 forks source link

Suspicious read counts #8

Closed iqbal-lab closed 9 years ago

iqbal-lab commented 10 years ago

Running Mykrobe on this sample, sequenced to >400x, gives precisely 100 on both R and S alleles. Suspicious?

pwd /Net/banyan/data0/users/zam/projects/piddock/data/Piddock_F145$

command: ~/dev/git/myKrobe-predictor/bin/Mykrobe.predictor.staph --file WTCHG_141363_25.bam --install_dir ~/dev/git/myKrobe-predictor --method InSilicoOligos myKrobe.predictor for Staphylococcus, version 0.0.0.1 Command: /home/zam/dev/git/myKrobe-predictor/bin/Mykrobe.predictor.staph --file WTCHG_141363_25.bam --install_dir /home/zam/dev/git/myKrobe-predictor --method InSilicoOligos Wed Sep 10 01:15:49 2014 Wed Sep 10 01:15:49 2014 * Start time Wed Sep 10 01:15:49 2014 * Sample: UnknownSample * Species S.aureus Wed Sep 10 01:21:36 2014 * Antimicrobial susceptibility predictions Gentamicin S Penicillin S Methicillin S Trimethoprim S Erythromycin S FusidicAcid r Ciprofloxacin S Rifampicin N Tetracycline S Vancomycin S Mupirocin S Clindamycin S * Called Variants var R_cov S_cov fusA_P404Q 100 100 <<<<<<<<<<< * Called Genes var cov \ Virulence markers PVL negative Wed Sep 10 01:21:36 2014

Phelimb commented 10 years ago

cd ls /data0/users/zam/projects/piddock/cortex/results/binaries/cleaned/k31/F145.kmer31.q10cleaned_29.ctx > blist ls blist > clist ls ~/myKrobe-predictor/data/staph/antibiotics/*.fa > panel ~/tools/CORTEX/bin/cortex_var_31_c1 --colour_list clist --kmer_size 31 --mem_height 20 --mem_width 250 --align panel,no --align_input_format LIST_OF_FASTA --max_read_len 10000

ref_P404Q_fusA_1120520.ref_sub-2-fusA AATGACATTATCTTGGAATCAATGGAATTCCCAGAGCCAGTTATTCACTTATCAGTAGAGCCA ref_P404Q_fusA_1120520.ref_sub-2-fusA_colour_0_kmer_coverages 347 345 347 346 342 344 342 344 341 338 333 338 341 339 341 341 337 339 337 339 340 339 334 332 334 334 335 333 325 325 321 320 319

No sign of coverage on an alternate. Will step through in gdb.

Phelimb commented 10 years ago

If we looked at the best sus or the best res on it's own we would call S and R respectively...

Why are the median coverages so low even with 100 percent_nonzero?

(gdb) p abi->vars[i]->vob_best_sus->susceptible_allele {median_covg = 1, median_covg_on_nonzero_nodes = 0, min_covg = 1, percent_nonzero = 100} (gdb) p abi->vars[i]->vob_best_sus->resistant_alleles {{median_covg = 1, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 56}, {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 25}, {median_covg = 0, \ median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 0} <repeats 58 times>}

(gdb) p abi->vars[i]->vob_best_res->susceptible_allele {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 25} (gdb) p abi->vars[i]->vob_best_res->resistant_alleles {{median_covg = 2, median_covg_on_nonzero_nodes = 0, min_covg = 2, percent_nonzero = 100}, {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 37}, {median_covg = 0,\ median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 0} <repeats 58 times>}

Phelimb commented 10 years ago

IN genotyping_known.c:505 :

B Var* var_to_update=array_vars[km]; vob->var_id=km; if (vob->var_id!= prev_mut) { //this is a new enum/mutation
reset_var_on_background(vob);
prev_mut=km; //for use in the next call to this function
} vob->gene=g;

=>vob->num_resistant_alleles=r;

If you reset the vob you also reset the var_id so you end up with something like this:

(gdb) p *vob $48 = {susceptible_allele = {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 0}, resistant_alleles = {{median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, \ percent_nonzero = 0} <repeats 60 times>}, num_resistant_alleles = 0, some_resistant_allele_present = 0 '\000', working_current_max_res_allele_present = 0, var_id = dfrB_L21V, gene = fusA}

typedef enum { dfrB_L21V = 0, ...

Going to make two changes.

  1. reset_var_on_background should reset var_id to NotSpecified not 0.
  2. if (vob->var_id!= prev_mut) -> if (km != prev_mut) and then set var_id AFTER we reset_var_on_background.
Phelimb commented 10 years ago

Turning back on error cleaning gives us a minor resistant call which we didn't make with error cleaning off...

** Called Variants var R_cov S_cov dfrB_H31N 100 100

(gdb) p *abi.vars[dfrB_H31N]->vob_best_sus $22 = {susceptible_allele = {median_covg = 580, median_covg_on_nonzero_nodes = 0, min_covg = 564, percent_nonzero = 100}, resistant_alleles = {{median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg\ = 0, percent_nonzero = 25}, {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 0} <repeats 59 times>}, num_resistant_alleles = 2, some_resistant_allele_present = 0 '\000\ ', working_current_max_res_allele_present = 25, var_id = dfrB_H31N, gene = dfrB}

(gdb) p *abi.vars[dfrB_H31N]->vob_best_res $23 = {susceptible_allele = {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 18}, resistant_alleles = {{median_covg = 26, median_covg_on_nonzero_nodes = 0, min_covg = 4\ , percent_nonzero = 100}, {median_covg = 5, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 68}, {median_covg = 0, median_covg_on_nonzero_nodes = 0, min_covg = 0, percent_nonzero = 0} <\ repeats 58 times>}, num_resistant_alleles = 2, some_resistant_allele_present = 1 '\001', working_current_max_res_allele_present = 100, var_id = dfrB_H31N, gene = dfrB} A 100 percent coverage allele with median covg 23 min 4 looks real to me but I don't understand why we didn't call with error cleaning off. It's also not present in the --align output.

Phelimb commented 10 years ago

The above also doesn't occur with WGAssemblyThenGenotyping:

https://www.dropbox.com/s/tjye5cu5ty1hlab/Screenshot%202014-09-10%2018.19.47.png?dl=0

iqbal-lab commented 10 years ago

I don't think we should be turning error-cleaning back on though

iqbal-lab commented 10 years ago

Looks like you have fixed the issue, right? Have you checked it does not regress the results?

Phelimb commented 10 years ago

I haven't fixed this issue.

It seems like the problem is that we don't use median coverage when deciding on which VOB to keep in Var. Using only percent_non_zero seems like it allows these extremely low coverage alleles to be stored (so long as they have high % coverage) and then when deciding on resistotype we can end up comparing two very low (median) coverage alleles.

This also explains why turning back on error cleaning 'fixes' the problem.

Phelimb commented 10 years ago

Possible solution could be to just replace the : if new_vob.per_cov > current_var.vob.per_cov update_var()

with if new_vob.per_cov >= current_var.vob.per_cov: if new_vob.median_cov > current_var.vob.median_cov: updatevar()

?

iqbal-lab commented 10 years ago

Agreed. make it so!