iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks source link

Is `find_mean_covg' still appropriate in estimate parameters? #129

Open rmcolq opened 5 years ago

rmcolq commented 5 years ago

Currently pandora has two models allowed for kmer coverage distribution: negative binomial and binomial (approx poisson). Default is negative binomial.

In https://github.com/rmcolq/pandora/blob/7adf63ce60d28a000f7b6c850f4a5ebbbf2dd031/src/estimate_parameters.cpp#L235, if I find that the mean and variance of the kmer coverage distribution are almost identical, I force the poisson/binomial model.

However, this relies on finding the mean/first peak of the kmer coverage distribution by some hacky function which looks for the second peak after the one at 0. When we have non noisy data, this function doesn't work properly, and assumes we can fall back on default values for error rate. I have found a case where we couldn't:

bsub.py 2 pandora_compare_in_chunks_100519 nextflow run /nfs/leia/research/iqbal/rmcolq/git/DPhil_analysis/nextflow/pandora_compare_in_chunks.nf --pangenome_prg /nfs/leia/research/iqbal/projects/pandora/klebs/holt_gastro/prg/holt2015/combined_genes_igr10/pangenome_PRG.fa --read_index /nfs/leia/research/iqbal/projects/pandora/klebs/holt_gastro/data/gff_files/INDEX_30 --chunk_size 4000 --num_samples 151 --k 31 --w 19 --max_covg 50 --illumina -w pandora_work -c /nfs/leia/research/iqbal/rmcolq/git/DPhil_analysis/nextflow/nextflow.config -resume

Sample 2959_AJ064 has kmer coverage distribution which is too clean.

Not sure if this needs: a) better function to find second peak b) use mean of coverages > global_covg/10 as second peak as is used in neg bin c) ditch the override clause

iqbal-lab commented 5 years ago

went through exactly this with auto-cleaning for cortex. very hard to make a robust solution given how widely data can vary

rmcolq commented 5 years ago

Rediscovered why had this method of overriding neg bin model:

found mean 30.6201
found variance 30.8196
p: 0.993528 and r: 4700.38
Tue May 14 11:21:21 2019 Collect kmer probability distribution
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
  what():  Error in function negative_binomial_distribution<double>::negative_binomial_distribution: Success fraction argument is 1.0085277557373047, but must be >= 0 and <= 1 !
/hps/nobackup2/iqbal/projects/pandora/klebs/holt_gastro/140519/pandora_work/5b/3dc907777f2ce30f32f9f2a7239eda/.command.sh: line 2: 40698 Aborted 

Read file: /nfs/leia/research/iqbal/projects/pandora/klebs/holt_gastro/data/gff_files /2952_AJ033.30X.fa.gz