Open rmcolq opened 5 years ago
went through exactly this with auto-cleaning for cortex. very hard to make a robust solution given how widely data can vary
Rediscovered why had this method of overriding neg bin model:
found mean 30.6201
found variance 30.8196
p: 0.993528 and r: 4700.38
Tue May 14 11:21:21 2019 Collect kmer probability distribution
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::domain_error> >'
what(): Error in function negative_binomial_distribution<double>::negative_binomial_distribution: Success fraction argument is 1.0085277557373047, but must be >= 0 and <= 1 !
/hps/nobackup2/iqbal/projects/pandora/klebs/holt_gastro/140519/pandora_work/5b/3dc907777f2ce30f32f9f2a7239eda/.command.sh: line 2: 40698 Aborted
Read file: /nfs/leia/research/iqbal/projects/pandora/klebs/holt_gastro/data/gff_files /2952_AJ033.30X.fa.gz
Currently pandora has two models allowed for kmer coverage distribution: negative binomial and binomial (approx poisson). Default is negative binomial.
In https://github.com/rmcolq/pandora/blob/7adf63ce60d28a000f7b6c850f4a5ebbbf2dd031/src/estimate_parameters.cpp#L235, if I find that the mean and variance of the kmer coverage distribution are almost identical, I force the poisson/binomial model.
However, this relies on finding the mean/first peak of the kmer coverage distribution by some hacky function which looks for the second peak after the one at 0. When we have non noisy data, this function doesn't work properly, and assumes we can fall back on default values for error rate. I have found a case where we couldn't:
Sample 2959_AJ064 has kmer coverage distribution which is too clean.
Not sure if this needs: a) better function to find second peak b) use mean of coverages > global_covg/10 as second peak as is used in neg bin c) ditch the override clause