Hints on adding refcall POSITIONAL support to polyclone?

bredelings commented 2 years ago

Hi,

I was interested in looking at the code for the polyclone caller to see what it would take to make it output VCF records for every site. Would you be able to say a few words about how the calling process relates to the process of outputting VCF records? It seems (perhaps naively) like the calling process should generate the info that would go in the non-variant records whether it is printed or not.

So, looking at the code

it looks like the polyclone caller current returns {} from call_reference( ), wherase the individual caller returns more info.
it looks like the polyclone caller has the required fields in its Latents

So I'm curious if it you think it would be hard for me to fill out call_reference for the polyclone caller.

Thanks again for any help.

-BenRI

bredelings commented 2 years ago

As a side note, given the number of uses of std::transform( ), I wonder if using ranges-v3 could simplify a lot of code. It works with c++14, and it would allow using more readable idioms without requiring a flag day to convert more verbose idioms to ranges-v3. https://github.com/ericniebler/range-v3

As a caveat, it doesn't really correspond directly to what is in C++20. I see that there is a new repo called 'rangesnext' that assumes C++20, and implements some new features that are proposed for inclusion in C++23: https://github.com/cor3ntin/rangesnext

dancooke commented 2 years ago

If you just want basic reference calling functionality then you should be able to simply copy the implementation from IndividualCaller to PolycloneCaller. Note, however, that the model implemented is essentially diploid.

bredelings commented 2 years ago

Thanks, good to know! I was wondering how much I could use IndividualCaller implementation as a template.

(At the moment, I don't care if the probability that the homozygous-ness is correct, just the depth. I don't even need GVCF format, I just need records printed for non-variant sites.)

However, in the interest of doing a correct implementation -- is the assumption of diploidy just in the last few lines of compute_homozygous_posterior?

Also, is there a way of determining the local ploidy for a site in the polyclone caller?

dancooke commented 2 years ago

However, in the interest of doing a correct implementation -- is the assumption of diploidy just in the last few lines of compute_homozygous_posterior?

Correct.

Also, is there a way of determining the local ploidy for a site in the polyclone caller?

Yes, you can get the inferred local ploidy from the Latents object passed into PolycloneCaller::call_reference. Specifically, PolycloneCaller::Latents::model_log_posteriors_ gives the probability of clonality - if subclonality is indicated then you can get the inferred ploidy from polyploid_genotypes_.

bredelings commented 2 years ago

OK, so I started by copying over the code from individual caller: https://github.com/bredelings/octopus/commits/polyclone-refcall It required implementing PolycloneCaller::Latents::genotype_log_posteriors(), which I may have done correctly. I still need to access to local ploidy and compute the posterior correctly.

However... I tried using it, and no reference calls were shown. I tried compiling with -DCMAKE_BUILD_TYPE=RelWithDebInfo to investigate, but the compilation process took forever and ran out of RAM. I will try again with different flags later, but... I was wondering if you had a hint for (a) telling cmake to use -Og -g and not turn on the address sanitizer, and (b) why it might not be showing any reference sites.

dancooke commented 2 years ago

I just installed your fork and I do get reference calls from the polyclone caller. Are you sure you checked out your new branch polyclone-refcall after cloning your fork (default branch is develop)?

bredelings commented 2 years ago

Oops! I was indeed on the wrong branch somehow. I am seeing the reference calls now. So, the next thing is to pass the local ploidy down to compute_homozygous_posterior( ).

In the meantime, I'll note two problems that I've seen:

An assert triggers when running in debug mode:

octopus-debug: /home/bredelings/Devel/octopus/src/utils/concat.hpp:82: octopus::MappableBlock<T> octopus::concat(const octopus::MappableBlock<T>&, const octopus::MappableBlock<T>&) [with T = octopus::Genotype<octopus::IndexedHaplotype<> >]: Assertion `is_same_region(lhs, rhs)' failed.

The call to concat( ) is from the line auto genotypes = concat(haploid_genotypes_, polyploid_genotypes_); in PolycloneCaller::Latents::genotype_posteriors()

Apparently you can't supply the --refcall-filter-expression argument, because something is already supplying it:

$ ~/Devel/octopus/local/gcc-11/octopus -T LT635626 -I api.bam -R ../plasmo-combined.fasta -o test.g.vcf.gz --annotations AD -C polyclone --max-clones 2 --sequence-error-model PCR --threads 4 --refcall-filter-expression 'QUAL < 2 | GQ < 20 | MQ < 10 | DP < 10 | MF > 0.2' --refcall POSITIONAL
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------
[2021-09-30 10:03:36] <INFO> octopus v0.7.4 (polyclone-refcall ec838e75)
[2021-09-30 10:03:36] <INFO> Copyright (c) 2015-2021 University of Oxford
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------
[2021-09-30 10:03:36] <EROR> An unclassified error has occurred:
[2021-09-30 10:03:36] <EROR> 
[2021-09-30 10:03:36] <EROR>     Option '--refcall-filter-expression' cannot be specified more than
[2021-09-30 10:03:36] <EROR>     once.
[2021-09-30 10:03:36] <EROR> 
[2021-09-30 10:03:36] <EROR> To help resolve this error submit an error report.
[2021-09-30 10:03:36] <INFO> ------------------------------------------------------------------------

bredelings commented 2 years ago

Hi,

I have some questions about the logic in compute_homozygous_posterior( ). This applies to the code in the individual caller as well.

So, in this code:

        auto het_alt_ln_likelihood = std::inner_product(std::cbegin(reference_ln_likelihoods), std::cend(reference_ln_likelihoods),
                                                        std::cbegin(non_reference_ln_likelihoods), 0.0, std::plus<> {},
                                                        [] (auto ref, auto alt) { return maths::log_sum_exp(ref, alt) - std::log(2); });

it looks like maths::log_sum_exp(ref, alt) is summing Pr(ref|data) + Pr(not ref|data) on the log scale. But Pr(ref|data)+Pr(not ref|data) will always equal 1.0. I guess you would like Pr(data|ref) + Pr(data|alt)... but it is less clear how to get that from the PHRED score. But I am not totally sure I am following this correctly.

To double check this reasoning, I ran the code in debug mode, and I get:

reference_qualities[0] = 1
reference_ln_likelihoods[0] = -1.5814737534084538
non_reference_ln_likelihoods[0] = -0.23025850929940456

And indeed exp(-1.5814737534084538) + exp(-0.23025850929940456) = 1.0

-BenRI

P.S. I note that you are not using log1p(x) to implement log_sum_exp( ), but I'm not sure if that's by design. This is a nice function since it avoids losing precision when x is very small.

bredelings commented 2 years ago

I was also thinking that it might be easier to read the code if you implemented a class that stores the log of a value, and then defined operator+, operator*, etc.. Then you could write p1+p2 instead of maths::log_sum_exp(p1,p2).

dancooke commented 2 years ago

In the meantime, I'll note two problems that I've seen:

An assert triggers when running in debug mode:

Thanks - I'll see if I can reproduce.

Apparently you can't supply the --refcall-filter-expression argument, because something is already supplying it:

I think this may be an issue with Boost; it appears Boost incorrectly recognises --refcall as shorthand for --refcall-filter-expression thus counting the latter twice. Will have a think on what to do.

it looks like maths::log_sum_exp(ref, alt) is summing Pr(ref|data) + Pr(not ref|data) on the log scale. But Pr(ref|data)+Pr(not ref|data) will always equal 1.0. I guess you would like Pr(data|ref) + Pr(data|alt)... but it is less clear how to get that from the PHRED score.

The calculation is the genotype log likelihood ln p(R | g) - it essentially implements the log diploid form of equation 2 from Heng Li's paper. This will, of course, simplify to -N*log(2) for N reads in the diploid heterozygous case. I think I probably didn't implement this to make it easy to extend for non-diploid cases.

P.S. I note that you are not using log1p(x) to implement log_sum_exp( ), but I'm not sure if that's by design. This is a nice function since it avoids losing precision when x is very small.

Yup, std::log1p is the better option here. Changed in 27b48ff8adcc6ffc6fbf57ec503e45cbc4a1c224. Thanks.

I was also thinking that it might be easier to read the code if you implemented a class that stores the log of a value, and then defined operator+, operator*, etc.. Then you could write p1+p2 instead of maths::log_sum_exp(p1,p2).

It's an interesting idea but I think it actually makes things a lot more complicated and difficult to read. How is the mixture factor included, (p1 + p2) / 2 or p1 + p2 - log(2)? The latter is inconsistent since p1 + p2 + (-log(2)) has a different meaning. The former requires implementing a full symbolic algebra, including operator chaining (e.g. p1 + p2 + p3), and we still end up with confusing statements like L = (p1 + p2) / 2 where L, p1, and p2 are all Log types.

bredelings commented 2 years ago

The calculation is the genotype log likelihood ln p(R | g) - it essentially implements the log diploid form of equation 2 from Heng Li's paper. This will, of course, simplify to -N*log(2) for N reads in the diploid heterozygous case. I think I probably didn't implement this to make it easy to extend for non-diploid cases.

Hmm.... I was deriving this from scratch. I guess this is a valid scaling of the likelihoods and not an error. I see that Heng Li decided to ignore bases that are not ref and alt... that does simplify the transformation from PHRED to likelihoods, but it seems to drop a factor of (1/3) by assuming that any non-ref base is equal to alt. Perhaps that is tradition at this point...

In any case, I probably see how to implement this for polyclone now. Thanks.

It's an interesting idea but I think it actually makes things a lot more complicated and difficult to read. How is the mixture factor included, (p1 + p2) / 2 or p1 + p2 - log(2)? The latter is inconsistent since p1 + p2 + (-log(2)) has a different meaning. The former requires implementing a full symbolic algebra, including operator chaining (e.g. p1 + p2 + p3), and we still end up with confusing statements like L = (p1 + p2) / 2 where L, p1, and p2 are all Log types.

I have been using this in my own code for a while, and it works fairly well. Any operations on a Log type and a double return a Log type. So L = (p1 + p2)/2 automatically converts 2 to a Log type.

My code for this is here: https://github.com/bredelings/BAli-Phy/blob/master/src/util/include/util/math/log-double.H I just committed some cleanups, so this version is tested, but not heavily tested.

dancooke commented 2 years ago

I fixed the error when specifying --refcall. I also changed --refcall into a flag so it no longer accepts the BLOCKED or POSITIONAL argument - this is now controlled entirely by --refcall-block-merge-quality (--refcall-block-merge-quality=0 implies POSITIONAL).

bredelings commented 2 years ago

Thanks!

bredelings commented 2 years ago

I had a little bit of time today to move forward on the math for this.

I. Since you don't have estimates of the allele frequency, it looks like for the diploid case the code in individual_caller.cpp assumes that the prior probabilities Pr(homozygous ref) = Pr(heterozygous ref + alt). In theory you could have

Pr(ref|ref) = 1/4
Pr(ref|alt) = 1/4
Pr(alt|ref) = 1/4
Pr(alt|alt) = 1/4

But instead it looks like Pr(ref|ref) = Pr(ref|alt)+Pr(alt|ref). Does this sound correct?

II. To generalize this to ploidy p, I see two routes:

Pr(genotype) = (1/2)^p
Pr(number of reference alleles = r) = (1/(p+1)), and Pr(genotype with r ref alleles) = (1/(p+1)) * (1/choose(p,r))

The first approach weights pretty heavily against homozygosity (as ploidy increases) by assuming that the allele frequency is (1/2).

The second approach assigns each number of reference alleles an equal weight, which is hopefully similar to placing a uniform prior on allele frequency. This avoids weighting too heavily against homozygosity. It also matches what you did for diploids. So, I like this approach.

bredelings commented 2 years ago

It looks like, for diploids, you can consider the genotypes ref|ref and ref/alt, while ignoring alt|alt.

However, for polyploids, it looks like we can ignore homozygous alt, but not any of the other genotypes. So, we have to consider (2^p)-1 genotypes.

For the polyploid case (i.e. every haplotype has frequency 1/p) this can be done with a sum over p+1 terms.

I haven't thought yet about whether we can handle the polyclone case (i.e. unequal frequencies) without summing over 2^p terms...

luntergroup / octopus

Hints on adding refcall POSITIONAL support to polyclone? #206