luntergroup / octopus

Bayesian haplotype-based mutation calling
MIT License

polyclone haplotype relative abundance #127

Closed: slhogle closed this issue 4 years ago

slhogle commented 4 years ago

Hi,

Thanks for the nice software. I've been using Octopus for microbial samples in polyclone mode so far with good success. I have two questions that I couldn't find an answer for either in the preprint or here in the documentation.

  1. In polyclone mode, is it possible to estimate relative abundances of haplotypes (i.e. strains) from the data? For example, for three haplotypes A, B, and C, could you say something like A is 60%, B is 10%, and C is 30%? I see that you can get the allele frequency (AF field) in the VCF for each allele, but I was wondering if there was a way to summarize this for the inferred haplotypes.

  2. My samples are longitudinal, from an experimental evolution timecourse, so I expect to see temporal correlation between allele frequencies. Is it possible to call variants on multiple samples simultaneously and to use the temporal nature of the dataset to inform the variant calls?

Thanks for your help! -shane

bredelings commented 4 years ago

I'd be interested in this as well. We are studying reads from malaria, and it's possible for a person to be infected by multiple strains so that the frequencies in the blood are e.g. 60%, 30%, 10%. In this case, it would be very useful for us to get out the mixture frequencies.

When I read the documentation, it sounded like octopus was using mixture proportions like this internally. However, I wonder if octopus actually constructs global mixture proportions.

Note that it might be possible to use something like DEploid to deconvolute your reads after running octopus (https://github.com/DEploid-dev/DEploid). If it were possible to use octopus alone to determine that the highest-frequency component has a frequency > 80%, I would be quite happy though.

dancooke commented 4 years ago

> In polyclone mode, is it possible to estimate relative abundances of haplotypes (i.e. strains) from the data? For example, for three haplotypes A, B, and C, could you say something like A is 60%, B is 10%, and C is 30%? I see that you can get the allele frequency (AF field) in the VCF for each allele, but I was wondering if there was a way to summarize this for the inferred haplotypes.

Octopus does model sample haplotype mixture proportions for calling; the model is actually the same as the one used for tumour calling (i.e. the cancer calling model). Currently, the cancer calling model reports haplotype frequency inferences for some variants, but the polyclone calling model doesn't. I'll leave this issue open as a feature request.

If you want empirical frequency estimates, one thing you can do is use the --bamout feature and count reads by their haplotype assignment tag (HP).
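A minimal sketch of the HP-tag counting idea, for readers who want to try it. It is not part of Octopus. It assumes the realigned reads have been dumped as SAM text (for example with `samtools view` over a single called region), that each assigned read carries one optional field named `HP`, and that a comma-separated `HP` value means the read is ambiguous between haplotypes. The exact type and content of the `HP` field, and the function name `haplotype_read_fractions`, are assumptions made for illustration, so check them against your own bamout file.

```python
from collections import Counter
from typing import Dict, Iterable


def haplotype_read_fractions(sam_lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of uniquely assigned reads per HP value.

    Assumption: HP identifiers are only comparable within one region,
    so pass in the reads of a single region at a time.
    """
    counts: Counter = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Optional TAG:TYPE:VALUE fields start after the 11 mandatory columns.
        for field in fields[11:]:
            parts = field.split(":", 2)
            if len(parts) == 3 and parts[0] == "HP":
                value = parts[2]
                # Assumption: a comma means several candidate haplotypes;
                # such ambiguous reads are left out of the estimate.
                if "," not in value:
                    counts[value] += 1
                break
    total = sum(counts.values())
    if total == 0:
        return {}
    return {hap: n / total for hap, n in counts.items()}
```

Read counts give only a rough empirical estimate: reads that do not overlap a variant cannot be assigned, and coverage varies along the genome, so fractions from different regions will not agree exactly.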

> My samples are longitudinal, from an experimental evolution timecourse, so I expect to see temporal correlation between allele frequencies. Is it possible to call variants on multiple samples simultaneously and to use the temporal nature of the dataset to inform the variant calls?

The polyclone model doesn't currently support joint calling. The underlying genotype model does allow for multiple samples, but I'm not sure the prior is appropriate for longitudinal data. Essentially, the model allows a single genotype shared by all samples but independent haplotype mixture proportions for each sample. The cancer calling model leverages this for multi-sample bulk tumour data, but it probably won't fit well for other types of data where less correlation between samples is expected; ideally each sample should be allowed to have its own genotype, with some sort of phylogeny-aware prior.
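To make the "one shared genotype, independent mixture proportions" structure concrete, here is a toy illustration. It is not Octopus code and does no inference; it only shows how the same set of haplotypes can give different expected allele frequencies in each sample. The haplotypes, proportions, and the function name `expected_allele_frequencies` are made up for the example.

```python
from typing import List, Sequence


def expected_allele_frequencies(
    haplotypes: Sequence[Sequence[int]],
    proportions: Sequence[float],
) -> List[float]:
    """Expected alternate-allele frequency at each site for one sample.

    haplotypes:  one 0/1 vector per haplotype (1 = carries the alt allele).
    proportions: mixture proportion of each haplotype in this sample.
    """
    if len(haplotypes) != len(proportions):
        raise ValueError("need one proportion per haplotype")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    n_sites = len(haplotypes[0])
    # A site's allele frequency is the total proportion of the
    # haplotypes that carry the alternate allele at that site.
    return [
        sum(p for hap, p in zip(haplotypes, proportions) if hap[site] == 1)
        for site in range(n_sites)
    ]


# Shared genotype: the same three haplotypes (A, B, C) in every sample.
shared_haplotypes = [(0, 0, 1), (1, 0, 1), (0, 1, 0)]

# Independent mixture proportions per sample (two hypothetical time points).
sample_proportions = {
    "t0": [0.6, 0.1, 0.3],
    "t1": [0.2, 0.5, 0.3],
}

for name, props in sample_proportions.items():
    print(name, expected_allele_frequencies(shared_haplotypes, props))
```

Nothing in this structure links the proportions at `t0` to those at `t1`, and no sample can gain a haplotype that the others lack, which is the limitation for longitudinal data described above.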

slhogle commented 4 years ago

Thanks for the quick response. I figured from #12 that joint calling in this context would be complex, especially for repeated-measures data. Thought I would ask anyway...

The MAP_HF and HF_CR statistics would be very nice to have for the polyclone model. I appreciate you considering the feature request.

thanks, -shane

dancooke commented 4 years ago

@slhogle HPC and MAP_HF annotations added to non-haploid polyclone calls in e04bc66c2718acc0b94a4f63f09453812f0c4dc9.