tseemann / snippy

:scissors: :zap: Rapid haploid variant calling and core genome alignment
GNU General Public License v2.0
471 stars 115 forks source link

Freebayes --pooled-continuous option? #311

Closed Buuntu closed 5 years ago

Buuntu commented 5 years ago

Is there an option to run freebayes with the --pooled-continuous option, and if not, would it be doable/easy to add in a newer version?

tseemann commented 5 years ago

@Buuntu i thought the --pooled-* options were for when you perform multi-sample joint variant calling?

snippy calls one sample at a time and merges them independently. this was a desigh decision to make incremental tree building faster.

But if you want, the help shows this:

  --fbopt F        Extra Freebayes options, eg. --theta 1E-6 --read-snp-limit 2 (default '')

So just use --fbopt '--pooled-continuous' and let me know if it works. The quotes are important!

Buuntu commented 5 years ago

Thanks @tseemann! I'll try the --fbopt option. My understanding is that --pooled-continuous is for when you do population level sequencing (ie not a single isolate). So you may not know how many biological samples you're really sequencing in one go. It allows you to call SNPs based on allele frequencies rather than considering it as a single biological sample.

Buuntu commented 5 years ago

@tseemann While it does seem to be passing that option on to freebayes, I think the other hardcoded settings may interfere, such as -p 2. I don't know which is taking precedence in this case since the two are not meant to be used together, my guess is the ploidy setting is.

Why is ploidy set to 2 and not 1 by the way? Looking at the vcf file it generates the command looks like this:

freebayes -p 2 -P 0 -C 10 --min-repeat-entropy 1.5 --strict-vcf -q 13 -m 60 --min-coverage 10 -F 0.05 --pooled-continuous -f reference/ref.fa snps.bam --region contig_1_pilon:0-123479
tseemann commented 5 years ago

I use -p 2 for a good reason. If you use -p 1 then mismapping reads will cause false +ve SNP calls. If you assume haploid, and you get REF=70xA and ALT=30xT then freebayes is going to call ALT, as it certainly isn't REF under a haploid model, but we've given it no chance to say it's mixed. Diploid gives this chance. Some SNP callers for bacteria (eg. nesoni) used even higher ploidy to help here.I learnt this lesson the hard way in Snippy 3.x

Snippy is just a dumb wrapper around bwa and freebayes. It is not much work for you to create your own simple pipeline to use the settings you want. This recent blog post of mine may inspire you :-)

http://thegenomefactory.blogspot.com/2018/10/a-unix-one-liner-to-call-bacterial.html

Buuntu commented 5 years ago

Thanks for the clarification @tseemann!

You are right that I can just make my own pipeline, but I like how easy to use your tools are (and cover most of my use cases). Will go ahead and create my own pipeline for these edge cases.

rspreafico-vir commented 4 years ago

Hi Torsten, a follow-up question on this that likely does not deserve its own issue: what parameters would you recommend to pass to freebayes if I were to use Snippy for calling variants from multiple samples, each being a mixture of viral sequences (1% frequency threshold, often 5000x average coverage)?

I think I am certain about --pooled-continuous but have doubts around --pooled-discrete, --ploidy, whether to keep priors, and maybe some other minor parameters (whether to limit search to best n alleles without losing sensitivity given 1% frequency threshold, whether to zero out haplotype length).

Thanks for any recommendation!

tseemann commented 4 years ago

@rspreafico-vir you should not use Snippy for calling variants in mixtures. it's for halpoid "pure, single colony" isolates only.

rspreafico-vir commented 4 years ago

Gotcha, thanks! I was hoping because of the ability to pass the --pooled-continuous parameter. Will not use Snippy, but if you have ever looked into calling variants from mixtures with freebayes and have any suggestion, I would only be happy to learn :-) Thank you for your prompt response!

Buuntu commented 4 years ago

@rspreafico-vir I would recommend you take a look at Breseq. Here's a tutorial on how to do it for mixed samples: https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/tutorial_populations.html.

It has a conda recipe, is easy to run, and generates really nice HTML output. It will tell you all of the mutations and the % of the population they are in, and even look for larger indels.

rspreafico-vir commented 4 years ago

That's awesome, thanks so much for your recommendation!

tseemann commented 4 years ago

@rspreafico-vir it's also in brew. i wrote the formula for it. just be aware it only supports single end reads.