Check for non-target primer matches

zachary-foster commented 1 year ago

This is not trivial for VCF-based input, since the sequences of each group are not known. It would be technically possible to reconstruct each group-specific sequence as diagnostic sites are being searched for and then use Primer-BLAST afterward to further filter the sites. bcftools consensus might also be able to make the group-specific sequences given the right input.

Could also just use Primer-BLAST on the reference sequence, but it might miss off-target matches in samples that differ from the reference.

zachary-foster commented 1 year ago

Note: Primer-BLAST does not seem to be locally installable; it is only available as a web interface, so we need another solution.

grunwald commented 1 year ago

Would this help: https://bioconda.github.io/recipes/primer3/README.html. Installation via conda or docker?

grunwald commented 1 year ago

See also: https://bioinformatics.stackexchange.com/questions/18580/can-primer-blast-be-run-locally NCBI and Primer3 would need to come to a joint agreement, which they must have done to present the webserver (at least it would be polite). You can of course replicate the pipeline yourself.

Brief details are here, https://www.ncbi.nlm.nih.gov/tools/primer-blast/primerinfo.html

It can be done via Docker. This exact pipeline hasn't been distributed via Docker but Blast is available as a Docker image.

New Answer Just to follow this up, primers4clades is a Docker image aimed at very similar objectives to Primer-Blast. I have not used it, but it's worth considering if standalone is essential, e.g. for ease of building a pipeline.

Here's the blurb

Primers4clades provides a fully automated pipeline for the design of PCR primers for cross-species amplification of novel sequences from metagenomic DNA or from uncharacterized organisms belonging to user-specified phylogenetic lineages. It implements an extended CODEHOP strategy based on both DNA and protein multiple alignments of coding genes and evaluates thermodynamic properties of the oligonucleotide pairs, as well as the phylogenetic information content of predicted amplicons, computed from the branch support values of maximum likelihood phylogenies.

I don't know the user, but Docker is a safe running environment that is isolated from the system. There's no harm in checking.

Finally both Blast and Primer3 are available via conda, but not Primer-Blast. In theory, I believe Primer-Blast could be distributed via conda under the rules of conda. In fact, I think, but could be wrong, that anyone could rebuild this pipeline and post it back to conda. @merv will know exactly whether this is 'permitable' and is very likely to correct me if I'm wrong.

zachary-foster commented 1 year ago

Thanks for looking into it! Yeah, I saw that post. I have not looked into primers4clades too much, but since it's a webserver app, it's probably not a good fit. Also, I would like to avoid a dependency that can only be installed via a Docker container. I was hoping for something that would be as easy to install and use as BLAST, but it looks like Primer-BLAST is also made to be a web server and is not available for local install. Anyway, even that is not a perfect solution, since we don't actually have the sequences from all the samples, just the variants, so it would be blasting against the reference, which might miss off-target matches depending on how different the reference is from the samples.

I see a few options for how to approach this:

1. Just BLAST the whole amplified region to the reference genome and filter out any results that have multiple hits of a given similarity. This would be quick and easy, but might not be very accurate for samples that diverge from the reference significantly.
2. Use bcftools consensus (https://samtools.github.io/bcftools/bcftools.html#consensus) to create a consensus sequence for each sample from the VCF and the reference and BLAST those. Would still need to optimize BLAST for that purpose the same way they did for Primer-BLAST. Could also generate another VCF that has a single "sample" representing each group by combining allele counts by sample, which would speed the process up a lot and allow for the quality filtering to be applied.
3. Same as 2, except modify the current code to create the group consensus sequence from the entire reference and filtered variants and stream it to a file instead of using bcftools. This would probably be the best solution, but it relies on my code to reconstruct the sequences working right.
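
As a rough illustration of option 3, the sketch below collapses per-sample allele counts into one allele per group and then applies a list of filtered variants to a reference sequence. It is not krisp's code and not what bcftools consensus does internally. The function names `majority_allele` and `apply_variants`, the `(pos, ref, alt)` tuple layout, and the per-sample count dictionaries are all assumptions made for the example.

```python
from collections import Counter


def majority_allele(sample_allele_counts):
    """Sum per-sample allele counts and return the most frequent allele.

    `sample_allele_counts` is a hypothetical list of dicts, one per sample
    in the group, mapping allele string -> read (or genotype) count.
    """
    total = Counter()
    for counts in sample_allele_counts:
        total.update(counts)  # adds counts allele by allele
    return total.most_common(1)[0][0]


def apply_variants(reference, variants):
    """Return `reference` with each (pos, ref, alt) variant applied.

    `pos` is 1-based, as in VCF. Variants are applied from the end of the
    sequence backwards so indels do not shift the coordinates of variants
    that come earlier. Overlapping variants are NOT handled in this sketch.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        start = pos - 1
        end = start + len(ref)
        # Guard against coordinates that do not match the reference.
        if seq[start:end].upper() != ref.upper():
            raise ValueError(
                f"REF allele {ref!r} does not match reference at position {pos}"
            )
        seq = seq[:start] + alt + seq[end:]
    return seq


# Toy usage: one SNV, one insertion, one deletion.
group_variants = [(2, "C", "T"), (5, "A", "AGG"), (7, "GT", "G")]
consensus = apply_variants("ACGTACGT", group_variants)
print(consensus)  # ATGTAGGCG
```

Rebuilding the string for every variant is fine for a toy example but would be slow on a whole genome; a real implementation would more likely stream reference chunks between variant positions to a file, as described in option 3.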

zachary-foster commented 1 year ago

Could perhaps use http://primegens.org/home_html/SSPD.html or a similar strategy. Could also just use primer3 on the whole reference template with the region of interest masked out and check for other matches, assuming primer3 can handle such large template sequences.
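
To make the "mask out the region of interest and check for other matches" idea concrete, here is a naive sketch that scans both strands of a template for near-matches to a primer while skipping the intended target interval. It does not use primer3 or BLAST and ignores thermodynamics and 3'-end weighting. The name `find_primer_sites`, the `max_mismatches` threshold, and the half-open `masked` interval are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    comp = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(comp)[::-1]


def find_primer_sites(template, primer, max_mismatches=2, masked=None):
    """Return a list of (start, strand, mismatches) for candidate binding sites.

    `start` is 0-based on the forward strand of `template`. `masked` is an
    optional (start, end) half-open interval, e.g. the intended amplicon;
    any window overlapping it is skipped so only off-target sites are reported.
    """
    template = template.upper()
    primer = primer.upper()
    hits = []
    for strand, query in (("+", primer), ("-", reverse_complement(primer))):
        n = len(query)
        for start in range(len(template) - n + 1):
            if masked and start < masked[1] and start + n > masked[0]:
                continue  # window overlaps the masked target region
            window = template[start:start + n]
            mismatches = sum(a != b for a, b in zip(window, query))
            if mismatches <= max_mismatches:
                hits.append((start, strand, mismatches))
    return hits


# Toy usage: the primer's true site is at 2..10; a one-mismatch copy sits at 15.
template = "TT" + "ACGTACCGG" + "AAAA" + "ACGTTCCGG" + "TT"
off_target = find_primer_sites(template, "ACGTACCGG", max_mismatches=1, masked=(2, 11))
print(off_target)  # [(15, '+', 1)]
```

This brute-force scan is quadratic in practice and would not scale to a full genome, which is why an indexed search such as BLAST is the more realistic choice for the real check.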

grunwaldlab / krisp

Check for non-target primer matches #4