Open ammaraziz opened 1 year ago
Hi @ammaraziz, thanks for your great feedback and happy to have you trialling piranha!
Both reference sets were compiled by @AShaw1802 and @CatherineTroman, and I believe the discrepancy is due to there being more publicly available VP1 sequences than whole genome sequences, but they may be able to clarify!
And you're completely right, consistency is definitely preferred! In general I've been focusing on developing piranha with the VP1 protocol in mind and intend to further address the whole genome options going forward.
Almost certainly both reference sets could do with an update as they were put together a couple of years ago now!
Thanks @aineniamh appreciate the work that goes into collating reference sequences. I will try to contribute back to the project with a curated set of references when I have time.
On the topic of references, the A12 'VP1' should be removed from the VP1 reference set. https://www.ncbi.nlm.nih.gov/nuccore/AB126210
I think it contains mostly 5' UTR and some VP4. It was giving odd BLAST results, but aligning it to another A12 genome shows which region it actually covers.
Thanks,
Ammar
Ah, I can see that there in the phylogeny, you're completely right. I'll remove that reference now!
I've created a whole genome reference set based on the VP1 reference and formatted the headers to run with the pipeline. I took the VP1 accession numbers, downloaded the sequences and filtered for near-full-length genomes. There are one or two short sequences. entero_refset_wg_v1.fasta.zip
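For anyone wanting to repeat the length-filtering step, here is a minimal sketch in plain Python. It is not the script used to build the attached file: the function names and the 7000 nt cutoff are assumptions for illustration (enterovirus genomes are roughly 7.5 kb), and it makes no assumption about the header format beyond the leading `>`.

```python
import io
from typing import Iterable, Iterator, List, TextIO, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_near_full_length(records: Iterable[Record],
                            min_length: int = 7000) -> List[Record]:
    """Keep records whose sequence is at least min_length nucleotides.

    min_length=7000 is a hypothetical cutoff, not a value from this thread.
    """
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    # Tiny in-memory example: one short record, one long record.
    demo = io.StringIO(">short\nACGT\nAC\n>long\n" + "A" * 7000 + "\n")
    kept = filter_near_full_length(read_fasta(demo))
    print([header for header, _ in kept])
```

Lowering `min_length` would keep the one or two shorter sequences mentioned above if they are wanted in the set.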
Ah that looks great! I'll look into amalgamating them with our current reference set- we have more representatives of WPV1 I believe, but will investigate!
Sorry for prematurely closing the issue. If you do update the refset, please let me know. I'd be keen to know what sequences you'd add, especially for non-polio viruses.
Hi Aine,
I hope you don't mind me jumping in here. This is a fantastic pipeline, we're looking at incorporating both the wetlab protocols and this pipeline into our system.
While looking at the references, I noticed the whole genome fasta reference is missing many species that are present in the VP1 reference. For example, CoxsackievirusB3 is missing from the wg reference. I'm not sure if it makes sense to include all VP1 viruses in the whole genome reference, as I'm not an expert on polio.
Also, sorry to be nitpicky, but the species names differ between the two files, e.g. CoxsackieA1 vs CoxsackievirusA1.
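A rough sketch of how the two name lists could be compared once the species labels have been pulled out of the headers. The helper names are hypothetical, the normalisation rule only covers the Coxsackie/Coxsackievirus case mentioned here, and the example lists are illustrative rather than taken from the actual files.

```python
import re
from typing import Iterable, Set


def normalise_name(name: str) -> str:
    """Lowercase a species label and expand 'CoxsackieA1' to 'coxsackievirusa1'.

    Only handles the one inconsistency noted in this thread (an assumption);
    other naming differences would need their own rules.
    """
    lowered = name.strip().lower()
    return re.sub(r"^coxsackie(?=[ab]\d)", "coxsackievirus", lowered)


def missing_species(vp1_names: Iterable[str], wg_names: Iterable[str]) -> Set[str]:
    """Return normalised labels found in the VP1 list but not in the wg list."""
    vp1 = {normalise_name(n) for n in vp1_names}
    wg = {normalise_name(n) for n in wg_names}
    return vp1 - wg


if __name__ == "__main__":
    # Illustrative labels only, not the real contents of either reference file.
    vp1_labels = ["CoxsackievirusA1", "CoxsackievirusB3"]
    wg_labels = ["CoxsackieA1"]
    print(sorted(missing_species(vp1_labels, wg_labels)))
```

With the names normalised first, A1 is treated as present in both lists and only B3 is reported as missing.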
Regards,
Ammar