polio-nanopore / piranha

GNU General Public License v3.0

Whole genome reference fasta missing viruses #111

Open ammaraziz opened 1 year ago

ammaraziz commented 1 year ago

Hi Aine,

I hope you don't mind me jumping in here. This is a fantastic pipeline, and we're looking at incorporating both the wetlab protocols and this pipeline into our system.

While looking at the references, I noticed the whole genome fasta reference is missing many species that are present in the VP1 reference. For example, CoxsackievirusB3 is missing from the wg reference. I'm not sure whether it makes sense to include all VP1 viruses in the whole genome reference, as I'm not an expert on polio.

Also, sorry to be nitpicky, but the species names differ between the two files, e.g. CoxsackieA1 vs CoxsackievirusA1.

Regards,

Ammar
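Both points raised above (species present in the VP1 set but absent from the whole genome set, and inconsistent spellings of the same species) can be checked mechanically. The following is a minimal sketch, not part of piranha. It assumes the species label is the first whitespace-delimited token of each FASTA header, cut at the first underscore or pipe, and it treats names that differ only by the substring "virus" as the same species. The helper names and the placeholder accessions in the example are hypothetical.

```python
import re


def read_labels(fasta_text):
    """Return the set of species labels found in the FASTA header lines."""
    labels = set()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        parts = line[1:].split()
        if not parts:
            continue  # skip an empty header line
        # Assumption: the label is the first token, cut at the first "_" or "|"
        labels.add(re.split(r"[_|]", parts[0])[0])
    return labels


def normalise(label):
    """Collapse spelling variants such as CoxsackieA1 / CoxsackievirusA1."""
    return re.sub(r"virus", "", label, flags=re.IGNORECASE).lower()


def missing_from_wg(vp1_text, wg_text):
    """List VP1 labels with no counterpart in the whole genome set."""
    wg = {normalise(label) for label in read_labels(wg_text)}
    return sorted(
        label for label in read_labels(vp1_text) if normalise(label) not in wg
    )


# Toy input with placeholder accessions, not real records
vp1 = ">CoxsackievirusA1_ACC0001\nACGT\n>CoxsackievirusB3_ACC0002\nACGT\n"
wg = ">CoxsackieA1_ACC0003\nACGT\n"
print(missing_from_wg(vp1, wg))
```

On the toy input, CoxsackievirusA1 is matched to CoxsackieA1 despite the spelling difference, so only CoxsackievirusB3 is reported as missing. The header convention would need adjusting to whatever the real reference files use.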

aineniamh commented 1 year ago

Hi @ammaraziz, thanks for your great feedback and happy to have you trialling piranha!

Both reference sets were compiled by @AShaw1802 and @CatherineTroman, and I believe the discrepancy is due to greater availability of publicly available VP1 sequences than whole genome sequences, but they may be able to clarify!

And you're completely right, consistency is definitely preferred! In general I've been focusing on developing piranha with the VP1 protocol in mind, and I intend to further address the whole genome options going forward.

Almost certainly both reference sets could do with an update as they were put together a couple of years ago now!

ammaraziz commented 1 year ago

Thanks @aineniamh, I appreciate the work that goes into collating reference sequences. I will try to contribute back to the project with a curated set of references when I have time.

On the topic of references, the A12 'VP1' should be removed from the VP1 reference set. https://www.ncbi.nlm.nih.gov/nuccore/AB126210

I think it contains mostly 5' UTR and some VP4. It was giving odd BLAST results, but aligning it to another A12 genome identifies its true sequence.

Thanks,

Ammar

aineniamh commented 1 year ago

Ah, I can see that in the phylogeny, you're completely right. I'll remove that reference now!

ammaraziz commented 10 months ago

I've created a whole genome reference set based on the VP1 reference and formatted the headers to run with the pipeline. I took the VP1 accession numbers, downloaded the sequences and filtered for near-full-length genomes. There are one or two short sequences. entero_refset_wg_v1.fasta.zip
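The length-filtering step described above could look something like the sketch below. It is not the script that was actually used: the 7,000 nt cut-off is an assumption, chosen because enterovirus genomes are roughly 7,400 nucleotides long, and the function names are hypothetical. Short records are returned separately rather than discarded, so they can be reviewed by hand.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumed threshold for "near full length"; adjust to taste
MIN_LENGTH = 7000


def split_by_length(records, min_length=MIN_LENGTH):
    """Separate records into near-full-length and short sequences."""
    kept, short = [], []
    for header, seq in records:
        (kept if len(seq) >= min_length else short).append((header, seq))
    return kept, short


# Synthetic sequences for illustration only
example = ">full\n" + "A" * 7400 + "\n>partial\n" + "A" * 900 + "\n"
kept, short = split_by_length(read_fasta(example))
print([h for h, _ in kept], [h for h, _ in short])
```

A record such as the A12 entry discussed earlier would not necessarily be caught by a length check alone, so alignment against a known genome remains the more reliable test for mislabelled regions.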

aineniamh commented 10 months ago

Ah, that looks great! I'll look into amalgamating them with our current reference set. We have more representatives of WPV1, I believe, but will investigate!

ammaraziz commented 10 months ago

Sorry for prematurely closing the issue. If you do update the refset, please let me know. I'd be keen to know what sequences you'd add, especially for non-polio viruses.