Open peterk87 opened 2 years ago
Thanks for reporting @peterk87! Not sure what the best solution is here because by default the pipeline will and has always used the same reference fasta and annotation for reproducibility. It is up to the users to overwrite this configuration if required. Either we go low maintenance and add in a warning to ask users to point to their annotation or we deal with it internally somehow. Thoughts?
Ping @saramonzon.
yes..we've encountered this issue too, it's not only the orf1b gene, the case is that the NC_045512.2 fasta from refseq and the MN908947.3 fasta from genebank are exactly the same, but the gff from refseq is more updated and therefore correct. However the primer beds and stuff uses MN908947.3 genome idenfifier, that's why that is used in viralrecon. As @drpatelh says we overwrite the fasta, gff and primer bed files (in this case we just substitute MN908947.3 to NC_045512.2, as the coordinates are the same) so we get the correct annotation. I guess if nextclade is using the correct annotation but against MN908947.3, they should have done something similar but substituting the genome identifier in the gff.
It does not have a good solution in terms of reproducibility, it is true that if we want the best results for the defaults parameters we should be using the gff for NC_045512.2, but that means manually changing either the primer bed files or the gff...
Most other resources like the ARTIC primer sets etc all use MN908947.3
by default which is why it made sense to pick that. It meant we could just point to their primer beds and not have to host our own.
We could introduce a breaking change in the next release where we use a more updated annotation for MN908947.3
by copying across the GFF from NC_045512.2
and changing the values in the first column? But maybe we should also provide a mechanism where the older version can be used?
Are you using this annotation @saramonzon ? https://github.com/nf-core/configs/blob/9792226f5cecebf400205d435461ab3d047bed50/conf/pipeline/viralrecon/genomes.config#L15
I've just checked they've just updated de gff for both genebank and refseq!! Refseq: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/ Genbank: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/
Nice! Thanks @saramonzon
@peterk87 are you able to download the latest annotation for MN908947.3
and confirm that it fixes the issue for you?
I have downloaded the latest GFF file for MN908947.3
and I can't see a difference in the primary annotation which probably means it won't fix this issue. I have attached the files here to help troubleshoot. Any other suggestions?
GCA_009858895.3_ASM985889v3_genomic.200409.fna.gz GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz GCA_009858895.3_ASM985889v3_genomic.200409.gtf.gz
GCA_009858895.3_ASM985889v3_genomic.220131.fna.gz GCA_009858895.3_ASM985889v3_genomic.220131.gff.gz GCA_009858895.3_ASM985889v3_genomic.220131.gtf.gz
GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz GCF_009858895.2_ASM985889v3_genomic.200409.gff.gz GCF_009858895.2_ASM985889v3_genomic.200409.gtf.gz
GCF_009858895.2_ASM985889v3_genomic.220131.fna.gz GCF_009858895.2_ASM985889v3_genomic.220131.gff.gz GCF_009858895.2_ASM985889v3_genomic.220131.gtf.gz
Hi @drpatelh,
Thanks for looking into this so quickly! I was hoping there was a SnpEff option to get the correct AA coordinates for ORF1b mutations.
Unfortunately, the updated GFF from NCBI did not produce the results we wanted. For now we've resorted to using a custom GFF based on GCA_009858895.3_ASM985889v3_genomic.220131.gff
and it produces results that are in line with what we get from Nextclade.
In the modified GFF, the ORF1ab gene entry is replaced with 2 gene entries for ORF1a and ORF1b:
$ diff GCA_009858895.3_ASM985889v3_genomic.220131.gff MN908947.3-orf1a-orf1b-gene-split.gff
10,12c10,13
< MN908947.3 Genbank gene 266 21555 . + . ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
< MN908947.3 Genbank CDS 266 13468 . + 0 ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
< MN908947.3 Genbank CDS 13468 21555 . + 0 ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
---
> MN908947.3 Genbank gene 266 13468 . + . ID=gene-ORF1a;Name=ORF1a;gbkey=Gene;gene=ORF1a;gene_biotype=protein_coding
> MN908947.3 Genbank CDS 266 13468 . + 0 ID=cds-QHD43415.1;Parent=gene-ORF1a;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1a;product=ORF1a polyprotein;protein_id=QHD43415.1
> MN908947.3 Genbank gene 13468 21555 . + . ID=gene-ORF1b;Name=ORF1b;gbkey=Gene;gene=ORF1b;gene_biotype=protein_coding
> MN908947.3 Genbank CDS 13468 21555 . + 0 ID=cds-QHD43415.1;Parent=gene-ORF1b;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1b;product=ORF1b polyprotein;protein_id=QHD43415.1
Like you said, it's not great from a reproducibility standpoint to use a modified unofficial GFF, but at least we get the expected results for the ORF1b region.
Thanks @peterk87 ! That is very helpful. Not sure what to do about this since most of the standard annotations don't have an explicit entry for ORF1b
. Will leave this issue open here as reference.
It's also interesting to note that GISAID/CoVsurver reports AA substitutions for the component NSPs of ORF1ab/ORF1a, e.g.
N_G215C
NSP3_A1711V
Spike_P809S
Spike_T95I
N_D63G
NS7a_E91D
...
Spike_D614G
NSP4_V167L
Spike_L452R
This info is also present in the GISAID metadata table (metadata_tsv_yyyy_MM_dd.tar.xz
) you can download from GISAID under the "AA Substitutions" field. Unfortunately, I haven't been able to find much info about how or what they use for generating this info.
We've encountered issues with the reporting of ORF1ab mutations falling in the ORF1b region where the -1 frameshift is not taken into account by SnpEff.
Nextclade uses a different GFF than viralrecon when using MN908947.3 genome profile:
https://github.com/nextstrain/nextclade/blob/96e42468cae447afdbc9d0eda73f160e8049e6e4/data/sars-cov-2/genemap.gff#L8
https://github.com/nf-core/test-datasets/blob/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz
So for example a C14322T nucleotide substitution will be reported by SnpEff as a T4686I AA substitution, but Nextclade will only report the nucleotide substitution and no AA substitution.
The AA at position 4686 should be Y instead of the T reported by SnpEff:
screenshot from https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=graph
Potential solution
Use a custom GFF with separate ORF1a and ORF1b "gene" entries like Nextclade does.