SARS-CoV-2 ORF1b mutation reporting differences between SnpEff and Nextclade

peterk87 commented 2 years ago

We've encountered issues with the reporting of ORF1ab mutations falling in the ORF1b region where the -1 frameshift is not taken into account by SnpEff.

Nextclade uses a different GFF than viralrecon when using MN908947.3 genome profile:

.   .   gene    266 13468   .   +   .    gene_name=ORF1a
.   .   gene    13468   21555   .   +   .    gene_name=ORF1b

https://github.com/nextstrain/nextclade/blob/96e42468cae447afdbc9d0eda73f160e8049e6e4/data/sars-cov-2/genemap.gff#L8

MN908947.3  Genbank gene    266 21555   .   +   .   ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
MN908947.3  Genbank CDS 266 13468   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
MN908947.3  Genbank CDS 13468   21555   .   +   0   ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1

https://github.com/nf-core/test-datasets/blob/viralrecon/genome/MN908947.3/GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz

So for example a C14322T nucleotide substitution will be reported by SnpEff as a T4686I AA substitution, but Nextclade will only report the nucleotide substitution and no AA substitution.

The AA at position 4686 should be Y instead of the T reported by SnpEff:

screenshot from https://www.ncbi.nlm.nih.gov/protein/QHD43415.1?report=graph

Potential solution

Use a custom GFF with separate ORF1a and ORF1b "gene" entries like Nextclade does.

drpatelh commented 2 years ago

Thanks for reporting @peterk87! Not sure what the best solution is here because by default the pipeline will and has always used the same reference fasta and annotation for reproducibility. It is up to the users to overwrite this configuration if required. Either we go low maintenance and add in a warning to ask users to point to their annotation or we deal with it internally somehow. Thoughts?

Ping @saramonzon.

saramonzon commented 2 years ago

yes..we've encountered this issue too, it's not only the orf1b gene, the case is that the NC_045512.2 fasta from refseq and the MN908947.3 fasta from genebank are exactly the same, but the gff from refseq is more updated and therefore correct. However the primer beds and stuff uses MN908947.3 genome idenfifier, that's why that is used in viralrecon. As @drpatelh says we overwrite the fasta, gff and primer bed files (in this case we just substitute MN908947.3 to NC_045512.2, as the coordinates are the same) so we get the correct annotation. I guess if nextclade is using the correct annotation but against MN908947.3, they should have done something similar but substituting the genome identifier in the gff.

It does not have a good solution in terms of reproducibility, it is true that if we want the best results for the defaults parameters we should be using the gff for NC_045512.2, but that means manually changing either the primer bed files or the gff...

drpatelh commented 2 years ago

Most other resources like the ARTIC primer sets etc all use MN908947.3 by default which is why it made sense to pick that. It meant we could just point to their primer beds and not have to host our own.

We could introduce a breaking change in the next release where we use a more updated annotation for MN908947.3 by copying across the GFF from NC_045512.2 and changing the values in the first column? But maybe we should also provide a mechanism where the older version can be used?

Are you using this annotation @saramonzon ? https://github.com/nf-core/configs/blob/9792226f5cecebf400205d435461ab3d047bed50/conf/pipeline/viralrecon/genomes.config#L15

saramonzon commented 2 years ago

I've just checked they've just updated de gff for both genebank and refseq!! Refseq: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/ Genbank: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/009/858/895/GCA_009858895.3_ASM985889v3/

drpatelh commented 2 years ago

Nice! Thanks @saramonzon

@peterk87 are you able to download the latest annotation for MN908947.3 and confirm that it fixes the issue for you?

drpatelh commented 2 years ago

I have downloaded the latest GFF file for MN908947.3 and I can't see a difference in the primary annotation which probably means it won't fix this issue. I have attached the files here to help troubleshoot. Any other suggestions?

MN908947.3

Download date: 04th April 2020 (current default)

GCA_009858895.3_ASM985889v3_genomic.200409.fna.gz GCA_009858895.3_ASM985889v3_genomic.200409.gff.gz GCA_009858895.3_ASM985889v3_genomic.200409.gtf.gz

Download date: 31st January 2022

GCA_009858895.3_ASM985889v3_genomic.220131.fna.gz GCA_009858895.3_ASM985889v3_genomic.220131.gff.gz GCA_009858895.3_ASM985889v3_genomic.220131.gtf.gz

NC_045512.2

Download date: 04th April 2020 (current default)

GCF_009858895.2_ASM985889v3_genomic.200409.fna.gz GCF_009858895.2_ASM985889v3_genomic.200409.gff.gz GCF_009858895.2_ASM985889v3_genomic.200409.gtf.gz

Download date: 31st January 2022

GCF_009858895.2_ASM985889v3_genomic.220131.fna.gz GCF_009858895.2_ASM985889v3_genomic.220131.gff.gz GCF_009858895.2_ASM985889v3_genomic.220131.gtf.gz

peterk87 commented 2 years ago

Hi @drpatelh,

Thanks for looking into this so quickly! I was hoping there was a SnpEff option to get the correct AA coordinates for ORF1b mutations.

Unfortunately, the updated GFF from NCBI did not produce the results we wanted. For now we've resorted to using a custom GFF based on GCA_009858895.3_ASM985889v3_genomic.220131.gff and it produces results that are in line with what we get from Nextclade.

In the modified GFF, the ORF1ab gene entry is replaced with 2 gene entries for ORF1a and ORF1b:

$ diff GCA_009858895.3_ASM985889v3_genomic.220131.gff MN908947.3-orf1a-orf1b-gene-split.gff

10,12c10,13
< MN908947.3    Genbank gene    266     21555   .       +       .       ID=gene-orf1ab;Name=orf1ab;gbkey=Gene;gene=orf1ab;gene_biotype=protein_coding
< MN908947.3    Genbank CDS     266     13468   .       +       0       ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
< MN908947.3    Genbank CDS     13468   21555   .       +       0       ID=cds-QHD43415.1;Parent=gene-orf1ab;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=orf1ab;product=orf1ab polyprotein;protein_id=QHD43415.1
---
> MN908947.3    Genbank gene    266     13468   .       +       .       ID=gene-ORF1a;Name=ORF1a;gbkey=Gene;gene=ORF1a;gene_biotype=protein_coding
> MN908947.3    Genbank CDS     266     13468   .       +       0       ID=cds-QHD43415.1;Parent=gene-ORF1a;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1a;product=ORF1a polyprotein;protein_id=QHD43415.1
> MN908947.3    Genbank gene    13468   21555   .       +       .       ID=gene-ORF1b;Name=ORF1b;gbkey=Gene;gene=ORF1b;gene_biotype=protein_coding
> MN908947.3    Genbank CDS     13468   21555   .       +       0       ID=cds-QHD43415.1;Parent=gene-ORF1b;Dbxref=NCBI_GP:QHD43415.1;Name=QHD43415.1;Note=translated by -1 ribosomal frameshift;exception=ribosomal slippage;gbkey=CDS;gene=ORF1b;product=ORF1b polyprotein;protein_id=QHD43415.1

Gist here: MN908947.3-orf1a-orf1b-gene-split.gff

Like you said, it's not great from a reproducibility standpoint to use a modified unofficial GFF, but at least we get the expected results for the ORF1b region.

drpatelh commented 2 years ago

Thanks @peterk87 ! That is very helpful. Not sure what to do about this since most of the standard annotations don't have an explicit entry for ORF1b. Will leave this issue open here as reference.

peterk87 commented 2 years ago

It's also interesting to note that GISAID/CoVsurver reports AA substitutions for the component NSPs of ORF1ab/ORF1a, e.g.

N_G215C
NSP3_A1711V
Spike_P809S
Spike_T95I
N_D63G
NS7a_E91D
...
Spike_D614G
NSP4_V167L
Spike_L452R

This info is also present in the GISAID metadata table (metadata_tsv_yyyy_MM_dd.tar.xz) you can download from GISAID under the "AA Substitutions" field. Unfortunately, I haven't been able to find much info about how or what they use for generating this info.

nf-core / viralrecon