pcingola / SnpEff

Other
244 stars 78 forks source link

errors in protein sequence output #284

Closed gbottu closed 3 years ago

gbottu commented 3 years ago

I noticed the following problem. I use SnpEff 5.0 (build 2020-08-09 21:23) I run the command java -jar snpEff.jar -i vcf -o vcf -fastaProt out.fa hg38 in.vcf > out.vcf Now look in the file out.fa at the sequence NM_001135865.1. At position 138 there should be a G, except of course for "Variant 16:22527841-22527841 Ref:G Alt:T HGVS.p:p.Gly138Val" where there must be a V. However, you often see a V instead of a G.

I am aware that the parameter -fastaProt is not documented. Is it possible that this bug is a well known old issue ? Yet, I would be very happy if tis bug could be fixed, since we are developing an analysis pipeline and SnpEff seemed perfect for what we need.

Thanks in advance Guy Bottu Flemish Institute for Biotechnology

in.vcf.txt out.fa.txt out.vcf.txt

pcingola commented 3 years ago

Hi, First, apologies for taking such a long time to reply.

I should clarify that this is not a bug, but just a misinterpretation of the (rather confusing) results.

The issue is that there are several mappings of NM_001135865.1 You can read some details of why this happens and how SnpEff deals with these situations in this FAQ: https://pcingola.github.io/SnpEff/se_faq/#multiple-version-of-refseq-transcripts

In one of these mappings, NM_001135865.1 at amino acid 138 is 'G', whereas in alternative mapping NM_001135865.1.4 amino acid 138 is 'V'.

The confusion is further exacerbated by the fact that one of your variants is exactly Gly138Val, which matches the exact difference between two of the four mappings for transcript NM_001135865.1. It is likely that this is not just a coincidence, but the variant itself might be a result of sequencing reads mapped to an alternative genomic position.

Readin your out.fa file, you can see that all NM_001135865.1 have amino acid G at position 138 (except the variant, of course)

Scroll right to see the difference ------>>>
                                                                                                                                         | Here is amino acid 138, all are 'G'
>NM_001135865.1 Ref                                                                                                                      |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1 Variant 16:22527841-22527841 Ref:G Alt:T HGVS.p:p.Gly138Val                                                              |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIVLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1 Ref                                                                                                                      ^ Here is the 'V'
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1 Variant 16:22527961-22527961 Ref:C Alt:G HGVS.p:p.Thr178Ser                                                              |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKSARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1 Ref                                                                                                                      |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1 Variant 16:22536189-22536189 Ref:C Alt:G HGVS.p:p.Pro1069Arg                                                             |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
                                                                                                                                         | All are 'G' except in Gly138Val which has a 'V'

But there is another (very similar) version of the transcript, which SnpEff reported as NM_001135865.1.4, in this case, your out.fa file correctly reports alternative mapping as 'V' in possition 138:

Scroll right to see the difference ------>>>
                                                                                                                                         | Here is amino acis 138, all are 'V'
>NM_001135865.1.4 Ref                                                                                                                    |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIVLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...
>NM_001135865.1.4 Variant 16:21836331-21836331 Ref:G Alt:T HGVS.p:p.Pro639Thr                                                            |
MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIVLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMAAAEHRHSSGLPYWPYLTAET...

To further show you the difference between NM_001135865.1 and alternative mapping NM_001135865.1.4, you can use the show command in SnpEff:

$ java -jar snpEff.jar show -v hg38 NM_001135865.1 NM_001135865.1.4 | tee tmp.txt

# Note: Output edited for readability
# Scroll right to see the difference ------>>>
#                                                                                                                                                        | AA 138
#                                                                                                                                                        |
/Users/kqrw311/snpEff/issue_284$                                                                                                                         |
Showing genes and transcripts using zero-based coordinates                                                                                               |
Transcript (codon table: Standard ) :   16:22513522-22536519, strand: +, id:NM_001135865.1, Protein, DNA check                                           |
    ...                                                                                                                                                  |
    Protein :   MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIGLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMA...
    ...                                                                                                                                                  | NM_001135865.1 has a 'G'
                                                                                                                                                         |
Transcript (codon table: Standard ) :   16:21834582-21857656, strand: -, id:NM_001135865.1.4, Protein                                                    |
    ...                                                                                                                                                  |
    Protein :   MVKLSIVLTPQFLSHDQGQLTKELQQHVKSVTCPCEYLRKVINTLADHHHRGTDFGGSPWLHVIIAFPTSYKVVITLWIVYLWVSLLKTIFWSRNGHDGSTDVQQRAWRSNRRRQEGLRSICMHTKKRVSSFRGNKIVLKDVITLRRHVETKVRAKIRKRKVTTKINHHDKINGKRKTARKQKMFQRAQELRRRAEDYHKCKIPPSARKALCNWVRMA...
    ...                                                                                                                                                  | NM_001135865.1.4 has a 'V'

I understand that this is a rather confusing situation, particularly because most people do expect all transcripts to be uniquely mapped to the genome, unfortunately, this basic assumption is not true for RefSeq. This is one of the reasons why some people, myself included, prefer to use ENSEMBL references (GRCh38) instead of RefSeq (hg38), whenever possible.

I hope this clarifies.

pcingola commented 3 years ago

Closing. See the explanation on why I don't think this is an error / bug.