phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

contrary wzx wzy predictions #29

Closed alantsangmb closed 6 years ago

alantsangmb commented 6 years ago

I have an isolate whose top wzx hit is M84642|6,7,14,[54]|C1 (97.904% pident; 100% coverage) and whose top wzy hit is 17|gb|11|F (97.5% pident; 100% coverage). So SISTR gives serogroup C1 for serogroup_prediction.wzx_prediction.serogroup and F for serogroup_prediction.wzy_prediction.serogroup, but C1 for serogroup_prediction.serogroup and 11 for o_antigen.

For the serovar_antigen, it returns Aberdeen|Augustenborg (11:i:1,2 | 6,7,[14]:i:1,2), so sistr actually knows there are two different antigen predictions.

Is this a very rare case? Could you explain how serogroup_prediction.serogroup and o_antigen are determined when the wzx and wzy predictions differ, as in this case?

Thanks in advance.

peterk87 commented 6 years ago

Hi @alantsangmb

The sequences for the C1 and F serogroups are very similar within the wzx and wzy genes, so the two serogroups are hard to tell apart from these genes alone.

Please see this explanation here:

https://bitbucket.org/peterk87/sistr_backend/src/master/docs/serovar_prediction.rst

Specifically:

For some O- or H-antigen predictions, there may be ambiguity in what the actual antigen should be due to high nucleotide similarity between alleles, which can lead to multiple possible serovar predictions from the same antigen predictions. For example, if "g,m" is predicted for the H1 antigen, the actual H1 antigen could be "g,m,s" or "g,p" or various other g-complex antigens due to the high molecular similarity between the sequences for these antigens.

These ambiguities in antigen predictions are present for all antigens and are taken into account as best as possible in the serovar prediction logic.

Serogroup is predicted rather than the O-antigens since serogroup is as precise as you can get and even then there are ambiguities. For example, you can't easily tell apart the following serogroups from one another using molecular methods or nucleotide sequences for wzx and wzy:

  • E1, E4
  • A, D1, D2 (see the wzx heatmap described below)
  • C1, F
  • S, O62

So telling apart a '1,40' from a '40' (or other minor differences between O-antigens from the same serogroup) would be even harder than telling the above serogroups from one another. If there is a close relative in the database, then the cgMLST can potentially inform what the full serovar/antigenic formula is.
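
As a rough illustration of the list above, the ambiguous serogroup sets can be written down as data and used to check whether two conflicting calls fall into the same hard-to-separate group. This is only a sketch: the names `AMBIGUOUS_SEROGROUPS` and `same_ambiguity_group` are made up for this example and are not part of sistr_cmd.

```python
# Serogroup sets listed above as hard to separate using wzx/wzy sequences.
# Hypothetical helper for illustration; not taken from sistr_cmd.
AMBIGUOUS_SEROGROUPS = [
    {"E1", "E4"},
    {"A", "D1", "D2"},
    {"C1", "F"},
    {"S", "O62"},
]


def same_ambiguity_group(a, b):
    """Return True if a and b are identical or listed together as ambiguous."""
    if a == b:
        return True
    return any(a in group and b in group for group in AMBIGUOUS_SEROGROUPS)


# The case from this issue: wzx says C1, wzy says F.
print(same_ambiguity_group("C1", "F"))
```

A C1 versus F disagreement between wzx and wzy would therefore be flagged as an expected ambiguity rather than as a contradictory result.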


Heatmap of BLAST percent identity of wzx gene alleles from genomes identified as part of serogroups A, D1 or D2. Alleles from different serogroups are very similar to one another (>99% identity).

This is why SISTR uses the genomic distance of your input genomes to a database of Salmonella genomes to narrow down the serovar prediction.
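
The idea of using genomic distance to narrow the prediction could be sketched roughly as follows. This is not SISTR's actual algorithm: the function name `narrow_by_distance`, the input layout, and the distance values are all invented for illustration.

```python
def narrow_by_distance(candidates, reference_distances):
    """Pick the candidate serovar whose reference genome is closest.

    candidates: set of serovar names allowed by the antigen predictions.
    reference_distances: list of (serovar, distance) pairs for database genomes.
    Returns None if no reference genome belongs to a candidate serovar.
    """
    matching = [(dist, serovar)
                for serovar, dist in reference_distances
                if serovar in candidates]
    if not matching:
        return None
    # min() on (distance, serovar) tuples selects the smallest distance.
    return min(matching)[1]


# Illustrative values only; these distances are not real results.
candidates = {"Aberdeen", "Augustenborg"}
references = [("Typhimurium", 0.02), ("Augustenborg", 0.10), ("Aberdeen", 0.45)]
print(narrow_by_distance(candidates, references))
```

Note that the closest genome overall (Typhimurium in this made-up data) is ignored because it is not among the serovars compatible with the antigen predictions.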

Hope that answers your question!

alantsangmb commented 6 years ago

Hi, @peterk87 . Thank you so much for answering my questions. If I understand correctly, the report shows C1 for "serogroup" because SISTR takes the wzx gene prediction as it shows better Blast result. However, I am still confused why it only shows "11" in the column of "o_antigen".

peterk87 commented 6 years ago

Hi @alantsangmb

> If I understand correctly, the report shows C1 for "serogroup" because SISTR takes the wzx gene prediction as it shows better Blast result.

Yes, we've found the results from the wzx to be better for correctly predicting the serogroup.
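
A minimal sketch of that preference, for anyone post-processing reports: take the wzx call when there is one, fall back to wzy otherwise, and flag disagreements. The nested dictionary layout is an assumption inferred from the dotted field names in this thread, and `pick_serogroup` is a hypothetical helper, not sistr_cmd's own code.

```python
def pick_serogroup(report):
    """Return (chosen_serogroup, conflict_flag) from a report dictionary.

    Assumes a nested layout matching the dotted field names, e.g.
    report["serogroup_prediction"]["wzx_prediction"]["serogroup"].
    """
    prediction = report["serogroup_prediction"]
    wzx = prediction["wzx_prediction"].get("serogroup")
    wzy = prediction["wzy_prediction"].get("serogroup")
    # Prefer the wzx call; use wzy only when wzx gave nothing.
    chosen = wzx if wzx else wzy
    conflict = wzx is not None and wzy is not None and wzx != wzy
    return chosen, conflict


# Values from this issue: wzx predicts C1, wzy predicts F.
report = {
    "serogroup_prediction": {
        "wzx_prediction": {"serogroup": "C1"},
        "wzy_prediction": {"serogroup": "F"},
    }
}
print(pick_serogroup(report))
```

The conflict flag is there so that cases like this one can be reviewed by hand instead of being silently resolved.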

> However, I am still confused why it only shows "11" in the column of "o_antigen".

The O-antigen is inferred from the overall serovar prediction by looking up what the O-antigen is for the serovar in the serovar information table (Salmonella-serotype_serogroup_antigen_table-WHO_2007.csv on https://figshare.com/articles/sistr_cmd_v1_0_2_serotyping_databases/6615938).
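
To make the lookup concrete, here is a small sketch that maps each candidate serovar to its O-antigen. The column names `Serovar`, `O_antigen`, `H1` and `H2` are assumptions (the real CSV header may differ), and the two rows are just the formulas quoted earlier in this thread (11:i:1,2 and 6,7,[14]:i:1,2), not a copy of the actual table.

```python
import csv
import io

# Tiny stand-in for the serovar table; column names are assumed.
TABLE_CSV = '''Serovar,O_antigen,H1,H2
Aberdeen,11,i,"1,2"
Augustenborg,"6,7,[14]",i,"1,2"
'''


def load_table(text):
    """Index table rows by serovar name."""
    return {row["Serovar"]: row for row in csv.DictReader(io.StringIO(text))}


def o_antigens_for(serovar_field, table):
    """Look up the O-antigen of each serovar in a 'A|B' style field."""
    names = serovar_field.split("|")
    return {name: table[name]["O_antigen"] for name in names if name in table}


table = load_table(TABLE_CSV)
print(o_antigens_for("Aberdeen|Augustenborg", table))
```

With this stand-in table, Aberdeen maps to 11 and Augustenborg to 6,7,[14], which is why an o_antigen of 11 is consistent with a serovar call that resolves to Aberdeen.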

alantsangmb commented 6 years ago

I got it. Thank you!