theiagen / public_health_bioinformatics

Bioinformatics workflows for genomic characterization, submission preparation, and genomic epidemiology of pathogens of public health concern.
GNU General Public License v3.0

[TheiaProk] Output salmonella serogroup as additional column (SISTR) #618

Open kapsakcj opened 4 days ago

kapsakcj commented 4 days ago

:cool:

:pushpin: Explain the Request

A lab requested that the Salmonella serogroup be parsed from the SeqSero2 and/or SISTR output files and reported as an additional output column.

Serogroups are usually a single letter (A, B, C, D, E, etc.), sometimes followed by a number (I'm not 100% sure on this, as I'm not too familiar with them). I believe they are related to the O-antigen.

This value can be found in the serogroup column of the SISTR output TSV; B and D1 are example values.

Not sure if this info is output by SeqSero2, but I'll keep looking.

kapsakcj commented 4 days ago

OK, looks like SeqSero2 does not predict or output this in an obvious way. I just looked through the output files produced when running the tool in `-m a` (microassembly) mode, which is our default in TheiaProk, and didn't find the serogroup anywhere.

It is likely only available in the SISTR output TSV.
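
For reference, a minimal sketch of pulling that value out of the SISTR TSV. It assumes the file is tab-delimited with a header row and a column named serogroup, as described above; the function name `serogroup_from_sistr_tsv` and the sample rows are hypothetical and not taken from the TheiaProk code.

```python
import csv
import io


def serogroup_from_sistr_tsv(tsv_text: str, column: str = "serogroup") -> str:
    """Return the serogroup from the first data row of a SISTR TSV.

    Returns an empty string if there is no data row or the column is missing.
    The column name "serogroup" is taken from the issue description.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    row = next(reader, None)  # SISTR is assumed to write one row per genome
    if row is None:
        return ""
    return (row.get(column) or "").strip()


if __name__ == "__main__":
    # Hypothetical, heavily trimmed example of a SISTR TSV
    example = "genome\tserogroup\tserovar\nsample1\tD1\tEnteritidis\n"
    print(serogroup_from_sistr_tsv(example))
```

In a workflow task, the same logic could read the TSV from disk and write the result to a text file that is exposed as a new String output.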

fraser-combe commented 1 day ago

SeqSero2 provides O-antigen predictions but not the serogroup letter directly. You would have to map the O-antigen numbers from SeqSero2 to serogroups manually. We should be able to get the required information from SISTR instead.
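
To illustrate what that manual mapping would involve, here is a rough sketch of a lookup table. The table is deliberately partial and covers only a few common O groups; the exact format of the SeqSero2 O-antigen field is an assumption here (a bare number such as 4, optionally prefixed with O: or O-), and the names `O_ANTIGEN_TO_SEROGROUP` and `serogroup_from_o_antigen` are hypothetical. Any real table would need to be checked against the White-Kauffmann-Le Minor scheme.

```python
# Partial, illustrative lookup only: a handful of common O groups.
# Some groups are defined by combinations of O factors (for example D1 vs D2),
# so a single-factor lookup like this one is a simplification.
O_ANTIGEN_TO_SEROGROUP = {
    "2": "A",
    "4": "B",
    "7": "C1",
    "9": "D1",
}


def serogroup_from_o_antigen(o_antigen: str) -> str:
    """Map an O-antigen prediction to a serogroup letter, or "" if unknown."""
    key = o_antigen.strip()
    # Assumed input variants: "4", "O:4", or "O-4"
    for prefix in ("O:", "O-"):
        if key.upper().startswith(prefix):
            key = key[len(prefix):]
    return O_ANTIGEN_TO_SEROGROUP.get(key.strip(), "")


if __name__ == "__main__":
    print(serogroup_from_o_antigen("4"))
    print(serogroup_from_o_antigen("O-9"))
```

Maintaining a complete table by hand is the main cost of this route, which is another reason to prefer the value SISTR already reports.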