phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0
25 stars 9 forks source link

Sistr cmd update 2018 #35

Closed jrober84 closed 5 years ago

jrober84 commented 5 years ago

Significant changes to SISTR antigen biomarker and cgMLST database and processing of flagella antigens

jrober84 commented 5 years ago

Thanks @apetkau for the review, these were very helpful suggestions. I have fixed the handling of 0 fasta files in sistr_cmd in commit 3a81ab2 and resolved the warnings in 55077d9.

peterk87 commented 5 years ago

Hi @apetkau and @jrober84

Just to comment on the cgMLST ST and allele number issue @apetkau highlighted:

It looks like the majority of the SISTR cgMLST sequence types will change (not sure if this is expected)?

When working on the cgMLST DB for v1.0.2, I found that curation of what alleles you allow into the cgMLST scheme to be paramount to ensuring the reproducibility of the cgMLST results over time especially when adding alleles from large numbers of genomes. The cgMLST Sequence Type (ST) is just a md5sum hash of a complete cgMLST allelic profile where the cgMLST allele numbers are just a md5sum hash of the underlying sequence of the allele. So the cgMLST ST is especially sensitive to slight changes to the allele sequences.

For v1.0.2, I ended up looking through the MSAs for the medoid alleles (previously called them centroid alleles erroneously) and removing anything that visually looked like it was introducing too much variation (i.e. throwing the alignment out of alignment). I re-ran the reduced set of alleles with sistr_cmd against a set of reference genomes to produce a new cgMLST DB. I wouldn't recommend doing something like this again since it's a very manual process. Maybe only using alleles that are okay by sistr_cmd v1.0.2 in the new DB with some quick checks to ensure that no new alleles are out of alignment with the existing alleles?

It may be a good idea to pass the cgMLST FASTA DB through something like db-check to remove potentially problematic alleles.

Mostly unchanging cgMLST allele numbers and STs from v1.0.2 to v1.1.0 would be a great feature of sistr_cmd. In theory, the allele sequences found by SISTR between versions should be the same for each marker so the allele numbers should also be the same. But there's probably errors in v1.0.2 that have been fixed in v1.1.0 or slightly longer alleles that have been introduced for some markers that match better by sistr_cmd's metrics.

Another thing I was wondering about is whether allelic profile results between MacOS and Linux are comparable. That is, does the following function return the same value on different OSes and platforms (see allele_name source code)?

zlib.crc32(seq) & 0xffffffff

Hope that helps!

apetkau commented 5 years ago

Thanks for the additional info @peterk87. Changing allele numbers due to corrected errors or longer longer alleles makes sense as to why the cgMLST sequence types would change. Especially since the number is so sensitive to differences (if any allele changes at all, the sequence type changes).

I agree it would be very nice to have the cgMLST allele numbers mostly consistent. But, I also realize that this could involve a lot of work. And our primary purpose for SISTR is still serotyping (rather than cgMLST). So, I would be okay if these change (so long as serotypes are consistent). When this is integrated into IRIDA, the changes will all appear under a brand new column in our metadata table anyways (for SISTR 1.1.0), so results won't be mixed up.

However, many other people use SISTR as well. And I don't know how much they rely on the cgMLST numbers being consistent?

At a minimum, I think we should make it very clear that cgMLST types will change across versions of SISTR (e.g., in a CHANGELOG.md file, and in our release notes).

What are your thoughts @jrober84 ?

Another thing I was wondering about is whether allelic profile results between MacOS and Linux are comparable. That is, does the following function return the same value on different OSes and platforms (see allele_name source code)?

I think CRC32 is used a lot for detecting errors in data (e.g., during transmission across networks). It's also used for error checking in zip archives. So, I suspect it should be consistent across different platforms (otherwise zipped files on Linux would break on MacOS or Windows). The Python docs for that function states:

To generate the same numeric value across all Python versions and platforms, use crc32(data) & 0xffffffff. - https://docs.python.org/3/library/zlib.html#zlib.crc32

This is exactly what you do, so I hope it should be consistent. Though, I wonder if there ever could be issues with different ways of encoding text (would this lead to different CRC32 values)? Maybe this is what you meant?

jrober84 commented 5 years ago

Thanks @peterk87 and @apetkau, the change in sequence types is expected as @peterk87 said even a single allele change will change the ST type. The focus from SISTR is on the serotype calls and from the perspective of the majority of users, they are not using the cgMLST information. I agree with Aaron that we should make it clear that the cgMLST types can change between DB release versions and for now that is the best that we can do.

I have fixed the mash issue as it was the result of old data in the sketch which was no longer included in the database. I have added error checking to allow for mismatches between the mash sketch and the genomes-to-serovar table

apetkau commented 5 years ago

Thanks for all the fixes @jrober84.

That sounds good about the cgMLST information changing. Sounds like we're all on the same page.

I'm going to spend a bit more time testing this out before merging, but it overall looks great to me :+1:

apetkau commented 5 years ago

Everything looks great to me. Thanks so much @jrober84