phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0
25 stars 9 forks source link

Different cgMLST types depending on version of BLAST #34

Closed apetkau closed 4 years ago

apetkau commented 5 years ago

Issue

I've been investigating an issue where I'm getting different cgMLST types for an assembly depending on the underlying version of blast that SISTR depends on. That is:

blast version cgMLST ST
2.2.31 1326888805
2.5.0 19368103

Details

I explored the data a bit further and it looks like the difference comes down to a difference for an allele for NZ_AOXE01000059.1_363. I suspect these differences are due to a different order of the hits in the blast output file.

blast 2.2.31

Allele 1932373744.

python -m json.tool alleles-2.2.31.json | grep '"NZ_AOXE01000059.1_363":' -A 8

        "NZ_AOXE01000059.1_363": {
            "name": 1932373744,
            "seq": "ATGAGCTTACTCAACGTCCCGGCGGGTAAAGAACTGCCGGAAGATATCTACGTCGTTATCGAGATCCCGGCTAACGCAGATCCGATCAAATACGAAGTTGACAAAGAGAGCGGCGCGCTGTTCGTTGACCGCTTCATGTCCACCGCGATGTTCTATCCGTGCAACTACGGTTACATCAACCATACCCTGTCTCTGGACGGCGACCCGGTAGACGTTCTGGTCCCGACGCCGTACCCGCTGCAGCCAGGCGCCGTCATCCGTTGCCGTCCGGTTGGCGTACTGAAAATGACCGACGAATCCGGTGAAGATGCGAAACTGGTTGCCGTACCGCACACCAAACTGAGCAAAGAGTACGATCACATTAAAGATGTGAACGATCTGCCGGAACTGCTGAAAGCGCAGATCACTCATTTCTTCGAGCATTATAAAGATCTCGAAAAAGGCAAATGGGTGAAAGTTGACGGTTGGGACAACGCCGAAGCGGCTAAAGCGGAAATCGTTGCCTCCTTCGAGCGCGCAGCGAAGAAATAA",
            "blast_result": {
                "qseqid": "NZ_AOXE01000059.1_363|1932373744",
                "stitle": "contig00006 len=239825 cov=20.7 corr=0 spades=NODE_6_length_239825_cov_20.725767_pilon",
                "pident": 100.0,
                "length": 531,
                "mismatch": 0,

blast 2.5.0

Allele 4175098203.

python -m json.tool alleles-2.5.0.json | grep '"NZ_AOXE01000059.1_363":' -A 8

        "NZ_AOXE01000059.1_363": {
            "name": 4175098203,
            "seq": "ATGAGCTTACTCAACGTCCCGGCGGGTAAAGAACTGCCGGAAGATATCTACGTCGTTATCGAGATCCCGGCTAACGCAGATCCGATCAAATACGAAGTTGACAAAGAGAGCGGCGCGCTGTTCGTTGACCGCTTCATGTCCACCGCGATGTTCTATCCGTGCAACTACGGTTACATCAACCATACCCTGTCTCTGGACGGCGACCCGGTAGACGTTCTGGTCCCGACGCCGTACCCGCTGCAGCCAGGCGCCGTCATCCGTTGCCGTCCGGTTGGCGTACTGAAAATGACCGACGAATCCGGTGAAGATGCGAAACTGGTTGCCGTACCGCACACCAAACTGAGCAAAGAGTACGATCACATTAAAGATGTGAACGATCTGCCGGAACTGCTGAAAGCGCAGATCACTCATTTCTTCGAGCATTATAAAGATCTCGAAAAAGGCAAATGGGTGAAAGTTGACGGTTGGGACAACGCCGAAGCGGCTAAAGCGGAAATCGTTGCCTCCTTCGAGCGCGCAGCGAAGAAATAAGTT",
            "blast_result": {
                "qseqid": "NZ_AOXE01000059.1_363|299409283",
                "stitle": "contig00006 len=239825 cov=20.7 corr=0 spades=NODE_6_length_239825_cov_20.725767_pilon",
                "pident": 99.81299999999999,
                "length": 534,
                "mismatch": 1,

Reproduction

To reproduce this issue you can run the following with the the assembly SRR3028749-shovill-contigs.fasta.gz.

blast 2.2.31

conda create --name sistr-blast-2.2.31 sistr_cmd=1.0.2 blast=2.2.31
source activate sistr-blast-2.2.31

sistr -f csv -o predictions-2.2.31.csv -M -p profiles-2.2.31.csv -n novel-2.2.31.fasta -a alleles-2.2.31.json --use-full-cgmlst-db --run-mash --qc -t 1 SRR3028749-shovill-contigs.fasta

cut -d ',' -f 1 predictions-2.2.31.csv # print cgMLST ST

blast 2.5.0

conda create --name sistr-blast-2.5.0 sistr_cmd=1.0.2 blast=2.5.0
source activate sistr-blast-2.5.0

sistr -f csv -o predictions-2.5.0.csv -M -p profiles-2.5.0.csv -n novel-2.5.0.fasta -a alleles-2.5.0.json --use-full-cgmlst-db --run-mash --qc -t 1 SRR3028749-shovill-contigs.fasta

cut -d ',' -f 1 predictions-2.5.0.csv # print cgMLST ST

Solution

There's two solutions I can think of. The first and quicker one I will likely do now. The later one can probably be incorporated into the new version of SISTR.

Pin version of blast in bioconda

It looks like the version of blast is not pinned down in the current bioconda recipie.

I think for now, I will change this to a specific version of blast so that the results are consistent.

Sort order of blast results

I think a more permanent solution would be to sort the order of the blast results in SISTR so that they are consistent across blast versions. I'd need to do a bit more testing though to make sure this would actually fix our problem (or, maybe the blast results are already sorted, in which case I'm not sure why there are differences). I think this can be left for our new release of SISTR.

Do you have any additional thoughts @jrober84 or @peterk87.

apetkau commented 5 years ago

This has been partially addressed with the new bioconda release (https://github.com/bioconda/bioconda-recipes/pull/13441) which fixes blast at 2.5.0. I'm still going to leave this open for now though.

tseemann commented 4 years ago

We are now at BLAST 2.10.0

tseemann commented 4 years ago

https://github.com/phac-nml/sistr_cmd/blob/638148344dabf601fb6251d3186448e64960f63c/sistr/src/blast_wrapper/__init__.py#L111-L121

  def blast_against_query(self, query_fasta_path, blast_task='megablast', evalue=1e-20, min_pid=85):

the defaults are all used as far as i can tell.

apetkau commented 4 years ago

Thanks for letting me know @tseemann. I believe for the latest release of SISTR (1.1.0) I left the conda recipe fixed to BLAST 2.5.x. Though it should work with any later versions of BLAST.

I believe the main issue was that the order of some of the BLAST results must have been different between different versions of BLAST (or at least SISTR was reading the results in a different order). It's been a while since I tested this part out though so I'm still not sure if this issue is completely solved yet.

jrober84 commented 4 years ago

I had tested this two days ago and I could replicate the different ST types with different versions of blast in v. 1.1.0. So this issue will need to stay open for now. The biomarker results are being sorted by bitscore and are consistent between blast versions, so I think just applying the same approach to the cgMLST should make the results consistent between blast versions. I will take a look and see if this will be a simple change.

apetkau commented 4 years ago

Awesome :+1:. Thanks for testing this out @jrober84

jrober84 commented 4 years ago

Preliminary testing of branch: cgmlst_sorting shows consistent cgMLST ST assignment between versions of blast. All that was required was inserting a sort by bitscore prior to the other operations. @apetkau can you test this branch on your end? Using your genome SRR3028749-shovill-contigs.fasta.gz I get 3843671596 as the cgmlst_st between both versions of blast.

tseemann commented 4 years ago

Modern BLAST has new options will could be useful:

 -sorthits <Integer, (>=0 and =<4)>
   Sorting option for hits:
   alignment view options:
     0 = Sort by evalue,
     1 = Sort by bit score,
     2 = Sort by total score,
     3 = Sort by percent identity,
     4 = Sort by query coverage

 -qcov_hsp_perc <Real, 0..100>
   Percent query coverage per hsp

 -culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with:  best_hit_overhang, best_hit_score_edge
jrober84 commented 4 years ago

Thanks @tseemann , I didn't know about those new features. It might be worth adding those in and forcing people to use a newer version of blast.

npavlovikj commented 4 years ago

Considering the latest conversation here, I was wondering if the pinning of blast in the bioconda recipe for sistr_cmd can be relaxed? I have a Docker image with sistr_cmd=1.1.0 and prokka=1.14.6 which can not built since prokka requires blast>=2.7.1.

I would be more than happy to update the bioconda recipe if you all agree on that.

apetkau commented 4 years ago

Thank you for the comment @npavlovikj. Updating sistr fell off my task list for a while.

The final code that fixes this issue is actually in here https://github.com/phac-nml/sistr_cmd/pull/43 (can you confirm this @jrober84 ?).

Once that's merged in I'll have to make a new release in PyPI before it's updated in conda. I would certainly welcome any help with bioconda recipes 😄

apetkau commented 4 years ago

Also I had tested this out between versions and I couldn't find any inconsistent cgMLST results with the datasets I tried. So I believe this problem is fixed.

npavlovikj commented 4 years ago

@apetkau , thank you so much for your prompt reply. I am glad to hear that the issue is fixed. Once you make a new PyPI release, the Bioconda bot will automatically detect and build the newer version with the current recipe. Someone will just need to make sure the blast pinning is removed, so feel free to ping me to do that when the new version is released.

jrober84 commented 4 years ago

@apetkau yes the inconsistency has been fixed so it should be good to go

apetkau commented 4 years ago

The fix has been merged in (#43) and has been released as version 1.1.1. It is now in PyPI and Bioconda.