Closed: apetkau closed this issue 4 years ago
This has been partially addressed with the new bioconda release (https://github.com/bioconda/bioconda-recipes/pull/13441), which pins blast at 2.5.0. I'm still going to leave this open for now though.
We are now at BLAST 2.10.0
def blast_against_query(self, query_fasta_path, blast_task='megablast', evalue=1e-20, min_pid=85):
The defaults are all used, as far as I can tell.
Thanks for letting me know @tseemann. I believe for the latest release of SISTR (1.1.0) I left the conda recipe pinned to BLAST 2.5.x, though it should work with any later version of BLAST.
I believe the main issue was that the order of some of the BLAST results must have been different between different versions of BLAST (or at least SISTR was reading the results in a different order). It's been a while since I tested this part out though so I'm still not sure if this issue is completely solved yet.
I had tested this two days ago and I could replicate the different ST types with different versions of blast in v. 1.1.0. So this issue will need to stay open for now. The biomarker results are being sorted by bitscore and are consistent between blast versions, so I think just applying the same approach to the cgMLST should make the results consistent between blast versions. I will take a look and see if this will be a simple change.
Awesome :+1:. Thanks for testing this out @jrober84
Preliminary testing of branch: cgmlst_sorting shows consistent cgMLST ST assignment between versions of blast. All that was required was inserting a sort by bitscore prior to the other operations. @apetkau can you test this branch on your end? Using your genome SRR3028749-shovill-contigs.fasta.gz I get 3843671596 as the cgmlst_st between both versions of blast.
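The idea of sorting by bitscore before any other operation can be sketched as follows. This is a minimal illustration using only the standard library, not the code from the cgmlst_sorting branch. It assumes tabular output (-outfmt 6) with the default twelve columns; the "locus|allele" query naming, the function names, and the toy rows are all hypothetical.

```python
import csv
import io

# Default column names for BLAST tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def parse_outfmt6(text):
    """Parse tab-separated BLAST output into a list of dicts."""
    hits = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(OUTFMT6_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["length"] = int(hit["length"])
        hit["bitscore"] = float(hit["bitscore"])
        hits.append(hit)
    return hits


def sort_hits(hits):
    """Sort by descending bitscore, then by explicit tie-breakers.

    The tie-breakers make the order fully deterministic, so it no longer
    depends on the order in which a given BLAST version wrote the rows.
    """
    return sorted(
        hits,
        key=lambda h: (-h["bitscore"], -h["pident"], -h["length"],
                       h["qseqid"], h["sseqid"]),
    )


def best_allele_per_locus(hits):
    """Keep the top-ranked allele per locus (hypothetical 'locus|allele' ids)."""
    best = {}
    for hit in sort_hits(hits):
        locus, allele = hit["qseqid"].split("|", 1)
        best.setdefault(locus, allele)
    return best


# Toy rows, made up for illustration only.
ROW_1 = "locusA|allele1\tcontig1\t100.0\t300\t0\t0\t1\t300\t1\t300\t1e-150\t555.0"
ROW_2 = "locusA|allele2\tcontig1\t99.7\t300\t1\t0\t1\t300\t1\t300\t1e-148\t549.0"
ORDER_A = ROW_1 + "\n" + ROW_2 + "\n"
ORDER_B = ROW_2 + "\n" + ROW_1 + "\n"
```

With the sort in place, both row orders select the same allele for the locus, which is the behaviour needed for a stable cgMLST ST across BLAST versions.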
Modern BLAST has new options which could be useful:
-sorthits <Integer, (>=0 and =<4)>
  Sorting option for hits:
  alignment view options:
    0 = Sort by evalue,
    1 = Sort by bit score,
    2 = Sort by total score,
    3 = Sort by percent identity,
    4 = Sort by query coverage
-qcov_hsp_perc <Real, 0..100>
  Percent query coverage per hsp
-culling_limit <Integer, >=0>
  If the query range of a hit is enveloped by that of at least this many
  higher-scoring hits, delete the hit
   * Incompatible with: best_hit_overhang, best_hit_score_edge
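As a rough sketch of how two of these options could be passed through from Python, the helper below builds a blastn argument list. The function name and its parameters are hypothetical; the defaults for blast_task, evalue and min_pid are copied from the blast_against_query signature quoted earlier, and mapping min_pid to -perc_identity is an assumption about how SISTR uses it. The help text lists -sorthits under alignment view options, so it may not affect tabular output and is left out here.

```python
def build_blastn_args(query_fasta, db_path, out_path,
                      blast_task="megablast", evalue=1e-20, min_pid=85,
                      qcov_hsp_perc=None, culling_limit=None):
    """Build a blastn command line as a list of arguments (nothing is run).

    qcov_hsp_perc and culling_limit are only added when given, so older
    BLAST versions that lack them can still use the default call.
    """
    args = [
        "blastn",
        "-task", blast_task,
        "-query", query_fasta,
        "-db", db_path,
        "-out", out_path,
        "-outfmt", "6",
        "-evalue", str(evalue),
        # Assumption: min_pid corresponds to blastn's -perc_identity.
        "-perc_identity", str(min_pid),
    ]
    if qcov_hsp_perc is not None:
        if not 0 <= qcov_hsp_perc <= 100:
            raise ValueError("qcov_hsp_perc must be between 0 and 100")
        args += ["-qcov_hsp_perc", str(qcov_hsp_perc)]
    if culling_limit is not None:
        if culling_limit < 0:
            raise ValueError("culling_limit must be >= 0")
        # Per the help text, not to be combined with
        # best_hit_overhang / best_hit_score_edge.
        args += ["-culling_limit", str(culling_limit)]
    return args
```

The resulting list could be handed to subprocess.run; the file names used when calling it are up to the caller.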
Thanks @tseemann, I didn't know about those new features. It might be worth adding those in and forcing people to use a newer version of blast.
Considering the latest conversation here, I was wondering if the pinning of blast in the bioconda recipe for sistr_cmd can be relaxed? I have a Docker image with sistr_cmd=1.1.0 and prokka=1.14.6 which cannot be built, since prokka requires blast>=2.7.1.
I would be more than happy to update the bioconda recipe if you all agree on that.
Thank you for the comment @npavlovikj. Updating sistr fell off my task list for a while.
The final code that fixes this issue is actually in https://github.com/phac-nml/sistr_cmd/pull/43 (can you confirm this, @jrober84?).
Once that's merged in I'll have to make a new release in PyPI before it's updated in conda. I would certainly welcome any help with bioconda recipes 😄
Also I had tested this out between versions and I couldn't find any inconsistent cgMLST results with the datasets I tried. So I believe this problem is fixed.
@apetkau, thank you so much for your prompt reply. I am glad to hear that the issue is fixed. Once you make a new PyPI release, the Bioconda bot will automatically detect and build the newer version with the current recipe. Someone will just need to make sure the blast pinning is removed, so feel free to ping me to do that when the new version is released.
@apetkau yes, the inconsistency has been fixed, so it should be good to go.
The fix has been merged in (#43) and has been released as version 1.1.1. It is now in PyPI and Bioconda.
Issue
I've been investigating an issue where I'm getting different cgMLST types for an assembly depending on the underlying version of blast that SISTR depends on. That is:
2.2.31
2.5.0
Details
I explored the data a bit further and it looks like the difference comes down to a different allele call for NZ_AOXE01000059.1_363. I suspect this is due to a different order of the hits in the blast output file.
blast 2.2.31
Allele 1932373744.
blast 2.5.0
Allele 4175098203.
Reproduction
To reproduce this issue you can run the following with the assembly SRR3028749-shovill-contigs.fasta.gz.
blast 2.2.31
blast 2.5.0
Solution
There are two solutions I can think of. The first and quicker one I will likely do now. The latter one can probably be incorporated into the new version of SISTR.
Pin version of blast in bioconda
It looks like the version of blast is not pinned down in the current bioconda recipe.
I think for now, I will change this to a specific version of blast so that the results are consistent.
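Alongside pinning in the recipe, a runtime check could flag an unexpected blast version. The sketch below is only an illustration: the function names are hypothetical, and it assumes the first line of `blastn -version` looks like "blastn: 2.5.0+".

```python
import re
import subprocess


def parse_blast_version(version_output):
    """Extract (major, minor, patch) from `blastn -version` output."""
    match = re.search(r"blastn:\s*(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("could not find a blastn version string")
    return tuple(int(part) for part in match.groups())


def installed_blast_version():
    """Run `blastn -version` and parse it (requires blastn on the PATH)."""
    result = subprocess.run(
        ["blastn", "-version"],
        capture_output=True, text=True, check=True,
    )
    return parse_blast_version(result.stdout)


def matches_pinned_version(version, pinned=(2, 5)):
    """True if the major/minor version equals the pinned one (2.5.x here)."""
    return version[:2] == tuple(pinned)
```

Comparing integer tuples rather than strings avoids ordering 2.10.0 before 2.5.0.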
Sort order of blast results
I think a more permanent solution would be to sort the order of the blast results in SISTR so that they are consistent across blast versions. I'd need to do a bit more testing though to make sure this would actually fix our problem (or, maybe the blast results are already sorted, in which case I'm not sure why there are differences). I think this can be left for our new release of SISTR.
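One way to test whether ordering alone explains the difference is to compare the raw output files from the two blast versions with and without regard to row order. This is a minimal sketch with a hypothetical function name; it assumes both files are tab-separated and were produced from the same query and database.

```python
def read_rows(text):
    """Split tabular BLAST output into a list of column tuples."""
    return [tuple(line.split("\t")) for line in text.splitlines() if line.strip()]


def compare_blast_outputs(text_a, text_b):
    """Classify how two BLAST tabular outputs differ.

    Returns one of:
      "identical"                  - same rows in the same order
      "same hits, different order" - sorting inside SISTR should be enough
      "different hits"             - the versions report different hits
    """
    rows_a = read_rows(text_a)
    rows_b = read_rows(text_b)
    if rows_a == rows_b:
        return "identical"
    if sorted(rows_a) == sorted(rows_b):
        return "same hits, different order"
    return "different hits"
```

If the result is "same hits, different order", sorting the results in SISTR should be sufficient; "different hits" would mean the versions disagree on more than ordering, for example through small score differences.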
Do you have any additional thoughts, @jrober84 or @peterk87?