timkahlke / BASTA

Basic Sequence Taxonomy Annotator
GNU General Public License v3.0
38 stars 13 forks source link

Specie level not reached #8

Closed maxibor closed 6 years ago

maxibor commented 6 years ago

Dear @timkahlke , I've been trying out BASTA on simulated data, however, I can never get down to the specie level: Here is an example of my blast output:

tmp19   NC_029448.1 91.67   48  4   0   53  100 9950    9997    2e-10   67.6
tmp19   NC_029330.1 91.30   46  4   0   54  99  10854   10899   3e-09   63.9
tmp19   NC_023799.1 91.30   46  4   0   54  99  9948    9993    3e-09   63.9
tmp19   NC_022507.1 90.00   50  4   1   51  100 9961    10009   3e-09   63.9
tmp20   NC_035317.1 100.00  100 0   0   1   100 60015   60114   5e-46   185
tmp21   NC_035995.1 100.00  100 0   0   1   100 24700   24799   5e-46   185
tmp21   NC_029485.1 100.00  100 0   0   1   100 23785   23884   5e-46   185
tmp21   NC_028523.1 100.00  100 0   0   1   100 24181   24280   5e-46   185

For the sequence tmp20, there is only one hit, so I should be able to go down the specie level, since the full taxonomic lineage is known for NC_035317.1 However, BASTA only goes to the genus level:

tmp20   Eukaryota;Streptophyta;Liliopsida;Alismatales;Hydrocharitaceae;Stratiotes;

Here is the basta command line I used:

basta sequence blast_results_100.out basta_results_100.out gb -m 1 -n 10 -i 99
maxibor commented 6 years ago

I think I found the source of this issue: In the basta/TaxTree.py file, in the method _get_known_strings() of the class TTree

It seems that you originally planned to include the species, but eventually decided not to...

# remove species
    def _get_known_strings(self,string):
        # ts = string.split(";")[:-2] #original code
        ts = string.split(";")[:-1] #modified version
        return ts

Changing string.split(";")[:-2] to string.split(";")[:-1] solves the issue and allows to recover the specie. I can make a PR if you want, but I feel like it was a development choice...

timkahlke commented 6 years ago

You're right, I wanted to limit false "very specific" assignments at one point but don't remember why. There might also be something about downstream problems for species "unknown". Will have to look into it.

timkahlke commented 6 years ago

I think the reason was that in previous versions I removed taxonomy strings with "unknown" in it. However, there are quite a few taxa with only species level "unknown" so I decided to remove the species and be fine with it. However, I'm not removing those strings anymore anyways so I added the species again ...