Classifier assigning part of some sequence names to "root" #51

tyleraland closed 8 years ago

tyleraland commented 8 years ago

I have a sequence that returns many BLAST hits between 97 and 98 pident. Some of the reference names in the details output are used in the final assignment, while the rest are condensed to "root". I'm not sure whether this is the expected behavior, but it looks odd, so I wanted to share a reproducible case.

database: RDP 11_4
hits (one qseqid): /mnt/disk2/molmicro/working/tland9/2016-02-02_capture/1629-blast-hits
seq_info: /molmicro/common/rdp/11_4/rdp/11_4.0/tax_filter/filtered/seq_info.csv
taxonomy: /molmicro/common/rdp/11_4/rdp/11_4.0/tax_filter/filtered/taxonomy.csv

Classifier command:

$ bioy classifier 1629-blast-hits \
    /molmicro/common/rdp/11_4/rdp/11_4.0/tax_filter/filtered/seq_info.csv \
    <(csvcut -c tax_id,tax_name,rank,root,kingdom,phylum,class,order,family,genus,species /molmicro/common/rdp/11_4/rdp/11_4.0/tax_filter/filtered/taxonomy.csv) \
    --has-header --specimen 168_35 --out classify_out --details-out details_out

Classification output:

$ csvlook classify_out
|  specimen | assignment_id | assignment                    | max_percent | min_percent | min_threshold | best_rank | reads | clusters | pct_reads  |
|  168_35   | 0             | Cynomorium;Stigeoclonium;root | 97.96       | 97.22       | 97.00         | root      | 1     | 1        | 100.00     |

Details output (notice that every row except Cynomorium and Stigeoclonium has condensed_id == 1):

$ csvcut -c "tax_name,assignment_tax_name,assignment_rank,pident,tax_id,assignment_tax_id,condensed_id" details_out | csvlook
|  tax_name                        | assignment_tax_name | assignment_rank | pident | tax_id  | assignment_tax_id | condensed_id  |
|  Actinomadura hallensis          | Actinomadura        | genus           | 97.44  | 337895  | 1988              | 1             |
|  Arthrobacter sulfonivorans      | Arthrobacter        | genus           | 97.37  | 121292  | 1663              | 1             |
|  Arthrobacter sulfureus          | Arthrobacter        | genus           | 97.37  | 43666   | 1663              | 1             |
|  Bacillus cereus                 | Bacillus            | genus           | 97.67  | 1396    | 1386              | 1             |
|  Bacillus licheniformis          | Bacillus            | genus           | 97.67  | 1402    | 1386              | 1             |
|  Bacillus pumilus                | Bacillus            | genus           | 97.67  | 1408    | 1386              | 1             |
|  Bacillus pumilus                | Bacillus            | genus           | 97.67  | 1408    | 1386              | 1             |
|  Bacillus pumilus                | Bacillus            | genus           | 97.67  | 1408    | 1386              | 1             |
|  Bacillus subtilis               | Bacillus            | genus           | 97.67  | 1423    | 1386              | 1             |
|  Candidatus Halomonas phosphatis | Halomonas           | genus           | 97.96  | 1107859 | 2745              | 1             |
|  Cynomorium coccineum            | Cynomorium          | genus           | 97.56  | 51503   | 51502             | 51502         |
|  Dermabacter hominis             | Dermabacter         | genus           | 97.37  | 36740   | 36739             | 1             |
|  Enterococcus faecium            | Enterococcus        | genus           | 97.67  | 1352    | 1350              | 1             |
|  Klebsiella oxytoca              | Klebsiella          | genus           | 97.96  | 571     | 570               | 1             |
|  Klebsiella oxytoca              | Klebsiella          | genus           | 97.96  | 571     | 570               | 1             |
|  Megasphaera elsdenii            | Megasphaera         | genus           | 97.67  | 907     | 906               | 1             |
|  Methylomonas aurantiaca         | Methylomonas        | genus           | 97.96  | 39771   | 416               | 1             |
|  Methylopila capsulata           | Methylopila         | genus           | 97.22  | 61654   | 61653             | 1             |
|  Methylorhabdus multivorans      | Methylorhabdus      | genus           | 97.22  | 61656   | 61655             | 1             |
|  Mycobacterium angelicum         | Mycobacterium       | genus           | 97.44  | 470074  | 1763              | 1             |
|  Sinorhizobium meliloti          | Sinorhizobium       | genus           | 97.96  | 382     | 28105             | 1             |
|  Stenotrophomonas rhizophila     | Stenotrophomonas    | genus           | 97.96  | 216778  | 40323             | 1             |
|  Stigeoclonium helveticum        | Stigeoclonium       | genus           | 97.37  | 55999   | 55998             | 55998         |
|  Streptococcus agalactiae        | Streptococcus       | genus           | 97.67  | 1311    | 1301              | 1             |

crosenth commented 8 years ago

Cynomorium and Stigeoclonium are both genus level classifications. The taxonomy of the rest looks like this:

| tax_id  | tax_name                        | rank    | root | kingdom | phylum | class  | order  | family | genus | species |
| 337895  | Actinomadura hallensis          | species | 1    |         | 201174 | 1760   | 85012  | 2012   | 1988  | 337895  |
| 121292  | Arthrobacter sulfonivorans      | species | 1    |         | 201174 | 1760   | 85006  | 1268   | 1663  | 121292  |
| 43666   | Arthrobacter sulfureus          | species | 1    |         | 201174 | 1760   | 85006  | 1268   | 1663  | 43666   |
| 1396    | Bacillus cereus                 | species | 1    |         | 1239   | 91061  | 1385   | 186817 | 1386  | 1396    |
| 1402    | Bacillus licheniformis          | species | 1    |         | 1239   | 91061  | 1385   | 186817 | 1386  | 1402    |
| 1408    | Bacillus pumilus                | species | 1    |         | 1239   | 91061  | 1385   | 186817 | 1386  | 1408    |
| 1423    | Bacillus subtilis               | species | 1    |         | 1239   | 91061  | 1385   | 186817 | 1386  | 1423    |
| 1107859 | Candidatus Halomonas phosphatis | species | 1    |         | 1224   | 1236   | 135619 | 28256  | 2745  | 1107859 |
| 36740   | Dermabacter hominis             | species | 1    |         | 201174 | 1760   | 85006  | 85020  | 36739 | 36740   |
| 1352    | Enterococcus faecium            | species | 1    |         | 1239   | 91061  | 186826 | 81852  | 1350  | 1352    |
| 571     | Klebsiella oxytoca              | species | 1    |         | 1224   | 1236   | 91347  | 543    | 570   | 571     |
| 907     | Megasphaera elsdenii            | species | 1    |         | 1239   | 909932 | 909929 | 31977  | 906   | 907     |
| 39771   | Methylomonas aurantiaca         | species | 1    |         | 1224   | 1236   | 135618 | 403    | 416   | 39771   |
| 61654   | Methylopila capsulata           | species | 1    |         | 1224   | 28211  | 356    | 31993  | 61653 | 61654   |
| 61656   | Methylorhabdus multivorans      | species | 1    |         | 1224   | 28211  | 356    | 45401  | 61655 | 61656   |
| 470074  | Mycobacterium angelicum         | species | 1    |         | 201174 | 1760   | 85007  | 1762   | 1763  | 470074  |
| 382     | Sinorhizobium meliloti          | species | 1    |         | 1224   | 28211  | 356    | 82115  | 28105 | 382     |
| 216778  | Stenotrophomonas rhizophila     | species | 1    |         | 1224   | 1236   | 135614 | 32033  | 40323 | 216778  |
| 1311    | Streptococcus agalactiae        | species | 1    |         | 1239   | 91061  | 186826 | 1300   | 1301  | 1311    |

The phylum column contains 4 different tax_ids, which means the default --max-group-size of 3 bumps the last classification up to "root". The other two genus assignments are awkward artifacts of the algorithm. If you experiment with --max-group-size you should get a better result.
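
When choosing a value for --max-group-size, it can help to see how many distinct tax_ids each rank column holds. The following is a minimal sketch of that count, run on a handful of rows copied from the taxonomy table above (not the full hit set, so the numbers are illustrative only). It is not bioy's implementation, `distinct_per_rank` is a hypothetical helper, and exactly how bioy compares these counts to the threshold is not shown here.

```python
import csv
import io

# A few rows copied from the taxonomy table above (a subset, for illustration only).
TAXONOMY = """\
tax_id,tax_name,phylum,class,order,family,genus
337895,Actinomadura hallensis,201174,1760,85012,2012,1988
1396,Bacillus cereus,1239,91061,1385,186817,1386
1402,Bacillus licheniformis,1239,91061,1385,186817,1386
571,Klebsiella oxytoca,1224,1236,91347,543,570
907,Megasphaera elsdenii,1239,909932,909929,31977,906
382,Sinorhizobium meliloti,1224,28211,356,82115,28105
"""

RANKS = ["phylum", "class", "order", "family", "genus"]


def distinct_per_rank(csv_text, ranks=RANKS):
    """Count the distinct, non-empty tax_ids found in each rank column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {rank: len({row[rank] for row in rows if row[rank]}) for rank in ranks}


counts = distinct_per_rank(TAXONOMY)
for rank in RANKS:
    print(rank, counts[rank])
```

The same helper could be pointed at the full taxonomy.csv by reading the file contents instead of the inline string, provided the file uses the same column names.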

The overall low specificity assignment can be expected given the short read lengths, low hit coverage and low pident of the Blast hits:

| qseqid                                       | sseqid     | pident | qstart | qend | qlen | qcovs |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000000777 | 94.55  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000009620 | 94.55  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000628083 | 94.55  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000628084 | 94.55  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002165164 | 92.73  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000495026 | 92.73  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001294523 | 92.73  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000628080 | 92.73  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004226555 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000902853 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004226556 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000004557 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001242189 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000996434 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066145 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066147 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066146 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066149 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066148 | 91.82  | 1      | 110  | 250  | 44.00 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002965784 | 90.57  | 5      | 110  | 250  | 42.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002963222 | 90.62  | 15     | 110  | 250  | 38.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001795518 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003611774 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003289772 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000776263 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000012740 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000967118 | 97.96  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000399427 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000541623 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000541624 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000540588 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000540589 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000436251 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002232091 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000351163 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004056772 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004056755 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002916791 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003286531 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066246 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066244 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004066245 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000623308 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000540590 | 95.92  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004223170 | 94.12  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000434759 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001155653 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001353096 | 93.88  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003312934 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002034309 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000444934 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001020487 | 95.65  | 2      | 47   | 250  | 18.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001020484 | 95.65  | 2      | 47   | 250  | 18.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S004230775 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000641690 | 95.65  | 2      | 47   | 250  | 18.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001353095 | 93.88  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001353094 | 93.88  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003721966 | 91.84  | 2      | 50   | 250  | 19.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001589960 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001199566 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000926647 | 97.67  | 2      | 44   | 250  | 17.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000641717 | 97.56  | 10     | 50   | 250  | 16.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000925775 | 97.44  | 2      | 40   | 250  | 15.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000633544 | 97.44  | 2      | 40   | 250  | 15.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S002352232 | 97.37  | 2      | 39   | 250  | 15.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001020535 | 97.37  | 2      | 39   | 250  | 15.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000009616 | 97.37  | 2      | 39   | 250  | 15.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S003611785 | 97.37  | 2      | 39   | 250  | 15.20 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001093959 | 94.87  | 2      | 40   | 250  | 15.60 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000386911 | 97.22  | 2      | 37   | 250  | 14.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S000022501 | 97.22  | 2      | 37   | 250  | 14.40 |
| M03029:113:000000000-ALUWU:1:2102:18344:1629 | S001875319 | 94.74  | 2      | 39   | 250  | 15.20 |
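
To make the coverage point concrete, here is a small sketch that recomputes query coverage from qstart, qend and qlen for a few rows copied from the table above. It assumes each qcovs value reflects a single alignment per subject, and `query_coverage` is a hypothetical helper, not part of bioy or BLAST.

```python
def query_coverage(qstart, qend, qlen):
    """Percent of the query covered by one alignment (inclusive coordinates)."""
    return round((qend - qstart + 1) / qlen * 100, 2)


# (pident, qstart, qend, qlen, qcovs) copied from rows of the table above.
HITS = [
    (94.55, 1, 110, 250, 44.00),
    (90.57, 5, 110, 250, 42.40),
    (97.96, 2, 50, 250, 19.60),
    (97.67, 2, 44, 250, 17.20),
    (97.22, 2, 37, 250, 14.40),
]

for pident, qstart, qend, qlen, qcovs in HITS:
    print(pident, query_coverage(qstart, qend, qlen), qcovs)

# Best coverage among the sampled hits with pident >= 97.
best_high_identity = max(
    query_coverage(qstart, qend, qlen)
    for pident, qstart, qend, qlen, _ in HITS
    if pident >= 97.0
)
print(best_high_identity)
```

Among these sampled rows, the hits with pident of 97 or higher all align to less than a fifth of the 250 bp query, while the longer alignments have lower identity.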