nhoffman / bioy

Tools for NGS sequence analysis and bacterial classification
GNU General Public License v3.0

Feature Request, Classifier: keep the best N classification names #46

Closed: tyleraland closed this issue 8 years ago

tyleraland commented 8 years ago

Noah/Steve requested additional reference sequence filtering to winnow the names that appear in slash names: sort the references by number of mismatches from the query and keep the top N names, where N is provided by the user and defaults to infinity (that is, no filtering).

Example: a sequence would normally be classified as A / B / C, where the mismatches with A, B, and C are 0, 1, and 2 respectively. Given --closest 2, C would be discarded and the assigned name would be A* / B.

nhoffman commented 8 years ago

To add to the specification: the threshold applies to reference sequences, not tax_ids. For example:

| hit | mismatches | species |
|-----+------------+---------|
|   1 |          0 | A       |
|   2 |          3 | A       |
|   3 |          0 | B       |
|   4 |          2 | B       |
|   5 |          2 | C       |
|   6 |          2 | C       |

If --closest=1, keep hits {1, 3} and the classification is A/B.

If --closest=2, keep hits {1, 3, 4, 5, 6} and the classification is A/B/C.
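One reading that is consistent with both examples in this thread is that N counts distinct mismatch values rather than individual hits or names: every hit whose mismatch count is among the N lowest distinct values is kept, ties included. The following is a minimal sketch of that reading only, not the code that shipped in bioy. `Hit`, `keep_closest`, and `slash_name` are hypothetical names introduced for illustration, and the starred-name logic (`A*`) is not modeled.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One reference hit (hypothetical record, mirrors the table columns)."""
    hit: int
    mismatches: int
    species: str


def keep_closest(hits, closest=None):
    """Keep hits whose mismatch count is among the `closest` lowest
    distinct mismatch values. `closest=None` stands in for the
    "infinity" default and keeps everything."""
    if closest is None:
        return list(hits)
    # Distinct mismatch values, lowest first, truncated to the best N.
    allowed = set(sorted({h.mismatches for h in hits})[:closest])
    return [h for h in hits if h.mismatches in allowed]


def slash_name(hits):
    """Join the distinct species names in order of first appearance."""
    return "/".join(dict.fromkeys(h.species for h in hits))


# The hits from the table in this thread.
HITS = [
    Hit(1, 0, "A"), Hit(2, 3, "A"), Hit(3, 0, "B"),
    Hit(4, 2, "B"), Hit(5, 2, "C"), Hit(6, 2, "C"),
]

for n in (1, 2):
    kept = keep_closest(HITS, n)
    print(n, [h.hit for h in kept], slash_name(kept))
# 1 [1, 3] A/B
# 2 [1, 3, 4, 5, 6] A/B/C
```

Because the filter acts on hits rather than tax_ids, species A survives through hit 1 even though its other hit (hit 2, 3 mismatches) is dropped at --closest=2.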

tyleraland commented 8 years ago

Appears in 1.12