nhoffman / bioy

Tools for NGS sequence analysis and bacterial classification
GNU General Public License v3.0

don't rely on data/rank_thresholds.csv #43

Open nhoffman opened 8 years ago

nhoffman commented 8 years ago

Problem: any tax_id missing from the copy of rank_thresholds.csv that is distributed with the package will never be included in a classification, because its threshold will be NA. This shows up after blast_results is joined with rank_thresholds, around line 650 of classifier.py.
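To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the code in classifier.py: the data, the function names, and the use of `None` in place of NA are all assumptions made for illustration.

```python
# Hypothetical thresholds shipped with the package, keyed by tax_id.
packaged_thresholds = {"1279": 97.0, "1386": 98.5}

# Hypothetical BLAST hits: (query name, tax_id, percent identity).
blast_results = [
    ("seq1", "1279", 99.1),
    ("seq2", "1386", 99.0),
    ("seq3", "999999", 99.8),  # tax_id absent from the packaged file
]


def left_join(hits, thresholds):
    """Attach a threshold to each hit; None stands in for NA."""
    return [(q, t, p, thresholds.get(t)) for q, t, p in hits]


def classify(joined):
    """Keep hits that meet their threshold; a missing threshold never passes."""
    return [q for q, _, p, thr in joined if thr is not None and p >= thr]


joined = left_join(blast_results, packaged_thresholds)
classified = classify(joined)
missing = sorted({t for _, t, _, thr in joined if thr is None})
```

Here `seq3` has the highest identity of the three hits but is dropped, because its tax_id has no row in the packaged table. Collecting `missing` right after the join is one way to surface the gap instead of losing the hit silently.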

We need to do a few things to address this:

At this point we should focus on implementing the above rather than refactoring tax2thresholds.

crosenth commented 8 years ago

Summary dev steps:

  1. The default thresholds will be defined in, and read from, this file and format: https://github.com/nhoffman/bioy/blob/master/bioy_pkg/data/rank_threshold_defaults.csv
  2. Users can provide additional rank_thresholds with the --rank-thresholds argument, using the same four-column file format.
  3. Users define default behavior by rank and taxon (via tax_id). For example, a global default threshold would be defined on "root" and propagated to every member of the "root" lineage. Users can override the global root threshold by defining thresholds for a more specific taxon. For example, a user could additionally define overriding genus-level thresholds for all Bacillus species.
  4. classifier.py should exit with an error if any thresholds are still missing after the above operations, or if any threshold input is invalid.
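The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the planned implementation: the column names (`tax_id`, `rank`, `low`, `high`), the threshold values, the lineage table, and the helper names `read_thresholds` and `resolve` are all hypothetical. The actual four-column layout is whatever rank_threshold_defaults.csv defines.

```python
import csv
import io

# Hypothetical four-column layout; the real column names may differ.
DEFAULTS_CSV = """tax_id,rank,low,high
1,species,99.0,100.0
1,genus,97.0,99.0
"""

# Hypothetical user file passed via --rank-thresholds.
USER_CSV = """tax_id,rank,low,high
1386,genus,95.0,99.0
"""

# Illustrative lineages, ordered from root to the taxon itself.
LINEAGES = {
    "1423": ["1", "1386", "1423"],
    "562": ["1", "561", "562"],
}


def read_thresholds(text):
    """Parse rows into {(tax_id, rank): (low, high)}, rejecting invalid input."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        low, high = float(row["low"]), float(row["high"])
        if not 0 <= low <= high <= 100:
            raise ValueError(f"invalid threshold row: {row}")
        table[(row["tax_id"], row["rank"])] = (low, high)
    return table


def resolve(tax_id, rank, table, lineages):
    """Walk from the taxon up to root; the most specific definition wins."""
    for ancestor in reversed(lineages[tax_id]):
        if (ancestor, rank) in table:
            return table[(ancestor, rank)]
    # Step 4: stop with an error rather than continue with a missing threshold.
    raise SystemExit(f"no {rank} threshold for tax_id {tax_id}")


# Defaults first, then user rows, so user definitions take precedence.
thresholds = read_thresholds(DEFAULTS_CSV)
thresholds.update(read_thresholds(USER_CSV))
```

With these inputs, `resolve("1423", "genus", thresholds, LINEAGES)` picks up the user-supplied override on 1386, while `resolve("562", "genus", thresholds, LINEAGES)` falls back to the root default. Asking for a rank with no definition anywhere in the lineage raises `SystemExit`.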

tax2thresholds will be DEPRECATED

Previously we corrected a rank_thresholds bug in v1.9 associated with issue https://github.com/nhoffman/bioy/issues/32. All users should upgrade to at least 1.9.1 (or preferably the latest tagged version of bioy) to avoid this bug.