takaram / kofam_scan

CLI tool to annotate genes with KOfam
https://www.genome.jp/tools/kofamkoala/
MIT License
66 stars 11 forks source link

[Question] What criteria are used to determine significant HMMER hits? #32

Open jolespin opened 11 months ago

jolespin commented 11 months ago

I'm looking at the code here but not very familiar w/ Ruby:

module KofamScan
  class Result
    extend Autoload

    autoload :WithEvalueThreshold
    autoload :WithThresholdScale
    autoload :WithThresholdScaleAndEvalueThreshold

    def self.create(query_list, threshold_scale: nil, e_value_threshold: nil)
      if threshold_scale && e_value_threshold
        WithThresholdScaleAndEvalueThreshold.new(query_list, threshold_scale, e_value_threshold)
      elsif e_value_threshold
        WithEvalueThreshold.new(query_list, e_value_threshold)
      elsif threshold_scale
        WithThresholdScale.new(query_list, threshold_scale)
      else
        Result.new(query_list)
      end
    end

Here's what ko_list looks like:

knum    threshold   score_type  profile_type    F-measure   nseq    nseq_used   alen    mlen    eff_nseq    re/pos  definition
K00001  357.90  domain  all 0.256163    2367    1915    1975    464 14.05   0.590   alcohol dehydrogenase [EC:1.1.1.1]
K00002  443.17  full    all 0.430391    2376    2273    5878    503 7.51    0.590   alcohol dehydrogenase (NADP+) [EC:1.1.1.2]
K00003  286.37  domain  all 0.945500    6268    5369    3257    782 7.03    0.590   homoserine dehydrogenase [EC:1.1.1.3]
K00004  369.60  domain  trim    0.809403    1572    1320    1364    436 5.41    0.590   (R,R)-butanediol dehydrogenase / meso-butanediol dehydrogenase / diacetyl reductase [EC:1.1.1.4 1.1.1.- 1.1.1.303]
K00005  320.93  full    all 0.984022    1449    1051    682 366 1.99    0.590   glycerol dehydrogenase [EC:1.1.1.6]
K00006  316.47  full    all 0.899202    2118    1971    3274    549 4.64    0.590   glycerol-3-phosphate dehydrogenase (NAD+) [EC:1.1.1.8]
K00007  520.80  full    all 0.998236    358 283 834 469 1.01    0.589   D-arabinitol 4-dehydrogenase [EC:1.1.1.11]
K00008  420.17  full    all 0.500025    3737    3281    3597    524 8.29    0.590   L-iditol 2-dehydrogenase [EC:1.1.1.14]
K00009  147.27  full    trim    0.997067    1556    1192    1443    459 3.28    0.590   mannitol-1-phosphate 5-dehydrogenase [EC:1.1.1.17]

I'm seeing a score threshold but not e-value threshold. Can you elaborate on the scheme used for determining significant hits?

PfisterMaxJLU commented 7 months ago

With the threshold values from the "ko_list" our group build a little helper script for the formatting and enforcement of KO cut-off thresholds in hmmsearch output files.

We discard the rare (≈1%) secondary (lower score) K assignments so keep an eye out for that. Depending on your use case, this might not be desirable.

Disregarding the dual assignments, the resulting output is functionally identical (>99.999% for our data/settings) to Kofamscan.

I uploaded a code example, which you can modify for your case. I hope this helps.

https://github.com/PfisterMaxJLU/kofamscan_cutoff