Open apetkau opened 5 months ago
Other consistency checks could be:
- If `--distm hamming`, then all distance values from profile_dists (and passed thresholds) should be integers.
- If `--distm scaled`, then all distance values from profile_dists should be between 0 and 1.

Not sure if it would add a lot more time for consistency checking of calculated distance values though.
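A minimal sketch of what such a check over the calculated distances might look like. The function name `distances_consistent` is hypothetical and not part of gas or profile_dists; it only illustrates the two rules listed above.

```python
def distances_consistent(distances, distm):
    """Hypothetical helper: True if every distance fits the unit named by distm."""
    if distm == "hamming":
        # Hamming distances should be non-negative whole numbers.
        return all(d >= 0 and float(d).is_integer() for d in distances)
    if distm == "scaled":
        # Scaled distances should fall between 0 and 1 inclusive.
        return all(0 <= d <= 1 for d in distances)
    raise ValueError(f"unknown distance method: {distm!r}")


print(distances_consistent([0, 3, 10.0], "hamming"))
print(distances_consistent([0, 0.5], "hamming"))
```

This is a single pass over the values, so in this sketch the extra cost grows linearly with the number of distance pairs.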
Alternatively, you could change the column output from `profile_dists` to include the unit. That is:
query_id | ref_id | dist_scaled |
---|---|---|
A | A | 0 |
A | B | 0.5 |
And then passing `--distm scaled` to `gas` would only read from the `dist_scaled` column (and same for `--distm hamming`).
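As a rough illustration of the column-based idea, the sketch below picks the distance column from the value of `--distm`. It assumes a tab-separated file with a header row, and it assumes the hamming column would be named `dist_hamming` by analogy with `dist_scaled`; `read_distances` is a hypothetical helper, not an existing gas function.

```python
import csv
import io


def read_distances(handle, distm):
    """Hypothetical reader: return (query_id, ref_id, distance) using the dist_<distm> column."""
    column = f"dist_{distm}"  # assumed naming, e.g. dist_scaled or dist_hamming
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        # The file was produced with a different unit (or has no header).
        raise ValueError(f"expected column {column!r}, found {reader.fieldnames}")
    return [(row["query_id"], row["ref_id"], float(row[column])) for row in reader]


# Same rows as the table above, written as tab-separated text.
example = "query_id\tref_id\tdist_scaled\nA\tA\t0\nA\tB\t0.5\n"
rows = read_distances(io.StringIO(example), "scaled")
print(rows)
```

With this layout, asking for `hamming` against a file that only has `dist_scaled` fails immediately with an error instead of silently mixing units.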
Issue
Right now `gas call` or `gas mcluster` relies on distances calculated from `profile_dists`. Distance thresholds are set using the `--threshold` parameter. However, `profile_dists` can give distances in two different units: either `scaled` (number from 0 to 1) or `hamming` (non-negative number). The units of the thresholds need to be kept in sync with the units of the distances. For example:
- If `profile_dists` uses `scaled`, then thresholds need to be from 0 to 1, e.g., `--threshold 0.2,0.1`
- If `profile_dists` uses `hamming`, then thresholds need to be non-negative, e.g., `--threshold 10,5,0`
Solution
One solution to help with error checking is to add a `--distm` method to `gas`, that takes either `hamming` or `scaled` (same values as passed to `profile_dists`). This parameter is used to check numbers passed to `--threshold`
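
The threshold check could be sketched roughly as follows. `parse_thresholds` is a hypothetical helper rather than existing gas code, and it assumes `--threshold` arrives as a comma-separated string, as in the examples above.

```python
def parse_thresholds(text, distm):
    """Hypothetical check: parse a comma-separated threshold string and validate it against distm."""
    if distm not in ("hamming", "scaled"):
        raise ValueError(f"unknown distance method: {distm!r}")
    values = [float(v) for v in text.split(",")]
    for v in values:
        if distm == "scaled" and not 0 <= v <= 1:
            raise ValueError(f"threshold {v} is not between 0 and 1 for scaled distances")
        if distm == "hamming" and v < 0:
            raise ValueError(f"threshold {v} is negative for hamming distances")
    return values


print(parse_thresholds("0.2,0.1", "scaled"))
print(parse_thresholds("10,5,0", "hamming"))
```

Passing `"10,5,0"` together with `scaled` raises a `ValueError` in this sketch, which is the kind of mismatch the `--distm` option is meant to catch.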