phac-nml / genomic_address_service

Package for clustering sequences de novo and assignment to existing nomenclature
Apache License 2.0

Consistency checking between profile_dists and gas distance units #8

Open apetkau opened 5 months ago

apetkau commented 5 months ago

Issue

Right now, both gas call and gas mcluster rely on distances calculated by profile_dists. Distance thresholds are set using the --threshold parameter. However, profile_dists can report distances in two different units: scaled (a number from 0 to 1) or hamming (a non-negative number). The units of the values passed to --threshold need to be kept in sync with the units of the distances. For example:

Solution

One solution to help with error checking is to add a --distm parameter to gas that takes either hamming or scaled (the same values as passed to profile_dists). This parameter would then be used to validate the numbers passed to --threshold.
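
A minimal sketch of what such a check could look like, in Python. This is only an illustration under assumptions: the function name validate_thresholds is hypothetical, --distm is the parameter proposed here rather than an existing gas option, and the thresholds are assumed to have already been parsed into a list of floats.

```python
# Hypothetical helper: not part of gas. It illustrates checking
# --threshold values against the unit proposed for --distm.
VALID_DISTM = ("hamming", "scaled")


def validate_thresholds(thresholds, distm):
    """Return the thresholds as a list, or raise ValueError if any value
    is inconsistent with the given distance unit."""
    if distm not in VALID_DISTM:
        raise ValueError(f"distm must be one of {VALID_DISTM}, got {distm!r}")

    checked = []
    for value in thresholds:
        # Both units are non-negative.
        if value < 0:
            raise ValueError(f"threshold {value} is negative")
        # Scaled distances are described in this issue as lying in [0, 1].
        if distm == "scaled" and value > 1:
            raise ValueError(
                f"threshold {value} is out of range for scaled distances (0 to 1)"
            )
        checked.append(value)
    return checked
```

With this sketch, validate_thresholds([10, 5], "scaled") raises a ValueError, while the same list passes with "hamming". Note that a hamming threshold of 1 or less cannot be told apart from a scaled one by range alone, so the check only catches one direction of the mismatch.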

apetkau commented 5 months ago

Other consistency checks could be:

I'm not sure whether consistency checking of the calculated distance values would add a lot more run time, though.

apetkau commented 5 months ago

Alternatively, you could change the column output from profile_dists to include the unit. That is:

query_id ref_id dist_scaled
A A 0
A B 0.5

And then passing --distm scaled to gas would only read from the dist_scaled column (and same for --distm hamming).
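
A rough Python sketch of that column selection, under a few assumptions: the function name read_distances is hypothetical, the file is assumed to be tab-delimited, and the dist_hamming column name is inferred by analogy with dist_scaled rather than taken from profile_dists.

```python
import csv
import io

# Assumed mapping from the proposed --distm value to the column name.
# dist_scaled comes from the example above; dist_hamming is inferred.
DIST_COLUMNS = {"scaled": "dist_scaled", "hamming": "dist_hamming"}


def read_distances(handle, distm):
    """Read (query_id, ref_id, distance) rows from a tab-delimited file,
    using only the distance column that matches distm."""
    if distm not in DIST_COLUMNS:
        raise ValueError(f"unknown distm {distm!r}")
    column = DIST_COLUMNS[distm]

    reader = csv.DictReader(handle, delimiter="\t")
    # Fail early if the file was produced with a different unit.
    if column not in (reader.fieldnames or []):
        raise ValueError(
            f"column {column!r} not found; file has {reader.fieldnames}"
        )
    return [
        (row["query_id"], row["ref_id"], float(row[column]))
        for row in reader
    ]


# Example using the table from this comment.
example = "query_id\tref_id\tdist_scaled\nA\tA\t0\nA\tB\t0.5\n"
rows = read_distances(io.StringIO(example), "scaled")
```

Reading the same example with "hamming" would raise a ValueError because the dist_hamming column is absent, which is the kind of mismatch this issue is trying to catch.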