Weighted scores for sorting instead of sub-sorting

hepcat72 commented 5 years ago

Since a depth score was added to the sorting logic, we now have 3 scores to base the sort on: degree of adequate depth, degree of separation in observation ratios (by edge or mean distance), and ratio of resolution in genotype calls. If a depth of 10 is adequate, should every pair of groups that includes 1 sample with a depth of 9 (but has a ratio of resolution of 1) be sorted below a pair of groups with fully adequate depth and a ratio of resolution of 0.5? No, I don't think so, but that's what we get when we sub-sort.

Instead, if we sum weighted scores and then normalize them, we can address the issue demonstrated above. If we rely heavily on genotype calls (assuming they take depth into account), we should heavily weight the ratio of resolution score, then weight depth adequacy, then the separation gap. Perhaps 100:10:1? Maybe 2:1:1. Coming up with a default seems difficult, but it seems obvious that genotype should be the most heavily weighted.
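A minimal sketch of the weighted-sum idea, assuming each of the three scores is already scaled to the range 0 to 1. The function name `aggregate_score`, the argument names, and the 0.9 depth score for the "one sample at depth 9" case are illustrative assumptions, not the tool's actual code or values.

```python
def aggregate_score(gt, dp, gap, weights=(2.0, 1.0, 1.0)):
    """Weighted sum of three scores, normalized by the total weight.

    gt  : genotype ratio-of-resolution score (assumed 0..1)
    dp  : depth adequacy score (assumed 0..1)
    gap : observation-ratio separation score (assumed 0..1)
    weights : (w_gt, w_dp, w_gap); 2:1:1 is one of the ratios floated above
    """
    w_gt, w_dp, w_gap = weights
    total = w_gt + w_dp + w_gap
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return (w_gt * gt + w_dp * dp + w_gap * gap) / total


# Hypothetical pair A: perfect genotype resolution, one sample just under min depth.
score_a = aggregate_score(gt=1.0, dp=0.9, gap=0.5)
# Hypothetical pair B: fully adequate depth, but only half-resolved genotypes.
score_b = aggregate_score(gt=0.5, dp=1.0, gap=0.5)

# With 2:1:1 weights, A outranks B, which a depth-first sub-sort would not allow.
print(score_a > score_b)
```

Dividing by the total weight keeps the result in the same 0 to 1 range regardless of the weights chosen, so thresholds on the aggregate score stay comparable when the weights change.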

A couple of things to consider: a subgroup of the final groups could score better than a group that meets the various threshold criteria. Also, there may be no genotype calls, or they might be the wrong ploidy, in which case you'd want to weight them by 0.

hepcat72 commented 5 years ago

The current weighting is akin to 10000 : 100 : 1. Actually, depth is only evaluated as 0 or >0 and is the primary sort.
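For contrast, here is a rough sketch of what a sub-sort with a binary depth key looks like. The record layout and the order of the secondary keys (genotype score, then gap score) are assumptions for illustration; only "depth is 0 or >0 and is the primary sort" comes from the comment above.

```python
# Hypothetical records: (name, dp_score, gt_score, gap_score)
rows = [
    ("A", 0.0, 1.0, 0.9),
    ("B", 0.2, 0.5, 0.1),
    ("C", 1.0, 0.4, 0.8),
]


def subsort_key(row):
    """Lexicographic key: depth as a yes/no flag first, then the other scores."""
    _, dp, gt, gap = row
    # Secondary key order (gt before gap) is an assumption.
    return (dp > 0, gt, gap)


# Best first: any row with dp > 0 beats every row with dp == 0,
# no matter how good the other two scores are.
ranked = sorted(rows, key=subsort_key, reverse=True)
names = [row[0] for row in ranked]
print(names)
```

Row A has the best genotype and gap scores but lands last because its depth flag is false, which is the behavior that very lopsided weights would approximate.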

hepcat72 commented 5 years ago

I could sub-sort on chromosome & coordinate.

hepcat72 commented 5 years ago

The user should be able to assign weights to the 3 score types to produce an overall score that includes: GT ratio of resolution, OR gap score, and DP score. I should probably rename the scores to reflect "ratio of resolution", "gap/separation", and "depth adequacy". Sorting should be based on this overall score. I would also have to create additional options for the weights and a separate threshold for GT ratio of resolution (apart from -a). I could also add options for how to treat "no data" in the context of GT and DP scores, the way I do for OR using the gap-measure (edge or mean). E.g. GT could grow and be filtered based on a hard cutoff of no data or no mixing of calls (even if the union of 1 group differs from the other). And instead of not adding samples under the min-depth, I could add them as long as the average depth of a sample group is above the min depth score.
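The last idea, admitting a low-depth sample as long as the group average holds up, could look something like the sketch below. The function name `can_add_sample` and the choice of "at or above" rather than strictly above the minimum are assumptions, not the tool's behavior.

```python
def can_add_sample(group_depths, new_depth, min_depth):
    """Return True if the group's mean depth stays at or above min_depth
    after adding the new sample (hypothetical rule, not the tool's code)."""
    depths = list(group_depths) + [new_depth]
    return sum(depths) / len(depths) >= min_depth


# A sample at depth 9 is admitted because the group mean is still about 10.67.
print(can_add_sample([12, 11], 9, min_depth=10))

# A sample at depth 5 drags the mean to about 8.33, so it is rejected.
print(can_add_sample([10, 10], 5, min_depth=10))
```

One trade-off with this rule is that a single very deep sample can carry several shallow ones into the group, so it may be worth pairing it with a lower hard floor per sample.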

hepcat72 commented 5 years ago

[x] Add an option or options to assign sorting weights to GT, OR, and DP scores
[x] Sort based on an aggregate score calculated using those weights
[x] Make sorting weights different in genotype versus no-genotype mode
[x] Update usage/help
[x] Update README
[x] Update tests

hepcat72 commented 5 years ago

Merged

hepcat72 / vcfSampleCompare

Weighted scores for sorting instead of sub-sorting #19