Base the separation gap on max/min difference instead of the difference in the averages

hepcat72 commented 5 years ago

I could probably make this behavior selectable via an option.

hepcat72 commented 5 years ago

One problem: If the groups are user-defined, there may not be a gap between the values. That's probably why I chose to use the averages in the first place...

So here's how I might solve that...

When growing the groups, I have to start with 1 sample in each, and unless every value is the same, there has to be a gap. I can either stop growing when the gap becomes negative, or force the minimum group size and accept a negative gap. If a group hasn't reached the minimum size, it would either get filtered or end up sorted at the bottom with a negative score.

I just looked at the code, and it appears that a group of the minimum size is created regardless of whether a gap is present. That makes sense. If the user entered a minimum group size, groups of that size should always be created, and the score should reflect whether the groups are bad or not; the gap would indicate that. So I will always create groups of the minimum size.

If I always create groups of the minimum size, then I need to think about what max/min difference means when the observation ratios overlap. Like, what if one group's ratios surround another group's ratios? What if there are outer ratios that differ, but the inner ratios pass each other (e.g. group1: 0.0,0.6 and group2: 0.3,1.0)?

I think if one group surrounds another, they're indistinguishable, so the score should be -1. If there is separation of some values but mixing of others, the gap should be the negative of the absolute difference between the outer overlapping values, e.g. -(0.6 - 0.3) = -0.3 in the above case. One way of looking at it is that this represents the amount the groups would have to move apart before a gap starts to appear.
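To make the three cases concrete, here is a minimal Python sketch of the rule described above. It is an illustration only, not the tool's actual code: the function name `separation_gap` is hypothetical, and treating groups that merely touch as a gap of 0 is an assumption the discussion does not settle.

```python
from statistics import mean


def separation_gap(group1, group2):
    """Hypothetical helper: max/min-based gap between two groups of ratios."""
    a = (min(group1), max(group1))
    b = (min(group2), max(group2))
    # Order the two ranges so that `a` starts at or below `b`.
    if a > b:
        a, b = b, a
    a_lo, a_hi = a
    b_lo, b_hi = b

    # No overlap: the gap is the distance between the facing edges.
    # (Assumption: ranges that only touch get a gap of 0.)
    if a_hi <= b_lo:
        return b_lo - a_hi

    # One range surrounds the other: treat as indistinguishable.
    if b_hi <= a_hi or a_lo == b_lo:
        return -1.0

    # Partial overlap: negative size of the overlapping stretch.
    return b_lo - a_hi


# The example from the discussion: the outer values differ, the inner ones cross.
g1, g2 = [0.0, 0.6], [0.3, 1.0]
print(round(separation_gap(g1, g2), 2))      # -0.3
print(separation_gap([0.0, 1.0], [0.4, 0.6]))  # -1.0
# For contrast, the difference of the averages for the same two groups:
print(round(abs(mean(g2) - mean(g1)), 2))
```

For the overlapping example, the difference of the averages is about 0.35, which looks like a healthy separation, while the max/min rule reports -0.3. That contrast is the motivation for basing the gap on the extremes.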

Now, I could try to use some stats to account for outliers, but I think that could be a separate issue. Besides, the intent of this tool is not to look for judgement-call differences, but clear differences.

hepcat72 commented 5 years ago

Another issue... Given that I plan to penalize depths below a threshold (#12), negative scores would actually be improved (moved toward 0) if I multiply them by a fraction between 0 and 1. And if I plan to penalize no data (like in issue #10), it would have the same effect on negative values...
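A small Python sketch of the arithmetic behind that concern. The function name `naive_penalty`, the factor 0.5, and the sample scores are all made up for illustration; they are not taken from the tool.

```python
def naive_penalty(score, factor):
    """Hypothetical penalty: scale the score by a fraction between 0 and 1."""
    return score * factor


# Made-up scores: one clean separation, one overlapping pair of groups.
scores = {"good_gap": 0.4, "overlap": -0.3}
penalized = {name: naive_penalty(s, 0.5) for name, s in scores.items()}

for name in scores:
    change = "higher" if penalized[name] > scores[name] else "lower"
    print(f"{name}: {scores[name]} -> {penalized[name]} (sorts {change})")
```

The positive score drops from 0.4 to 0.2 as intended, but the negative score rises from -0.3 to -0.15, so the "penalty" would rank a bad result higher than before.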

I think that this means I should work out those issues first because I don't want to account for that in this change and then end up with a different penalty metric and have to redo this.

hepcat72 commented 5 years ago

OK, I worked out the penalization issue by defining 3 sorting metrics in issue #13. This means this issue should be the first one I tackle, because there won't be an allelic frequency penalty applied directly to this score.

hepcat72 commented 5 years ago

Finished

hepcat72 / vcfSampleCompare

Base the separation gap on max/min difference instead of the difference in the averages #9