Make the sub-sort be on allelic frequency instead of average depth

hepcat72 commented 5 years ago

Sub-sort is currently using average depth, which is fairly meaningless at adequate depths. Instead, I should allow the sub-sort to be based on allelic frequency, (when --genotype is true).

I would have to change the help, README, and potentially the usage to reflect the change.

If the user supplies --nogenotype, allelic frequency would become the primary sort and there would be no secondary sort.

hepcat72 commented 5 years ago

This should also cause the AVEDP columns to to be changed to FREQ_SEP_SCORE and for the current SEP_SCORE columns to be changed to GT_SEP_SCORE. Perhaps I should add columns for depth and low depth penalty.

Note, these changes all refer to things in other issues.

hepcat72 commented 5 years ago

I think I will change:

PAIR_AVEDP to PAIR_MINDP
BEST_AVEDP to BEST_MINDP
BEST_SEP_SCORE to BEST_GT_SCORE
PAIR_SEP_SCORE to PAIR_GT_SCORE

I will add:

BEST_OR_SCORE
PAIR_OR_SCORE

"OR" will stand for "observation ratio" and of course "GT" stands for genotype.

Presumably, OR scores could be independent, but I will not be treating them that way. The GT call will take precedence regardless of what the score is. They should be independent though even though it will be treated as dependent. Take diploid calls for example. 0/1 is treated as equally different from 0/0 and 1/1 as 0/0 and 1/1 are to each other, but the OR scores will definitely prioritize 0/0 versus 1/1 as more different than either is from 0/1. Otherwise, GT and OR scores should be related. When ploidy is 3 or more, or when there are multiple SNP states extant, OR becomes less relevant. In order to differentiate between those, we would need a lot of data and some states to see what's real and what's not. For now though, I'm going to keep it simple. Relying on the GT call from an app that presumably has those stats built into the call will have to handle that.

hepcat72 commented 5 years ago

Looks like I will also need to split STATE(S)_USED into:

GT_STATES_USED
OR_STATE_USED

The GT_STATES_USED will be the GT calls in each of the 2 groups of the best pair separated by a semicolon. Multiple GT calls in 1 group will be joined by '+'.

The OR_STATE_USED will be the variant state number (0=REF, 1...=ALT).

hepcat72 commented 5 years ago

Looks like I also need to split PAIR[12]_SCORE_DATA into:

GROUP1_GTS
GROUP1_ORS
GROUP2_GTS
GROUP2_ORS

And I will change PAIR[12]_MEMBERS to:

GROUP1_SAMPLES
GROUP2_SAMPLES

The usage of "PAIR" was misleading.

hepcat72 commented 5 years ago

Things to modify:

X createDynamicMaxSampleGroupPair
X createStaticMinSampleGroupPair
X sampleGroupPairPasses
[x] growAPair
[x] getBestScoringData (add a second rank in the returned object)
[x] header print in main
[x] data print in main
[x] It would be nice to use the output headers as hash keys when they match
[x] Change the sub names to better reflect their usage
[x] Change ave DP to min DP
[x] Fix the bug described below
[x] Update the README.
[x] Fix tests
[x] Make sure install works

The observation ratios are wrong for one of the samples in each case and its causing the GT states to be mixed:

diff <(cut -f 1,2,3,4,5,6,7,8,10,11,12,14,15,16,17,18,19,20,21 /Users/rleach/PROJECT/GABRIEL/SHEFFLER-VARIOUS20170920-RERUN/ANALYSIS/mapped_to_crescentus/NEW_PIPELINE_RESULTS/SNPS/1-RAW/FreeBayes_SnpEff.vcf.v2_01.edge.a-1.i11.txt) <(cut -f 1,2,3,4,5,6,7,8,10,11,12,14,15,16,17,18,19,20,21 /Users/rleach/PROJECT/GABRIEL/SHEFFLER-VARIOUS20170920-RERUN/ANALYSIS/mapped_to_crescentus/NEW_PIPELINE_RESULTS/SNPS/1-RAW/FreeBayes_SnpEff.vcf.v2_01.edge.a-1.i11.mindp.txt)
123357c123357
< Chromosome    3925317 .   G   C   1   1   0   1   0   1   0+1;0   0   Caulobacter_spp,Caulobacter_crescentus_strain_CB15  1,0 0/2,0/1 NA1000  0   20/20
---
> Chromosome    3925317 .   G   C   1   1   0   1   0   1   1+0;0   0   Caulobacter_spp,Caulobacter_crescentus_strain_CB15  1,0 0/2,0/1 NA1000  0   20/20
123480c123480
< Chromosome    3135247 .   TCGCAAGGCGGACGAT    CCGTAAGCCCGACAAC    1   0.6667  0   1   0   0.6667  0+1;0   0   Caulobacter_spp,Caulobacter_crescentus_strain_CB15  1,0 0/4,1/3 NA1000  0   6/6
---
> Chromosome    3135247 .   TCGCAAGGCGGACGAT    CCGTAAGCCCGACAAC    1   0.6667  0   1   0   0.6667  1+0;0   0   Caulobacter_spp,Caulobacter_crescentus_strain_CB15  1,0 0/4,1/3 NA1000  0   6/6

Whoops, not a bug. It's an unexpected case. The DP is 1, but AO and RO are both 0 for sample Caulobacter_crescentus_strain_CB15, so the state chosen is the reference state because the NA1000 sample is the only one with that state. The others are the variant 'C' and what I'm guessing is a gap.

hepcat72 commented 5 years ago

OK. I do not need to compute the GT and OR scores independently. Each forms its own group depending on the genotype mode. I just need to take that group and calculate the other score. If run in genotype mode, I get a group and I can find the best OR score using the variant states the same way I do when creating the group in no-GT mode. And when in no-GT mode, I can evaluate the GT score given the resulting group built in no-GT/OR mode.

I should just create a method that gets the best variant state and score and a method that gets the GT score. It also means that I do not need to re-write the create*SampleGroupPair, sampleGroupPairPasses, or growAPair methods.

hepcat72 commented 5 years ago

New column headers:

#CHROM POS ID REF ALT BEST_PAIR BEST_GT_SCORE BEST_OR_SCORE BEST_MINDP PAIR_NUM PAIR_GT_SCORE PAIR_OR_SCORE PAIR_MINDP STATES_USED_GT STATE_USED_OR GROUP1_SAMPLES GROUP1_GTS GROUP1_ORS GROUP2_SAMPLES GROUP2_GTS GROUP2_ORS

hepcat72 commented 5 years ago

merged

hepcat72 / vcfSampleCompare

Make the sub-sort be on allelic frequency instead of average depth #11