Closed hepcat72 closed 5 years ago
This should also cause the AVEDP columns to to be changed to FREQ_SEP_SCORE and for the current SEP_SCORE columns to be changed to GT_SEP_SCORE. Perhaps I should add columns for depth and low depth penalty.
Note, these changes all refer to things in other issues.
I think I will change:
I will add:
"OR" will stand for "observation ratio" and of course "GT" stands for genotype.
Presumably, OR scores could be independent, but I will not be treating them that way. The GT call will take precedence regardless of what the score is. They should be independent though even though it will be treated as dependent. Take diploid calls for example. 0/1 is treated as equally different from 0/0 and 1/1 as 0/0 and 1/1 are to each other, but the OR scores will definitely prioritize 0/0 versus 1/1 as more different than either is from 0/1. Otherwise, GT and OR scores should be related. When ploidy is 3 or more, or when there are multiple SNP states extant, OR becomes less relevant. In order to differentiate between those, we would need a lot of data and some states to see what's real and what's not. For now though, I'm going to keep it simple. Relying on the GT call from an app that presumably has those stats built into the call will have to handle that.
Looks like I will also need to split STATE(S)_USED into:
The GT_STATES_USED will be the GT calls in each of the 2 groups of the best pair separated by a semicolon. Multiple GT calls in 1 group will be joined by '+'.
The OR_STATE_USED will be the variant state number (0=REF, 1...=ALT).
Looks like I also need to split PAIR[12]_SCORE_DATA into:
And I will change PAIR[12]_MEMBERS to:
The usage of "PAIR" was misleading.
Things to modify:
The observation ratios are wrong for one of the samples in each case and its causing the GT states to be mixed:
diff <(cut -f 1,2,3,4,5,6,7,8,10,11,12,14,15,16,17,18,19,20,21 /Users/rleach/PROJECT/GABRIEL/SHEFFLER-VARIOUS20170920-RERUN/ANALYSIS/mapped_to_crescentus/NEW_PIPELINE_RESULTS/SNPS/1-RAW/FreeBayes_SnpEff.vcf.v2_01.edge.a-1.i11.txt) <(cut -f 1,2,3,4,5,6,7,8,10,11,12,14,15,16,17,18,19,20,21 /Users/rleach/PROJECT/GABRIEL/SHEFFLER-VARIOUS20170920-RERUN/ANALYSIS/mapped_to_crescentus/NEW_PIPELINE_RESULTS/SNPS/1-RAW/FreeBayes_SnpEff.vcf.v2_01.edge.a-1.i11.mindp.txt)
123357c123357
< Chromosome 3925317 . G C 1 1 0 1 0 1 0+1;0 0 Caulobacter_spp,Caulobacter_crescentus_strain_CB15 1,0 0/2,0/1 NA1000 0 20/20
---
> Chromosome 3925317 . G C 1 1 0 1 0 1 1+0;0 0 Caulobacter_spp,Caulobacter_crescentus_strain_CB15 1,0 0/2,0/1 NA1000 0 20/20
123480c123480
< Chromosome 3135247 . TCGCAAGGCGGACGAT CCGTAAGCCCGACAAC 1 0.6667 0 1 0 0.6667 0+1;0 0 Caulobacter_spp,Caulobacter_crescentus_strain_CB15 1,0 0/4,1/3 NA1000 0 6/6
---
> Chromosome 3135247 . TCGCAAGGCGGACGAT CCGTAAGCCCGACAAC 1 0.6667 0 1 0 0.6667 1+0;0 0 Caulobacter_spp,Caulobacter_crescentus_strain_CB15 1,0 0/4,1/3 NA1000 0 6/6
Whoops, not a bug. It's an unexpected case. The DP is 1, but AO and RO are both 0 for sample Caulobacter_crescentus_strain_CB15, so the state chosen is the reference state because the NA1000 sample is the only one with that state. The others are the variant 'C' and what I'm guessing is a gap.
OK. I do not need to compute the GT and OR scores independently. Each forms its own group depending on the genotype mode. I just need to take that group and calculate the other score. If run in genotype mode, I get a group and I can find the best OR score using the variant states the same way I do when creating the group in no-GT mode. And when in no-GT mode, I can evaluate the GT score given the resulting group built in no-GT/OR mode.
I should just create a method that gets the best variant state and score and a method that gets the GT score. It also means that I do not need to re-write the create*SampleGroupPair, sampleGroupPairPasses, or growAPair methods.
New column headers:
#CHROM POS ID REF ALT BEST_PAIR BEST_GT_SCORE BEST_OR_SCORE BEST_MINDP PAIR_NUM PAIR_GT_SCORE PAIR_OR_SCORE PAIR_MINDP STATES_USED_GT STATE_USED_OR GROUP1_SAMPLES GROUP1_GTS GROUP1_ORS GROUP2_SAMPLES GROUP2_GTS GROUP2_ORS
merged
Sub-sort is currently using average depth, which is fairly meaningless at adequate depths. Instead, I should allow the sub-sort to be based on allelic frequency, (when --genotype is true).
I would have to change the help, README, and potentially the usage to reflect the change.
If the user supplies --nogenotype, allelic frequency would become the primary sort and there would be no secondary sort.