rankVCF revamp notes

Current Behavior:

There is some basic filtering that is done using an AO/DP ratio and a minimum depth, but there's also the ability to compare a group of samples to another group of samples to see if they differ. The comparison works like this:
comparison: do N samples of set A differ from M samples of set B
comparison requirement: Either N must represent a majority of A or M must represent a majority of B
Determination of sample state (same as REF or different from REF):
*By AO/DP ratio threshold: for each sample of sets A & B, is its AO/DP ratio above or below the threshold?
If above the threshold, it's different from REF
diff_state_count incremented for parent set (A or B)
If below the threshold, it's the same as REF
same_state_count incremented for parent set (A or B)
Application of the threshold requirement: in order to be subject to the threshold, DP must be greater than or equal to minDP.
Determination of "hit" for comparison:
If (A.diff_state_count >= N and B.same_state_count >= M) OR (A.same_state_count >= N and B.diff_state_count >= M)
hit_count++
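In code form, the state determination and hit test amount to something like this (a minimal sketch; all the names here are invented for illustration and may differ from the script's actual internals):

    use strict;
    use warnings;

    # $sets   = { A => [sample names], B => [sample names] }
    # $ratios = { sample name => AO/DP }, $depths = { sample name => DP }
    sub is_hit {
        my ($sets, $ratios, $depths, $min_dp, $thresh, $n, $m) = @_;
        my %c = (A => { same => 0, diff => 0 }, B => { same => 0, diff => 0 });
        for my $set ('A', 'B') {
            for my $s (@{ $sets->{$set} }) {
                next if $depths->{$s} < $min_dp;   # only apply the threshold at sufficient depth
                $c{$set}{ $ratios->{$s} > $thresh ? 'diff' : 'same' }++;
            }
        }
        # A hit means enough of one set looks like REF while enough of the other does not
        return ($c{A}{diff} >= $n && $c{B}{same} >= $m)
            || ($c{A}{same} >= $n && $c{B}{diff} >= $m);
    }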
The problem with this is that thresholding is sharp: values can end up counted as different when they're actually very close to the threshold (and to each other). This is dealt with by sorting so that the most different records come to the top. However, you still miss stuff.
Proposal:
Instead of filtering via a threshold, just sort based on the degree of difference between the 2 compared sets of samples. I could average one set and sum the differences of all the samples in the other set and do a descending sort on that sum. This, however, eliminates the ability to specify a number of samples in the set that must differ. Perhaps instead of averaging one set's ratios, I could take N (or M) off the bottom and/or top of the ratio-value-sorted set, average those values, and then sum the differences in ratios to that/those ratios in the other set.
For example, let's take set A:0.9,0.91,0.99 and set B:0.0,0.04,0.5,0.6,0.98. I want at least 2 values of set A to differ from at least 2 values of set B. Whether I take the bottom 2 or top 2 of set A, the average is going to be high and I will find numerous values in set B that differ, but let's walk through the steps.
We should take the max difference as the value for the "row" of sample values to use in a descending sort.
Hold on... When I do the sum, I should only use the opposite (top/bottom) N (or M):
case 1: top 2 of A compared to bottom 2 of B
ave(0.91,0.99) = 0.95
sum(0.95-0.0, 0.95-0.04) = 0.95 + 0.91 = 1.86
case 2: bottom 2 of A compared to top 2 of B
ave(0.9,0.91) = 0.905
sum(0.905-0.6, 0.905-0.98) = 0.305 - 0.075 = 0.23 (signed differences; using absolute values would give 0.38 and break the symmetry with case 3)
case 3: top 2 of B compared to bottom 2 of A
ave(0.6,0.98) = 0.79
sum(0.9-0.79,0.91-0.79)=0.23
case 4: bottom 2 of B compared to top 2 of A
ave(0.0,0.04) = 0.02
sum(0.91-0.02,0.99-0.02)=1.86
Since the signed sum of differences from one pair's average is just (sum of that pair) minus (sum of the other pair), case 1 equals case 4 and case 2 equals case 3, so I only need to calculate cases 1 & 2.
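As a sketch, the score for a row could be computed like this (a toy illustration; the function and variable names are mine, not the script's):

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Signed separation score: average the top/bottom $n of one set, sum the
    # signed differences to the opposite bottom/top $m of the other set, and
    # keep the larger magnitude as the row's descending-sort value.
    sub separation_score {
        my ($a_vals, $n, $b_vals, $m) = @_;
        my @as = sort { $a <=> $b } @$a_vals;
        my @bs = sort { $a <=> $b } @$b_vals;
        my $top_avg = sum(@as[-$n .. -1]) / $n;                    # case 1
        my $case1   = sum(map { $top_avg - $_ } @bs[0 .. $m - 1]);
        my $bot_avg = sum(@as[0 .. $n - 1]) / $n;                  # case 2
        my $case2   = sum(map { $bot_avg - $_ } @bs[-$m .. -1]);
        return abs($case1) > abs($case2) ? abs($case1) : abs($case2);
    }

    # separation_score([0.9, 0.91, 0.99], 2, [0.0, 0.04, 0.5, 0.6, 0.98], 2)
    # returns 1.86 for the example above (case 1 wins).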
Now the question is... which samples are the "different" samples that are added to the list in the output? In other words, the output defined (currently) by:
print "GROUPRULEPAIR$group_pair_rule\[SET(",join(',',@set1),")>=$set1_min DIFFERS FROM SET(",join(',',@set2),")>=$set2_min]";
e.g.
GROUPRULEPAIR1[SET(A1,A2,A3)>=2 DIFFERS FROM SET(B1,B2,B3,B4,B5)>=2]
Intuitively, my feeling is that these are what should be in the output:
GROUPRULEPAIR1[SET(A1,A2,A3)>=2 DIFFERS FROM SET(B1,B2)>=2]
excluding B3, B4, & B5. However, I don't have anything to base that on. Perhaps I should use a sliding window of size N (or M)? I'm basically ordering the output based on how different the samples are.
Also, I haven't taken the minimum read depth requirement into account yet. A read depth threshold, like the ratio threshold, is rather sharp/harsh (though it does clean up the amount of output). Perhaps I should weight by number of reads?
Another problem is that I'm comparing the max AOs of each of the samples, and the SNP value might be different in those cases. I created an issue for this problem already, so in this revamp, I need to make sure I'm comparing the same respective AO values. I could also employ the RO (reference observations) values in some way. This would allow me to look at more than just the one AO the script currently looks at.
If I average instead of sum the difference values, the values will be between 0 and 1. I could use a difference threshold in that case, e.g. the ratio averages of A.top versus B.bottom or A.bottom versus B.top must differ by at least some value between 0 & 1 (e.g. 0.5) in order for the samples in the respective sets to be considered different. AND, if I do this for all states (RO, AO1, AO2, and AO3), each one becomes a boolean comparison, and I can calculate the row's difference value as the max of those values.
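A sketch of that per-state averaging and max (the data layout and state keys here are hypothetical):

    use strict;
    use warnings;
    use List::Util qw(sum max);

    # Average (rather than sum) the differences from one set's average to the
    # other set's values; the result stays in [0,1] for ratios in [0,1].
    sub avg_separation {
        my ($set_avg, $other_vals) = @_;
        return sum(map { abs($set_avg - $_) } @$other_vals) / scalar(@$other_vals);
    }

    # %$a_avg and %$b_vals are keyed by state, e.g. (RO => ..., AO1 => ...);
    # the row's difference value is the max across states.
    sub max_state_separation {
        my ($a_avg, $b_vals) = @_;
        return max(map { avg_separation($a_avg->{$_}, $b_vals->{$_}) } keys %$a_avg);
    }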
Still not sure how to create an output pair of sets of samples though. One filter I can use is that the ratios of A.top must be larger than B.bottom or those of B.top must be larger than A.bottom. Each next value would have to differ from the other set's N (or M) average in order to be added. If the sets are below their minimums, they could either be filtered out or output differently (e.g. in another file, or else after a couple of empty lines).
Weighting... I could take the highest and lowest observation count and weight the AO/DP ratios by (DP - lowest DP) / (highest DP - lowest DP). I don't think that the highest DP is something I can rely on though. The distro typically looks very uneven. I could weight by percentile? Ugh. I guess I'm not going to worry about weight just now. I'm just not sure how to do it reliably. I think I'm sorting things pretty well right now, which takes read depth into account, but that may go away with this new way to compare stuff.
As for creating the set of samples in each group, I could just use the top N / bottom M sample names. Or, I could use the separation threshold. I could either take the average of the top N in A and keep adding from the bottom of B until the difference between the value being added from B and the average of the N values from A dips below the threshold. Similarly, I could add to A from its top until the separation of the Nth value from A and the average of the bottom M of B dips below the threshold. I could also use that for filtering. As soon as the separation of the top N from A and the Mth value from the bottom of B dips below the threshold, I can filter that result out. Or, instead of filtering, I can create a dividing line.
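One direction of that greedy growing might look like this (a sketch under invented names; only the "grow B from its bottom" direction is shown):

    use strict;
    use warnings;
    use List::Util qw(sum);

    # Average the top $n ratios of A, then add samples from the bottom of B
    # while each candidate's separation from that average stays >= $sep_ratio.
    sub grow_set_b {
        my ($a_ratios, $n, $b_samples, $sep_ratio) = @_;   # $b_samples = [ [name, ratio], ... ]
        my @desc  = sort { $b <=> $a } @$a_ratios;
        my $a_avg = sum(@desc[0 .. $n - 1]) / $n;
        my @grown;
        for my $s (sort { $a->[1] <=> $b->[1] } @$b_samples) {   # bottom of B first
            last if abs($a_avg - $s->[1]) < $sep_ratio;          # separation dipped below threshold
            push @grown, $s->[0];
        }
        return @grown;
    }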
Finally... can I use featureProximity.pl for the ordering? Feature Proximity computes the closest distance between a multi-coordinate/segmented feature in file 1 and a multi-coordinate/segmented feature in file 2. For each sample, it outputs that closest distance. Then it sums up those distances in another column on which you can sort. By multi-coordinate/segmented features, I mean things like exons. Each has a start and stop. It also means breakpoints in an SV. Each breakpoint's start and stop are the same (unless there's microhomology around the breakpoint - then there's a range in which the break falls). In either case, the feature in the second file gets a closest computed distance on any one of its partner record's exons/breakpoints. If the feature in the partner file is also segmented into either exons or breakpoints, the closest pair is the one used to compute the distance.
Now, can that translate to this problem? The (1 - AO/DP) ratio can be a start & stop coordinate. I subtract AO/DP from 1 in order to take advantage of the "closest" feature finding functionality. We want to minimize that number. Each sample can have up to 4 of these values (when you include RO/DP). But they are not segments of the same feature...
9/7/2018
I realized this morning that since the separation ratio could apply both to which samples to add to the groups AND to whether or not to filter the row (e.g. none of the group rules are met, so filter it out), a user may want to do one but not the other, i.e. add to the groups using the ratios but don't filter, or filter but don't add to the base groups. So I could have a --sep-ratio along with a --filter flag and a --greedy-grow flag (negbool, defaulted to 'on'). The sep ratio would be required and default to something modest. It's required so that groups are created with some real separation between them. Without a sep ratio, a sample could be added where one group's average is 0.5 and the added sample's value is 0.50000001 or 0.49999999.
I also realized that there's a critical difference between the way featureProximity works and the way I want to use the different AO values. In featureProximity, all of the "segments" (think "exons") are part of the same feature, and proximity to ANY of them is proximity to the feature. In this case, however, I don't want to compare the "distances" between AO1 in one sample and AO2 in another sample. I always want to compare AO1 to AO1 across samples. That's where the separation needs to be and where I would use 1 - separation-ratio to determine the feature proximity "distance". If there were only 1 ratio to compute between samples, then I could do it. featureProximity also has no sample column, so I would have to create a bunch of files (one for each SNP/value/group combo). E.g. 1 file would have SNP1, group1, and AO1's average from all the samples in group1, versus SNP1, group2's members along with a sample column, and AO1's values from each group2 member. Not to mention I would have to invert the ratios somehow. The proximity of AO1's average to each sample's AO1 value would be computed and a sum of the distances would be output in a new column; each sample's AO1 would get a distance from the average. This whole strategy would just be way too convoluted and incomprehensible, so I should just stick with a custom script.
I also realized that I should not set a min depth or do any weighting. Users can figure that out for themselves or do filtering another way. Also, when I use AO/DP, a more accurate replacement would be something like AO1/(RO+AO2+AO3), where the AOs are the different SNP states. Also, I should add the ability to use GT, where the "different" samples are required to have a different GT. Thus, an option could be --diff-type that's an enum with values genotype-call,allelic-freq.
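The proposed ratio replacement, taken as written above, could be sketched like this (purely illustrative; note the allele's own count is excluded from the denominator, per the note as written):

    use strict;
    use warnings;
    use List::Util qw(sum);

    sub allele_ratio {
        my ($i, $ro, @aos) = @_;   # $i = index of the AO of interest, $ro = ref obs
        my $others = sum(0, map { $aos[$_] } grep { $_ != $i } 0 .. $#aos);
        my $denom  = $ro + $others;
        return $denom ? $aos[$i] / $denom : 0;   # guard against zero observations
    }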
Let's review the options I want to have...
OLD OPTIONS TO KEEP
-s -- sample group
-d -- group diff min
OLD OPTIONS TO REMOVE
-r -- min depth
-p -- min discordants
-c -- min splits
-v -- min sv support reads
NEW OPTIONS
--diff-type <genotype-call,allelic-freq> -- enum DEFAULT=allelic-freq
--sep-ratio <float> -- float DEFAULT=0.2 -- only used for allelic-freq mode
--filter -- bool DEFAULT=off
--greedy-grow -- negbool ADVANCED DEFAULT=on
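Declared with Getopt::Long, the new options might look like this (a sketch only; the real script's option-handling layer may differ):

    use strict;
    use warnings;
    use Getopt::Long;

    my $diff_type   = 'allelic-freq';   # enum: genotype-call | allelic-freq
    my $sep_ratio   = 0.2;              # only used in allelic-freq mode
    my $filter      = 0;                # bool, off by default
    my $greedy_grow = 1;                # negbool, on by default (advanced)

    GetOptions(
        'diff-type=s'  => \$diff_type,
        'sep-ratio=f'  => \$sep_ratio,
        'filter!'      => \$filter,
        'greedy-grow!' => \$greedy_grow,   # negbool: allows --no-greedy-grow
    ) or die "Invalid options\n";
    die "Invalid --diff-type: $diff_type\n"
        unless $diff_type =~ /^(?:genotype-call|allelic-freq)$/;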
Regarding the requirement that there be at least 1 AO call and less than $num_samples AO calls, I should convert that filter into a sort of default-comparison. I could do it by saying if no groups are defined, create a pair of empty groups that imply 'any 1 sample' differs from 'any other sample'.
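That default could be as simple as this sketch (the data layout and sample names are invented for illustration):

    use strict;
    use warnings;

    my @all_samples = qw(S1 S2 S3);   # hypothetical sample names from the VCF header
    my @group_pairs;                  # normally populated from the -s/-d options

    # Default comparison when no groups are defined:
    # 'any 1 sample' differs from 'any other sample'.
    push @group_pairs, {
        set1 => [@all_samples], set1_min => 1,
        set2 => [@all_samples], set2_min => 1,
    } unless @group_pairs;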
I should also determine a better format to output in. I should be able to submit multiple comparisons, but also know which (1 or more) group rules were the hit that kept the record. Thus, I need the following columns: multiple group rule columns, each with its own pair of columns listing the sets' members, a pair of columns indicating their support ratios, a single column indicating the SNP value, a column for its position, and a column for the average separation. So it could look something like this (rule name, SNP value, position, average separation, then each set's members and support ratios):
grouprule1 A 123456789 0.548 S1,S2 20/22,21/23 S5,S9,S10 2/20,0/33,5/100
Only the numbers for the highest separation would be reported, i.e. if the value G is also a hit but its average separation was lower, it would not be output. This series of columns would repeat for other hit groups:
grouprule2...
I might consider only outputting the data relevant to this analysis.
Regarding SV rules, I do not need to implement thresholds for PE, SR, or SU, because there exists a tool to filter on requirements like those. lumpy outputs AO (which is the combo of split, soft-clipped, and discordant reads). svTyper adds DP, which lumpy doesn't output; however, I needed to check whether lumpy outputs RO, in which case I could compute DP by adding AO + RO. No... I checked, and lumpy does not output anything that could be used to get read depth, and that's got to be because it requires bam files that contain only split reads or discordant paired-end reads. It might be possible to use the STRANDS number as a substitute, but it would never contain that stuff given the filtering of the bam file. The svTyper output contains both DP and RO numbers.
A couple of notes about lumpy/svTyper. The outputs from the pipeline appear to assume ploidy 2, and though I do not see multiple AO values, I suspect there could be... and this may be a case where the AO values could be inter-compared for separation... Also, the pipeline I have runs each sample separately. I will have to look into the lumpy pipeline to see if I can address these 3 issues:
set lumpy ploidy to 1
See if multiple AO values are possible in lumpy/svTyper and whether they are inter-comparable in terms of separation thresholds
See if the lumpy samples can be run together so that rankVCF can be used
If samples can't be run together in lumpy analyses, then the filtering should be done using another tool.
Also, I should REQUIRE that RO (or DP?) be present, meaning raw lumpy output is incompatible. Must be svTyper output.
This all simplifies rankVCF to only be used on AO & RO (or DP?) values. Perhaps I should change the rankVCF tool name with this change. Perhaps vcfCompare, vcfSampleCompare, or vcfSeparator, or vcfGroupSort or vcfGroupCompare...???
Alright, here are the action items for what I've concluded:
[x] Increment version to 2.0
[x] Change name to reflect different working concept (i.e. only does sample comparisons)
[x] Remove ratio threshold
[x] Remove min depth threshold
[x] Create a default group of any sample differing from any other sample. Compare all mins to all maxes and output the one with the biggest difference.
[x] Sort based on the difference.
[x] Change output format to a selection of columns from the input along with a set of columns for each comparison with the lists of samples that differ, their ratios, and their SNP value
[x] Change the basis of a difference to the separation being above a sep threshold
[x] Add a --diff-type option for either genotype or allelic freq
[x] Add a filter option to not print anything that doesn't have a difference
[x] Add a flag to indicate whether or not to grow the groups
[x] Remove the SV threshold option
[x] Output header for the non-VCF outfile
[x] Round long decimal values (between 0 and 1) to 4 significant digits (see the sprintf note after this list)
[x] Update existing test
[x] Add tests
[x] Update Makefile.PL for the test module requirements
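For the rounding item above, sprintf's %g conversion handles significant-digit rounding (a one-line sketch, assuming %g semantics suit the output format):

    my $rounded = sprintf('%.4g', 0.123456789);   # yields "0.1235"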