Closed michaelgruenstaeudl closed 1 month ago
The normalization of WRSD is performed for the total kilobase count of the specified grouping. To be specific, for each row in each summary file, lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000. This does appear to correspond with the current wording of the manuscript.
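As a minimal sketch of that per-row calculation (written in Python for illustration only; this is not PACVr's code, and the example row values are hypothetical):

```python
def low_cov_win_per_kilobase(low_cov_win_abs: int, region_len: int) -> float:
    """Apply lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000 to one row."""
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    return low_cov_win_abs / region_len * 1000


# Hypothetical row: 12 low-coverage windows in a region of 24,000 bp.
example = low_cov_win_per_kilobase(12, 24_000)
print(example)  # roughly 0.5 low-coverage windows per kilobase
```

The point of the division is that two groupings with the same absolute count but different lengths end up with different per-kilobase values, which is what makes them comparable.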
The current wording of the outlier removal process appears accurate. If desired, we could also mention that outlier removal is performed separately for each class/state/grouping of the data under analysis. Intuitively this seemed like the right approach, and it appears to be the recommendation for ANOVA and nonparametric group tests. Mentioning it may therefore be redundant, as it is best practice for these sorts of tests.
The cause of this mismatch between the Complete_genome length and the sum of the Unpartitioned lengths of coding/noncoding comes down to filtering out sliding windows that are shorter than windowSize. For that specific example, NC_000932, this filtering reduces the length of the coding and noncoding groupings by 13% and 35%, respectively. In general, this filtering process results in a non-zero portion of each gene and noncoding gap being discarded from consideration. Additionally, the cumulative amount removed increases as the number of sequences increases and the typical length of these sequences decreases. For Complete_genome and the quadripartite groupings, the number of base pairs lost is very small, as these sequences are both long and few in number. The opposite is true for genes, and especially so for noncoding sequences. In summary, when creating the sliding windows, we take each defined sequence, divide it into sliding windows of windowSize, and filter out whatever remainder is left at the end (<windowSize).
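The effect described above can be sketched as follows. This is an illustrative Python sketch, not PACVr's implementation: it assumes non-overlapping windows, and the window size of 250 and the list of short sequences are hypothetical values chosen only to show the trend.

```python
def retained_length(seq_len: int, window_size: int) -> int:
    """Base pairs covered by full windows; the trailing remainder is dropped."""
    return (seq_len // window_size) * window_size


def fraction_lost(seq_lens: list[int], window_size: int) -> float:
    """Share of all base pairs discarded across a group of sequences."""
    total = sum(seq_lens)
    kept = sum(retained_length(n, window_size) for n in seq_lens)
    return (total - kept) / total


window_size = 250              # hypothetical windowSize
one_long = [153_633]           # a single long sequence (the genome length quoted in this thread)
many_short = [300] * 500       # hypothetical group of many short sequences

print(fraction_lost(one_long, window_size))    # well under 0.1%
print(fraction_lost(many_short, window_size))  # about one sixth
```

Every sequence loses at most one partial window, so a group made of many short sequences loses a much larger share of its total length than a group made of a few long ones.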
If we intend to use these Unpartitioned lengths to compare the corresponding groups' evenness of WRSD metric, changing these lengths would involve changing how these sliding windows are filtered. The main idea I considered related to this would be to normalize the WRSD counts according to the sliding window length instead of filtering. However, I do not know if this would be an appropriate approach.
Alternatively, if we do not want to use these lengths for the metric comparisons (E-score and WRSD) but just want to compare the lengths directly, this could be an additional output alongside the tabular statistics. If that is the case, I have recently refactored the relevant portion of PACVr, which should make this easier.
Thank you for the clarifications. I believe that our code is then fine as it is!
Closing this issue.
Yes, standardizing the WRSD counts by kilobase is a sensible approach. But is the standardization by partition length or the length of coding/non-coding regions still included as well? In our manuscript, it currently says:
Are the above sentences from the manuscript even correct?
Or said differently: can you please briefly explain how the values of the columns 'lowCovWin_abs' and 'lowCovWin_perKilobase' differ, so that I can check whether the manuscript text is correct?
Yes, that sounds great and is exactly what we need for the paper: a conservative detection and removal of outliers for Figures 2 and 3 as well as the statistical tests (i.e., Tables 3, 4 and S3). Is the below text correct so that it can be added to our manuscript?
Very good, but the new Unpartitioned lines do not seem to add up. For example, in genome NC_000932 we have the following situation:
Unpartitioned 10 29133 ...
Unpartitioned 49 95118 ...
Complete_genome 80 153633 ...
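
To make the gap concrete, here is a small Python check of the three rows. It assumes that the last number shown on each row is that row's length in base pairs; the variable names are illustrative.

```python
# Lengths copied from the three rows quoted above (assumed to be in bp).
unpartitioned_lengths = [29_133, 95_118]
complete_genome_length = 153_633

unpartitioned_total = sum(unpartitioned_lengths)
shortfall = complete_genome_length - unpartitioned_total
shortfall_pct = 100 * shortfall / complete_genome_length

print(unpartitioned_total, shortfall, round(shortfall_pct, 1))
```

Under that assumption, the two Unpartitioned rows together fall short of the Complete_genome length by roughly a fifth.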