Remaining questions for manuscript

[x] Regarding length standardization of the data

Ultimately, the only change made to this metric has been to make it in terms of kilobases, which appears to be more standard, and makes human comparisons of the values easier.

Yes, standardizing the WRSD counts by kilobase is a sensible approach. But is the standardization by partition length or the length of coding/non-coding regions still included as well? In our manuscript, it currently says:

Moreover, two types of data standardization were conducted per genome to account for differences 
in sequence length: we normalized the number of WRSD by partition length to account for the length 
differences between the four structural genome partitions, and we normalized the number of WRSD by 
the length ratio of coding versus non-coding sections to account for the length differences between 
the sum of all coding and all non-coding genome sections.

Are the above sentences from the manuscript even correct?

Or, put differently: could you please briefly explain how the values of the columns 'lowCovWin_abs' and 'lowCovWin_perKilobase' differ, so that I can check whether the manuscript text is correct?

[x] Regarding the removal of outliers

Similar to data transformation, I am cautious of outlier removal unless there is a justification when taking into account the nature of the data or the intended use of the data. As we are performing a study of samples that are intended to be representative of samples in a given taxon, it would make sense to remove observations that are incredibly dissimilar from the others. There are multiple standard approaches to outlier detection, and I elected to use Tukey's fences. This is mostly due to its common use for boxplots, which was one of the intended purposes of removing the outliers. Since I like to be conservative with outlier removal, the current k value for determining the range is 3, which Tukey considered to be "far out". To reflect the nature of the statistical tests, the outliers are determined on a class level, within the creation of the figure_data. In addition to this outlier filtering being done on the WRSD metric, it is also done for the E-score data to be consistent with how we process the data. The results for this appear to be in line with our intention to improve the quality of the figures and statistical tests without overly curating the data.
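A minimal sketch of the class-level filtering described above, in Python rather than in the project's own code. It is an illustration under stated assumptions, not the actual implementation: the function names, the `(class_label, value)` record layout, and the quartile convention (the "inclusive" method of Python's `statistics.quantiles`) are all hypothetical choices. It applies both fences with k = 3; a lower-fence-only variant would simply drop the upper comparison.

```python
from statistics import quantiles


def tukey_fences(values, k=3.0):
    """Return (lower, upper) Tukey fences; k=3 gives the "far out" fences.

    Assumption: quartiles use the "inclusive" interpolation method.
    Other quartile conventions can shift the fences slightly.
    Needs at least two values, otherwise quantiles() raises an error.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_by_class(records, k=3.0):
    """Keep only records whose value lies within the fences of its own class.

    records: list of (class_label, value) tuples (hypothetical layout).
    """
    # Group the values by class so that fences are computed per class.
    by_class = {}
    for label, value in records:
        by_class.setdefault(label, []).append(value)

    fences = {label: tukey_fences(vals, k) for label, vals in by_class.items()}

    return [
        (label, value)
        for label, value in records
        if fences[label][0] <= value <= fences[label][1]
    ]


# Made-up example values: 100 is far out within class "A" but typical for "B".
records = [("A", v) for v in (1, 2, 3, 4, 5, 100)]
records += [("B", v) for v in (100, 101, 102, 103, 104)]
kept = remove_outliers_by_class(records)
```

Because the fences are computed per class, the same value can be an outlier in one class and unremarkable in another, which is the point of determining outliers at the class level.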

Yes, that sounds great and is exactly what we need for the paper: a conservative detection and removal of outliers for Figures 2 and 3 as well as the statistical tests (i.e., Tables 3, 4 and S3). Is the below text correct so that it can be added to our manuscript?

WRSD values that were identified as outliers based on Tukey's ``far out`` fences of ±3 x interquartile 
range (IQR) were removed from the data prior to the analyses.
...
Hence, we marked all E-score values that were more than 3 x IQR below the first quartile value of 
the data set-wide E-score distribution (i.e., Tukey's ``far out`` lower fence) as outliers and removed them from 
the data prior to the analyses.

[x] Regarding the calculation of coding/noncoding ratios

The output of the tabular stats from PACVr has been updated to allow easier compilation of the data (and to remove unhandled errors). Unpartitioned statistics have been added to the coding and noncoding summary files to expedite the calculation of coding-noncoding ratios and WRSD in the paper's analysis. As noted above, the WRSD metric has been updated to be in terms of kilobases, which ultimately results in the metric being 1000 times greater than before.

Very good, but the new Unpartitioned lines do not seem to add up. For example, in genome NC_000932 we have the following situation:

NC_000932.1_coverage.summary.noncoding.tsv: Unpartitioned 10 29133 ...
NC_000932.1_coverage.summary.genes.tsv: Unpartitioned 49 95118 ...
NC_000932.1_summary.regions.tsv: Complete_genome 80 153633 ...

Regarding length standardization of the data

WRSD is normalized by the total length, in kilobases, of the specified grouping. Specifically, for each row in each summary file, lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000. This appears to correspond with the current wording of the manuscript.
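As a small illustration of that formula, here is a Python sketch. The function name is hypothetical, and the example assumes that the two numbers in the quoted `Unpartitioned 10 29133` row are lowCovWin_abs and regionLen, which the thread does not state explicitly.

```python
def per_kilobase(low_cov_win_abs: int, region_len: int) -> float:
    """Low-coverage window count per 1000 bases of the region.

    Mirrors lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000.
    """
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    return low_cov_win_abs / region_len * 1000


# Assumption: the row "Unpartitioned 10 29133" holds lowCovWin_abs = 10
# and regionLen = 29133 (column meaning not confirmed in the thread).
noncoding_per_kb = per_kilobase(10, 29133)  # roughly 0.34 windows per kilobase
```

The absolute column is therefore a raw count, while the per-kilobase column divides that count by the length of the same row's region, so rows with very different region lengths become comparable.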

Regarding the removal of outliers

The current wording of the outlier removal process appears accurate. If desired, we could also mention that the removal is performed separately for each class/state/grouping of the data under analysis. Intuitively this seemed like the right approach, and it appears to be the recommendation for ANOVA and nonparametric group tests. Mentioning it may therefore be redundant, as it is best practice for these sorts of tests.

Regarding the calculation of coding/noncoding ratios

The mismatch between the Complete_genome length and the sum of the Unpartitioned coding and noncoding lengths comes down to filtering out sliding windows that are shorter than the windowSize. For that specific example, NC_000932, this filtering reduces the length of the coding and noncoding groupings by 13% and 35%, respectively. In general, this filtering process results in a non-zero portion of each gene and noncoding gap being discarded from consideration. Additionally, the cumulative amount removed increases proportionally as the number of sequences increases and the typical length of these sequences decreases. For Complete_genome and the quadripartite groupings, the number of base pairs lost is very small, as these sequences are both long and few in number. The opposite is true for genes, and especially so for noncoding sequences. In summary, when creating the sliding windows, we take each defined sequence and divide it into sliding windows of windowSize; whatever remainder is left at the end (<windowSize) is filtered out.
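The effect can be sketched in a few lines of Python. This is an illustration only, not PACVr's code: it assumes non-overlapping windows of a fixed size, and the function names, the window size of 250, and all sequence lengths are hypothetical.

```python
def retained_length(seq_len: int, window_size: int) -> int:
    """Bases covered by full windows; the trailing remainder is dropped."""
    return (seq_len // window_size) * window_size


def fraction_lost(seq_lens, window_size: int) -> float:
    """Share of all bases discarded because they fall in a partial window."""
    total = sum(seq_lens)
    kept = sum(retained_length(n, window_size) for n in seq_lens)
    return (total - kept) / total


window_size = 250  # hypothetical value, chosen only for illustration

# One long sequence: at most one partial window is lost.
one_long = [150_100]
# Many short sequences: each one loses its own remainder, and sequences
# shorter than window_size are lost entirely.
many_short = [300, 420, 180, 950, 260]

loss_long = fraction_lost(one_long, window_size)
loss_short = fraction_lost(many_short, window_size)
```

In this toy example the long sequence loses well under 1% of its bases, while the set of short sequences loses more than a quarter, which matches the pattern described above for the whole-genome groupings versus genes and noncoding gaps.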

If we intend to use these Unpartitioned lengths to compare the corresponding groups' evenness of WRSD metric, changing these lengths would involve changing how these sliding windows are filtered. The main idea I considered related to this would be to normalize the WRSD counts according to the sliding window length instead of filtering. However, I do not know if this would be an appropriate approach.

Alternatively, if we do not want to use these lengths for the metric comparisons (E-score and WRSD) but just want to compare the lengths directly, in isolation, this could be provided as an additional output of the tabular statistics. If that is the case, I have recently refactored the related portion of PACVr, which should make this easier to add.

michaelgruenstaeudl / PACVr