michaelgruenstaeudl / PACVr

Plastome Assembly Coverage Visualization in R

Remaining questions for manuscript #47

Closed michaelgruenstaeudl closed 1 month ago

michaelgruenstaeudl commented 1 month ago

Ultimately, the only change made to this metric has been to make it in terms of kilobases, which appears to be more standard, and makes human comparisons of the values easier.

Yes, standardizing the WRSD counts by kilobase is a sensible approach. But is the standardization by partition length or the length of coding/non-coding regions still included as well? In our manuscript, it currently says:

Moreover, two types of data standardization were conducted per genome to account for differences 
in sequence length: we normalized the number of WRSD by partition length to account for the length 
differences between the four structural genome partitions, and we normalized the number of WRSD by 
the length ratio of coding versus non-coding sections to account for the length differences between 
the sum of all coding and all non-coding genome sections.

Are the above sentences from the manuscript even correct?

Or, put differently: can you please briefly explain how the values of the columns 'lowCovWin_abs' and 'lowCovWin_perKilobase' differ, so that I can check whether the manuscript text is correct?

Yes, that sounds great and is exactly what we need for the paper: a conservative detection and removal of outliers for Figures 2 and 3 as well as the statistical tests (i.e., Tables 3, 4 and S3). Is the below text correct so that it can be added to our manuscript?

WRSD values that were identified as outliers based on Tukey's ``far out`` fences of ±3 x interquartile 
range (IQR) were removed from the data prior to the analyses.
...
Hence, we marked all E-score values that were more than 3 x IQR below the first quartile value of 
the data set-wide E-score distribution (i.e., Tukey's ``far out`` lower fence) as outliers and removed them from 
the data prior to the analyses. 

Very good, but the values in the new Unpartitioned lines do not seem to add up. For example, in genome NC_000932 we have the following situation:

alephnull7 commented 1 month ago

Regarding length standardization of the data

WRSD counts are normalized by the total length, in kilobases, of the specified grouping. Specifically, for each row in each summary file, lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000. This appears to be consistent with the current wording of the manuscript.
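To make the difference between the two columns concrete, here is a minimal sketch of the formula in Python. It is only an illustration and not the PACVr code itself: the function name and the numbers in the example are made up, and only the column names and the formula come from the comment above.

```python
def per_kilobase(low_cov_win_abs: int, region_len: int) -> float:
    """Return the low-coverage window count per 1000 bp of region length.

    Mirrors lowCovWin_perKilobase = lowCovWin_abs / regionLen * 1000.
    The function name is hypothetical; it is not part of PACVr.
    """
    if region_len <= 0:
        raise ValueError("region_len must be a positive number of base pairs")
    return low_cov_win_abs / region_len * 1000


if __name__ == "__main__":
    # Hypothetical values: the same absolute count in regions of different length.
    for name, count, length in [("long region", 12, 85000), ("short region", 12, 18000)]:
        print(f"{name}: abs={count}, perKilobase={per_kilobase(count, length):.3f}")
```

In other words, 'lowCovWin_abs' is the raw count for a row, while 'lowCovWin_perKilobase' is that count divided by the length of the same row's region, so equal absolute counts give a higher per-kilobase value in the shorter region.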

Regarding the removal of outliers

The current wording of the outlier removal process appears accurate. If desired, we could also mention that the removal is performed separately for each class/state/grouping of the data under analysis. Intuitively this seemed like the right approach, and it appears to be the recommendation for ANOVA and nonparametric group tests. Mentioning it may therefore be redundant, as it is best practice for these sorts of tests.
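As a rough illustration of what per-group removal with Tukey's "far out" fences means, here is a small Python sketch. It is not the PACVr implementation: the function names, the `lower_only` switch (meant to mimic the lower-fence-only rule described for the E-score) and the example data are all hypothetical, and the quartile method (`method="inclusive"`, which matches R's default `quantile()` type 7) is an assumption about how the quartiles are computed.

```python
from collections import defaultdict
from statistics import quantiles


def far_out_fences(values, k=3.0):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    # Assumption: linear-interpolation quartiles, as in R's default quantile().
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_far_out_by_group(records, lower_only=False):
    """Drop far-out values separately within each group.

    records: iterable of (group_label, value) pairs (hypothetical layout).
    lower_only: if True, only values below the lower fence are dropped.
    """
    grouped = defaultdict(list)
    for group, value in records:
        grouped[group].append(value)

    kept = {}
    for group, values in grouped.items():
        if len(values) < 2:  # quartiles are undefined for a single value
            kept[group] = list(values)
            continue
        lower, upper = far_out_fences(values)
        kept[group] = [
            v for v in values if v >= lower and (lower_only or v <= upper)
        ]
    return kept


if __name__ == "__main__":
    # Made-up data: 100 is far out among the "coding" values only.
    data = [("coding", v) for v in [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]]
    data += [("noncoding", v) for v in [100, 101, 102, 103, 104]]
    print(remove_far_out_by_group(data))
```

The point of the example is that the same value can be an outlier in one group and unremarkable in another, which is why the fences are computed per group rather than over the pooled data.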

Regarding the calculation of coding/noncoding ratios

The mismatch between the Complete_genome length and the sum of the Unpartitioned coding/noncoding lengths comes down to filtering out sliding windows that are shorter than windowSize. For that specific example, NC_000932, this filtering reduces the length of the coding and noncoding groupings by 13% and 35%, respectively. In general, this filtering results in a portion of each gene and noncoding gap being discarded from consideration. Additionally, the cumulative amount removed increases proportionally as the number of sequences increases and the typical length of these sequences decreases. For Complete_genome and the quadripartite groupings, the number of base pairs lost is very small, as these sequences are both long and few in number. The opposite is true for genes, and especially so for noncoding sequences. In summary, when creating the sliding windows, we take each defined sequence, divide it into sliding windows of windowSize, and filter out whatever remainder is left at the end (<windowSize).
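The effect can be reproduced with a few lines of Python. This is only a sketch under the assumption that the windows tile each sequence without overlap and that only the trailing remainder is dropped, as described above; the function name, the window size and the region lengths are invented for the example and are not taken from NC_000932.

```python
def window_retention(region_lengths, window_size):
    """Return (retained_bp, discarded_bp) after dropping trailing remainders.

    Each region is cut into full windows of window_size; the leftover
    (< window_size) at the end of each region is discarded.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    total = sum(region_lengths)
    retained = sum((length // window_size) * window_size for length in region_lengths)
    return retained, total - retained


if __name__ == "__main__":
    window_size = 250  # hypothetical value

    # One long region: at most one remainder is lost.
    print(window_retention([100_120], window_size))

    # Many short regions: every region loses its own remainder,
    # and regions shorter than window_size are lost entirely.
    short_regions = [300, 120, 510, 980, 75, 1_430]
    retained, discarded = window_retention(short_regions, window_size)
    print(retained, discarded, f"{discarded / sum(short_regions):.1%} lost")
```

Because every sequence contributes its own remainder, a grouping made of many short sequences loses a much larger share of its length than a single long sequence does, which is consistent with the larger reduction reported for the noncoding grouping.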

If we intend to use these Unpartitioned lengths to compare the evenness (E-score) or WRSD metrics of the corresponding groups, changing these lengths would involve changing how the sliding windows are filtered. The main idea I have considered is to normalize the WRSD counts by sliding window length instead of filtering out the short windows. However, I do not know whether this would be an appropriate approach.

Alternatively, if we do not want to use these lengths for the metric comparisons (E-score and WRSD) but just want to compare the lengths directly, this could be an additional output alongside the tabular statistics. If that is the case, I have recently refactored the processes related to this portion of PACVr, which should make this easier to add.

michaelgruenstaeudl commented 1 month ago

Thank you for the clarifications. I believe that our code is then fine as it is!

Closing this issue.