michaelgruenstaeudl / PACVr

Plastome Assembly Coverage Visualization in R

Updates to tabular stats and paper figures #44

Closed. alephnull7 closed this 1 month ago.

alephnull7 commented 1 month ago

These changes attempt to address the revisions requested for the manuscript.

Regarding updating the WRSD metric, I investigated other normalization approaches and possible transformations to the existing normalized metric. For sample-length normalization, I could not find other techniques that would produce a meaningful measure or be appropriate for proper class/group comparisons. Other common bioinformatics normalization calculations adjust for sequencing depth or for other qualities intended to make samples of the same genome more comparable. The WRSD count is itself a measure of depth, so the former is not appropriate, and our comparisons are between different genomes, so the latter adjustments don't apply. The only unused feature of the samples we appear to have access to is the taxonomic association, but that would probably be better used as another class in the statistical tests (or as a categorical variable in a regression or classification model) than to alter the WRSD metric directly. Ultimately, the only change made to this metric has been to express it per kilobase, which appears to be more standard and makes human comparison of the values easier.
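For concreteness, a minimal sketch of the rescaling in R (the data values and the regionLength column are illustrative assumptions for this sketch; lowCovWin_abs and lowCovWin_perKilobase are the column names used in the summary output):

```r
# Illustrative data: absolute WRSD counts per region and region lengths
# in bases ('regionLength' is an assumed column name for this sketch).
stats <- data.frame(
  lowCovWin_abs = c(12, 3, 7),
  regionLength  = c(86000, 25000, 18000)
)

# Express the WRSD count per kilobase instead of per base; dividing by
# (length / 1000) makes the values 1000 times larger than a per-base rate.
stats$lowCovWin_perKilobase <- stats$lowCovWin_abs / (stats$regionLength / 1000)
```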

Just as the original WRSD counts were heteroskedastic between classes, so are the length-normalized metrics. Since we were already investigating transformations, my main focus was on attempting to produce homoskedasticity, which would expand our options for statistical tests. There was no success on this front, even with Box-Cox transformations and other log-likelihood optimization techniques, so this endeavor was abandoned. In many cases, Levene's test on the transformed data yielded a larger p-value than on the untransformed data, but one still close to 0. If homoskedasticity could be achieved, the tradeoff of reduced interpretability might make sense, but that is not the case.
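For reference, a minimal sketch of the kind of check involved, using MASS::boxcox for the log-likelihood search and car::leveneTest for the homoskedasticity test (the data here are simulated stand-ins, not our WRSD values):

```r
library(MASS)  # boxcox()
library(car)   # leveneTest()

# Simulated stand-in: a positive metric whose spread differs by class.
set.seed(1)
df <- data.frame(
  class  = factor(rep(c("A", "B"), each = 50)),
  metric = c(rexp(50, rate = 1), rexp(50, rate = 0.1))
)

# Box-Cox search: pick the lambda that maximizes the log-likelihood.
bc     <- boxcox(lm(metric ~ class, data = df), plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
df$transformed <- if (abs(lambda) < 1e-8) log(df$metric) else
  (df$metric^lambda - 1) / lambda

# Levene's test before and after: in our data, the transformed p-value
# was typically larger but still close to 0.
leveneTest(metric ~ class, data = df)
leveneTest(transformed ~ class, data = df)
```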

As with data transformation, I am cautious about outlier removal unless there is a justification grounded in the nature of the data or its intended use. Since we are studying samples that are intended to be representative of a given taxon, it makes sense to remove observations that are extremely dissimilar from the others. There are multiple standard approaches to outlier detection, and I elected to use Tukey's fences, mostly because of their common use for boxplots, which are one of the intended uses of the outlier-free data. Since I prefer to be conservative with outlier removal, the current k value for determining the range is 3, which Tukey considered "far out". To reflect the nature of the statistical tests, the outliers are determined on a class level, within the creation of the figure_data. In addition to this filtering being applied to the WRSD metric, it is also applied to the E-score data, to keep our data processing consistent. The results appear to be in line with our intention of improving the quality of the figures and statistical tests without overly curating the data.
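A minimal sketch of that per-class filtering (column names and values are illustrative; in the analysis this happens within the creation of the figure_data):

```r
# Keep values inside Tukey's fences with k = 3 ("far out").
tukey_keep <- function(x, k = 3) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x >= q[1] - k * iqr & x <= q[2] + k * iqr
}

# Illustrative stand-in for the figure data.
set.seed(1)
figure_data <- data.frame(
  class = factor(rep(c("A", "B"), each = 100)),
  wrsd  = c(rexp(100, rate = 1), rexp(100, rate = 0.2))
)

# Determine outliers on a class level, then drop them.
keep <- ave(figure_data$wrsd, figure_data$class, FUN = tukey_keep)
figure_data <- figure_data[as.logical(keep), ]
```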

The output of the tabular stats from PACVr has been updated to allow easier compilation of the data (and to remove unhandled errors). Unpartitioned statistics have been added to the coding and noncoding summary files to expedite the calculation of coding/noncoding ratios and WRSD in the paper's analysis. As noted above, the WRSD metric is now expressed per kilobase, which ultimately makes the metric 1000 times greater than before.
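As a hypothetical illustration of why the unpartitioned rows help (the column names and values here are assumptions for this sketch, not PACVr's actual output format):

```r
# Hypothetical coding/noncoding summaries: the four structural partitions
# plus the new "Unpartitioned" row (column names assumed for this sketch).
coding <- data.frame(
  partition    = c("LSC", "SSC", "IRa", "IRb", "Unpartitioned"),
  regionLength = c(52000, 12000, 13000, 13000, 90000)
)
noncoding <- data.frame(
  partition    = c("LSC", "SSC", "IRa", "IRb", "Unpartitioned"),
  regionLength = c(34000, 6000, 12000, 12000, 64000)
)

# With an "Unpartitioned" row present, the genome-wide coding/noncoding
# ratio is a single lookup rather than a re-aggregation across partitions.
ratio <- coding$regionLength[coding$partition == "Unpartitioned"] /
  noncoding$regionLength[noncoding$partition == "Unpartitioned"]
```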

Figure 1 is now created entirely within the appropriate script, instead of being assembled from subfigures in the manuscript. Additionally, the creation of Figure 1A now follows a more manual process, resulting in correctly positioned outlier labels and labels that are easier to parse.

michaelgruenstaeudl commented 1 month ago

My answers are intercalated below.

Part 1

Regarding length standardization of the data

> Regarding updating the WRSD metric, I investigated other normalization approaches and possible transformations to the existing normalized metric. For sample-length normalization, I could not find other techniques that would produce a meaningful measure or be appropriate for proper class/group comparisons. Other common bioinformatics normalization calculations adjust for sequencing depth or for other qualities intended to make samples of the same genome more comparable. The WRSD count is itself a measure of depth, so the former is not appropriate, and our comparisons are between different genomes, so the latter adjustments don't apply. The only unused feature of the samples we appear to have access to is the taxonomic association, but that would probably be better used as another class in the statistical tests (or as a categorical variable in a regression or classification model) than to alter the WRSD metric directly. Ultimately, the only change made to this metric has been to express it per kilobase, which appears to be more standard and makes human comparison of the values easier.

Yes, standardizing the WRSD counts by kilobase is a sensible approach. But is the standardization by partition length or the length of coding/non-coding regions still included as well? In our manuscript, it currently says:

> Moreover, two types of data standardization were conducted per genome to account for differences in sequence length: we normalized the number of WRSD by partition length to account for the length differences between the four structural genome partitions, and we normalized the number of WRSD by the length ratio of coding versus non-coding sections to account for the length differences between the sum of all coding and all non-coding genome sections.

Are the above sentences from the manuscript even correct?

Or, said differently: can you please briefly explain how the values of the columns 'lowCovWin_abs' and 'lowCovWin_perKilobase' differ, so that I can check whether the manuscript text is correct?

Regarding homoscedasticity of the data

> Just as the original WRSD counts were heteroskedastic between classes, so are the length-normalized metrics. Since we were already investigating transformations, my main focus was on attempting to produce homoskedasticity, which would expand our options for statistical tests. There was no success on this front, even with Box-Cox transformations and other log-likelihood optimization techniques, so this endeavor was abandoned. In many cases, Levene's test on the transformed data yielded a larger p-value than on the untransformed data, but one still close to 0. If homoskedasticity could be achieved, the tradeoff of reduced interpretability might make sense, but that is not the case.

Not a problem. It was worth a try but no need to spend any time on this anymore.

Regarding the removal of outliers

> As with data transformation, I am cautious about outlier removal unless there is a justification grounded in the nature of the data or its intended use. Since we are studying samples that are intended to be representative of a given taxon, it makes sense to remove observations that are extremely dissimilar from the others. There are multiple standard approaches to outlier detection, and I elected to use Tukey's fences, mostly because of their common use for boxplots, which are one of the intended uses of the outlier-free data. Since I prefer to be conservative with outlier removal, the current k value for determining the range is 3, which Tukey considered "far out". To reflect the nature of the statistical tests, the outliers are determined on a class level, within the creation of the figure_data. In addition to this filtering being applied to the WRSD metric, it is also applied to the E-score data, to keep our data processing consistent. The results appear to be in line with our intention of improving the quality of the figures and statistical tests without overly curating the data.

Yes, that sounds great and is exactly what I was thinking of: a conservative detection and removal of outliers for Figures 2 and 3 as well as for the statistical tests (i.e., Tables 3, 4, and S3). Is the text below correct, so that it can be added to our manuscript?

> WRSD values that were identified as outliers based on Tukey's "far out" fences of ±3 × interquartile range (IQR) were removed from the data prior to the analyses.
> ...
> Hence, we marked all E-score values that were more than 3 × IQR below the first quartile value of the data set-wide E-score distribution (i.e., Tukey's "far out" lower fence) as outliers and removed them from the data prior to the analyses.
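For reference, a minimal sketch of the one-sided, data-set-wide fence described in the second passage (the E-score values are simulated stand-ins):

```r
# Simulated stand-in for a data-set-wide E-score distribution.
set.seed(1)
escore <- c(rbeta(200, 5, 1), 0.02, 0.05)

# Tukey's "far out" lower fence: Q1 minus 3 times the IQR; only values
# below this fence are marked as outliers and removed.
q           <- quantile(escore, c(0.25, 0.75))
lower_fence <- q[1] - 3 * (q[2] - q[1])
escore      <- escore[escore >= lower_fence]
```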
michaelgruenstaeudl commented 1 month ago

My answers are intercalated below.

Part 2

Regarding the calculation of coding/noncoding ratios

> The output of the tabular stats from PACVr has been updated to allow easier compilation of the data (and to remove unhandled errors). Unpartitioned statistics have been added to the coding and noncoding summary files to expedite the calculation of coding/noncoding ratios and WRSD in the paper's analysis. As noted above, the WRSD metric is now expressed per kilobase, which ultimately makes the metric 1000 times greater than before.

Very good, but the new "Unpartitioned" lines do not seem to add up. For example, in genome NC_000932 we have the following situation:

Regarding Figure 1

> Figure 1 is now created entirely within the appropriate script, instead of being assembled from subfigures in the manuscript. Additionally, the creation of Figure 1A now follows a more manual process, resulting in correctly positioned outlier labels and labels that are easier to parse.

Excellent. I have nonetheless made some minor changes to it, and it looks great now.