Summarizing and visualizing the regression-based statistical tests

Overview

The purpose of this issue/document is two-fold:

to summarize the regressions analyses of the project in a textual form
to summarize the results of these regression analyses in both a textual form and a visual form.

Regression analyses

The regression analyses are conducted through this script: https://github.com/michaelgruenstaeudl/PACVr/blob/master/inst/extdata/scripts_for_figures_tables/PREP_RegressionAnalysis_RunAnalyses.R

The input data for this script is found here: https://github.com/michaelgruenstaeudl/PACVr/tree/master/inst/extdata/scripts_for_figures_tables/input/data_for_regression_analyses_2024_10_16.rds

Textual summary of analyses and results

METHODS Regression-based statistical tests Regression models that implement either linear regressions or decision trees were used to test the hypotheses of this investigation. These models employed predictor variables that are also evaluated in our group-based statistical tests as well as control variables (hereafter ‘covariates’) that may have an additional, confounding impact on sequencing depth and sequencing evenness. The regression models on sequencing depth used the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome as predictor variables, with each category within these predictor variables defined as independent. The effects of predictor variables on sequencing evenness were modeled as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were specified as covariates. The regression models on sequencing evenness, by comparison, used the assembly quality indicators (i.e., the number of ambiguous nucleotides and the number of IR mismatches) and the identity of the sequencing platform employed as predictor variables, with each sequencing platform category defined as independent. The effects of each assembly quality indicator on sequencing evenness were modeled as a linear regression, the effect of the identity of the sequencing platform employed on sequencing evenness as a decision tree. Total genome size and the exact ratio of coding versus non-coding plastome sections were used as covariates. The identity of the assembly software tool employed, by contrast, was not used as a predictor variable due to its high proportion of missing data. Outliers within any predictor or control variable category were removed from the dataset prior to model scoring. All regression-based models were implemented in R via the R package tidymodels (Kuhn and Wickham 2020).

RESULTS Regression-based statistical tests The results of our regression-based statistical tests on sequencing depth indicated that both the separation into the four structural partitions of a plastid genome and the separation between the coding versus the non-coding sections of each plastome had a significant effect on, and a moderate explanatory ability for, the observed variability in sequencing depth. The results of our regression-based statistical tests on sequencing evenness indicated that the identity of the sequencing platform employed had a significant effect on the observed variability in sequencing evenness. The linear regression regarding the effect of the assembly quality indicators on sequencing evenness indicated that only the number of ambiguous nucleotides had a significant impact on sequencing evenness, corroborating the results of our correlation coefficient analysis among the group-based statistical tests.

Is the above summary complete?

[ ] Yes
[ ] No

BIBTEX

@Manual{KuhnAndWickham2020,
title = {Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles},
author = {Max Kuhn and Hadley Wickham},
url = {https://www.tidymodels.org},
year = {2020},
}

Visual summary of results

Premise

The results of the above regression analyses are difficult to interpret for a non-statistician, because they are primarily textual. For our planned manuscript, it would be great if we could generate graphical visualizations of these results, especially such that are simple, colorful, and intuitive (yes, most manuscript readers need to be treated like children!).

$${\Large \color{red}Start~here!}$$

Improving existing figures and combining them into a collage

Two types of plots are already produced through our code:

variable importance plots
decision tree plots

Both of these plots are a good start, but they are somewhat difficult to interpret for a non-statistician. One the one hand, these existing plots could be further simplified to make them more intuitive. Decision tree plots, for example, can be improved compared to the default settings via the options explained in http://www.milbo.org/rpart-plot/prp.pdf. On the other hand, alternative ways to visualize regression analysis results exist. Below are examples of such visualizations. Which exact visualizations would make sense on our data is a matter we still have to discuss as a group. If we can adapt and implement some of these code examples (especially example 1 !) for our data, we should be able to create similar visualizations.

Example 1 This code seems visualize the results of the regression decision trees via the package parttree. https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/

Example 2 This code seems visualize the results of the regression decision trees via the package parsnip. https://www.tmwr.org/explain

Goal

The production of publication-quality figures that would help a manuscript reader in interpreting the regression analysis results.

Comment Regarding Issue #51

Background:

Though I do not have an extensive background in writing scripts in R, I have been actively building my skills this past week and a half to better contribute to this specific data visualization issue case, and I am starting to get a better grip on it for this project!

The current primary foundation of visualization in this script is resting on vip, ggplot2, and other minor features of svglite and rplot. However, to obtain a more refined visualization structure, we can take the current visualization functions and/or manipulated data to be expanding on via various potential routes that I have been looking into.

Potential Routes:

Modifying the current implementation to incorporate aspects from parttree alongside parsnip with vip
- As mentioned in the original examples presented from the issue case above, the code for modelling decision trees can simply be rewritten/extended to utilize various capabilities that are found natively in the parttree package. However, I have found that parttree does not seem to provide a manageable option for plotting variable importance, so I would recommend extending further capabilities natively found in vip since the package is already utilized, but with including the parsnip ecosystem in order to create a better, more modern-looking data visualization.
Exporting to RDS and using a Python script for data visualization
- Another option that seems easily doable would be to export the manipulated data to an RDS file, then have a separate Python script that reads the RDS file using pyreadr and creates modern-looking, accurate data visualizations of decision trees and variable importance plots using the popular libraries seaborn and matplotlib.
Maximizing the existing capabilities of ggplot2 (Not Recommended)
- Another route would be to simply not include any additional packages, and just maximize the capabilities that are presently found in ggplot2 and vip to manually change things such as colors, fonts, etc. to create a custom theme from scratch, since they are already used in the script. (This route is more labor intensive, thus less preferred)

My Next Steps:

Personally, even though I am rather inclined to attempt using Python (and I may still do so in my free time) given my background in the language, I believe that the most straightforward approach ultimately is the first route I mentioned above, Modifying the current implementation to incorporate aspects from parttree alongside parsnip with vip. Thus, I will continue my learning path in getting up-to-speed with R and gaining a better understanding of this current Regression Analysis script that's at-hand!

I should have my finalized draft my this Friday, November 1st, and I look forward to my next comment then!

The visualizations that we already have (i.e., variable importance plots, decision tree plots) are not incorrect or otherwise "bad" in any way. Quite on the contrary: they are a great first step. They also do not need to be modern-looking in any particular way. In their current form, they are just a bit difficult to interpret for a layman (as illustrated by the following example), and I believe that we can generate some valid figures that are easier to grasp by a non-expert reader.

Example: The following collage of variable importance plots correctly displays the variable importance for each analysis performed, but does it convey a clear message?

Variable Importance Collage

By comparison, I consider the visualization of decision boundaries more intuitive. The parttree output as presented on https://paulvanderlaken.com/2020/03/31/visualizing-decision-tree-partition-and-decision-boundaries/ or the boundary decision plot made via ggplot at https://koalaverse.github.io/homlr/notebooks/09-decision-trees.nb.html is something that a non-expert reader would grasp quickly.

Conceptually, something like the motivating example at https://cran.r-project.org/web/packages/ggparty/vignettes/ggparty-graphic-partying.html would be amazing! After all, I see a decision tree and bar charts for each tree node in that figure. Conceptually, we have the same data from our regression analysis results.

When analyzing the whole dataset we have the following findings:

Records are distributed evenly based on Chromosome types and RegionSubset. While looking at the frequency of Coding and non-coding regions for each chromosome type, there is even data distribution in records.

barplot

When we look into avgLength and lowCovWin, Chromosome types are distinguished by shape and regionSubset by color, We found that there is evenly distribution of records across avgLength which does not segregate the records clearly and makes it more difficult to predict lowCovWin using avgLength.

scatterplot1

When segregating records on the basis of regionLen and lowCovWin, We found that larger regionlen has LSC chromosomes and most of the records have a “coding” regionsubset. This plot clearly separates sequences with LSC chromosome and coding regionsubset. This signifies the importance of regionlen on predicting “lowCovWin.

scatterplot2

I tried plotting existing decision tree models (both Coverage Analysis and Evenness Analysis) using the geom_parttree method from parttree package, I was not able to plot because of the following concerns:

QUADRIPARTITE REGION and CODING-NONCODING DIVISION on SEQ COVERAGE decision tree model uses 4 predictor variables (Chromosome, RegionSubset, regionLen, avgLength) but it seems that geom_parttree only accepts one or two predictors.
Effects of only the QUADRIPARTITE REGION on SEQ COVERAGE uses 2 predictor variables (Chromosome and RegionSubset) but both of the predictors are categorical datasets. While plotting parttree and scatterplot for two predictors, it produces the following output. It doesn't provide any significant information about the decision tree.
Apart from that for all the Evenness analysis decision tree, I could not find any significant plot generated by parttree.
When considering the effect of sequencingMethod only on Sequence Evenness, Here the first figure is generated by parttree and geom_point and 2nd figure is generated by geom_point only. Both are similar. This figure shows the distribution of the records for each SequencingMethod.

Please help me on analyzing the existing decision tree models using parttree package.

@buddhathapa12 Good work! Your attempts are quite valuable, as they clarify that plotting decision tree models may not be easy due to the incompatibility of the number and type of predictor variables in our data. I will look into your findings more in a few days and then respond in more detail.

In addition to the model variable importance plots, I have added permutation variable importance plots, which give insight into whether variables determined by the model are important in practice. Essentially this is a test for overfitting, which, due to the hyperparameter tuning done to create the model, aligns with the model variable importance. The one thing to mention is that since the permutation importance is calculated by repeated permutations of the data (in this case the training data used to determine the hyperparameters) there is an associated standard deviation to the metric. For example, the assembly method's high standard deviation relative to its importance does give us additional evidence that its predictive ability is likely limited. cov_import seq_import asm_import

By making some small tweaks to the existing regression models and utilizing ggparty I have updated the styling of the decision trees while keeping them relatively simple. If these trees were being used as sole visual depictions of the models I would likely add some of the more exotic plotting features of the package. Instead, I delegated the plotting of the trees against training and testing data onto their own plots. However, you will see that the color palettes used for low-coverage windows and evenness scores are consistent in these other plots. cov_tree seq_tree asm_tree

Perhaps solutions already exist, but I implemented my own method for plotting the partitions of regression decision trees. As in the classification examples linked by @michaelgruenstaeudl, the idea is to plot actual data against the values predicted by the decision tree. Since we already partitioned the sample data into training-testing, we are primarily interested in seeing how the testing data (withheld when fitting the model) performs. In addition, we will see the testing data for sequencing method and evenness score since there is less data there and only one notable interaction to plot. There exist no partition plots for the assembly method decision since there is only a single predictor, the partitioning is binary, and as stated above the value of it appears limited. cov_part1 cov_part2 cov_part3 seq_part1 seq_part2

@buddhathapa12 thank you for sharing some good insights and plots for descriptive statistics. In regard to regression or classification models for the data, feel free to use what is present in PREP_RegressionAnalysis_RunAnalyses.R or build something of your own. For example, if you are most familiar with linear regression models, you could go in that direction. In that instance, you might test for homoscedasticity, engage in feature selection (such as eliminating statistically insignificant features), and test for influential outliers.

As mentioned above, it appears that packages like parttree focus on plotting the partitions of classification trees. While it might be possible to adapt those solutions to regression trees, it is not something I have attempted. If you are determined to use parttree you could try creating some classification trees using the data, such as predicting assembly and sequencing method with the other variables resulting from filter_complete_genome(), or predicting the interaction of region and coding-noncoding division with the other variables resulting from filter_region_x_subset(). Note that the only changes necessary for a classification tree fit should be to use set_mode("classification") in the model and to use select_best(metric = "accuracy") (or other applicable criterion) when selecting the best hyperparameters.

michaelgruenstaeudl / PACVr