jaybee84 / ml-in-rd

Manuscript for perspective on machine learning in rare disease

Figure: Statistical Techniques #106

Closed: cgreene closed this issue 3 years ago

cgreene commented 4 years ago

It would be grand to get a figure on the statistical techniques we discuss and to link those to how they address challenges in the rare disease space.

jaybee84 commented 3 years ago

This is a tentative sketch for the figure depicting key takeaways from the newly rewritten "Manage model complexity" section.

cc: @jaclyn-taroni @allaway @cgreene

We can use this as a starting point for the final figure for this section based on your comments. Happy to modify as needed.

jaybee84 commented 3 years ago

What do we intend to communicate via the figure? -- The main message is that applying machine learning to rare disease datasets can lead to complex, easily misinterpreted models due to the scarcity of data points and other challenges associated with this kind of data. However, we can use various statistical techniques to build simple, stable models that capture the essential and relevant patterns in rare disease data. A tentative sketch of the figure is presented in the comment above. The "person" in the figure represents patients; the small spheres and ovals represent features (e.g., genes, variants, symptoms) associated with a patient sample.

If there are multiple pieces of information we are trying to present (see the first point), what is the one core piece of information that the audience should walk away with? -- Using specific statistical techniques that can mitigate the challenges posed by small and heterogeneous datasets is essential for the successful application of machine learning to rare diseases.

A list of or brief description of concepts to familiarize herself with -- The "Manage model complexity" section of the manuscript outlines the main strategies captured in this figure.

dvenprasad commented 3 years ago

I split this up into individual methods for the purpose of illustrating them. We can combine them once we are happy with how the individual methods are presented.

They are very generic. As we iterate, we can make them more specific.

Bootstrapping

bootstrapping

Ensemble Learning

I have annotated my comments.

ensemble-learning

Regularization

For this I used the same illustration as dimension reduction. It seemed to me that both are reducing the feature space, and that was what we wanted to show. Also, someone (maybe Robert? not sure) mentioned on the call today having two levels of abstraction for #116, and maybe here is a good place to present a more abstracted figure?

regularization

One-class-at-a-time

I have annotated my comments.

one-class-at-a-time

jaybee84 commented 3 years ago

Thanks @dvenprasad ! Below are my first thoughts:

  1. Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping shuffles the data points (i.e., it takes various draws of the same samples, like picking balls from a bag with replacement) but leaves the features untouched. -- Following on the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).
  2. Ensemble learning: Would the open circles be considered features that the model is considering?
  3. Regularization: Regularization decreases the feature space by penalizing models that consider too many features, and thus favors models that use only a few important features (see the sketch after this list). Again taking the rectangular box analogy, it helps pick the models that use a square box (where each side is equal to the short side of the rectangular box) instead of the ones that try to use the full rectangular box.
  4. One class at a time: Here I think it may be helpful to divide each cube into 4 or more colored sections (a discrete gradient of 4 shades), each shade depicting a "class", and then use the simple model to separate one shade (e.g., the darkest shade) from all the rest at each step, instead of a complex model separating all shades at the same time.
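
To make item 3 concrete, here is a minimal sketch of the idea (assuming scikit-learn; the synthetic data, the Lasso penalty strength, and the variable names are illustrative only, not the manuscript's actual analysis). The L1 penalty is applied while the model is being fit and drives most coefficients to zero, so the regularized model effectively uses only a small "square" of the available features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 200))                               # few samples, many features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=30)   # only 2 features truly matter

unregularized = LinearRegression().fit(X, y)
regularized = Lasso(alpha=0.1, max_iter=10000).fit(X, y)

# The L1 penalty zeroes out most coefficients, shrinking the effective feature space.
print("features used (no penalty):", np.sum(unregularized.coef_ != 0))
print("features used (L1 penalty):", np.sum(regularized.coef_ != 0))
```
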
allaway commented 3 years ago

Bootstrapping: it seems to me that in the depiction the features are being shuffled (please correct me if my understanding is not right). In actuality, bootstrapping shuffles the data points (i.e., it takes various draws of the same samples, like picking balls from a bag with replacement) but leaves the features untouched. -- Following on the rectangular box (few samples/many features) vs. square box (many samples/many features) analogy from this morning, bootstrapping is a way to turn a rectangular box into a square box (where each side is equal to the long side of the rectangle).

I agree with @jaybee84 here, and it is important to note that the sampling is with replacement (at least for all of the bootstrapping I'm familiar with :) ), so you typically end up with replicates of some samples in every "bootstrap".
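
For concreteness, a minimal sketch of what that resampling looks like in code (assuming numpy; the array sizes and variable names are made up purely for illustration). The rows (samples) are drawn with replacement while the feature columns are left untouched, so some samples are replicated and others are left out of any given bootstrap.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1000))   # 20 patients (rows) x 1000 features (columns)

# One bootstrap: draw row indices with replacement; the feature columns stay as-is.
idx = rng.integers(0, X.shape[0], size=X.shape[0])
X_boot = X[idx, :]

# Because sampling is with replacement, some patients appear more than once
# and others not at all in this particular bootstrap.
unique, counts = np.unique(idx, return_counts=True)
print("replicated samples:", int((counts > 1).sum()),
      "| left-out samples:", X.shape[0] - unique.size)
```
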

dvenprasad commented 3 years ago

Bootstrapping: Yes, I was shuffling features. I've redone them with your notes. I'm shuffling the colors on the cube to indicate the shuffling of data points (I do want to add circles within the squares as data points, and we can change the colors of those instead of the sides of the cube). I'm not sure how to show that each side is the length of the longer side.

bootstrapping

Ensemble learning
Yes, each circle could be considered a feature.

Regularization

PXL_20201207_183721344

Model A has learned from a rectangular box and Model B has learned from a cube. So is regularization taking these two models and ranking them? So the end result would be that Model B is ranked higher than Model A?

dvenprasad commented 3 years ago

One class at a time: I've depicted the output as the model being able to tell which class it has learned and treating the rest of the classes as "a class I don't know".

one-class-at-a-time
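
As a minimal sketch of the one-class-at-a-time idea in code (assuming scikit-learn's OneVsRestClassifier; the synthetic data stand in for the shaded classes in the figure): rather than one complex model separating all classes at once, a simple binary model is fit for each class against "everything else", which matches the "class I have learned" vs. "a class I don't know" framing above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-class data standing in for the shaded "classes" in the figure.
X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# One simple binary model per class: class k vs. "everything else".
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_))  # 4 binary models, one per class
```
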

dvenprasad commented 3 years ago

Okay, took another pass at bootstrapping and ensemble learning after Monday's discussion.

Bootstrapping

Couple of notes:

bootstrapping

Ensemble Learning

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and get a good-health combined model? Do we want to show that kind of variation?

ensemble-learning

allaway commented 3 years ago

bootstrapping figure:

I really like this! My only thought would be to add the individual models that are created during each bootstrap to help the reader understand how each bootstrap is contributing some knowledge to the final aggregate model.

Here's a sloppy mock of what I was thinking: bootstrap
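
In code terms, that per-bootstrap-model-then-aggregate picture corresponds to bagging. A minimal sketch (assuming scikit-learn's BaggingClassifier with its default decision-tree base model; the synthetic data are illustrative only): each base model is trained on its own bootstrap, and the single aggregate model votes over all of them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=60, n_features=100, random_state=0)

# Each base model is fit on its own bootstrap (sampling rows with replacement);
# the aggregate prediction votes over all of the per-bootstrap models.
bag = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0).fit(X, y)
print(len(bag.estimators_))  # 10 per-bootstrap models behind one aggregate model
```

Each of the 10 fitted estimators would correspond to one "bootstrap + model" panel in the mock, and `bag` itself is the aggregate model they feed into.
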

allaway commented 3 years ago

re ensemble modeling:

I've kept the average health of the models similar across the 3 runs, i.e., 2 good-health models and 1 poor-health model. But my question is: can you have 3 average-ish models and still get a good-health combined model, or have 2 poor-health models and 1 good-health model and get a good-health combined model? Do we want to show that kind of variation?

Here's an example of actually combining models:

Screen Shot 2020-12-09 at 9 02 55 AM

The first box is our "best" model alone. The 2nd is the "best + 2nd best", the 3rd is "best + 2nd best + 3rd best" and so on....

The blue and yellow boxes are anything that is substantially "better health" than the best model alone. The red boxes are statistically indistinguishable from the best model alone. So, if we combine 'good health' models with each other, the resulting ensemble is better, but if we start adding in too many models that are of 'poor health', we actually don't get a model that's better than the sum of its parts.

I'm not sure if this is worth conveying in the figure, but I think that it might be beyond the scope of this manuscript because it's probably variable based on the problem, and I doubt this is specific to rare disease modeling.

Also, I'm a little unsure of what the "runs" indicate. This looks like an ensemble of ensembles (which is still an ensemble, but might be unnecessary for conveying the concept?).
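
To make the "best", "best + 2nd best", ... combinations concrete, here is a minimal sketch of one way to build them (assuming scikit-learn; the synthetic data, the choice of logistic regression models, and ranking by held-out accuracy are illustrative assumptions, not the actual analysis behind the screenshot).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=120, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three candidate models of varying "health" (here just different regularization strengths).
models = [LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
          for c in (1.0, 0.1, 0.001)]

# Rank models by individual held-out accuracy, then combine best, best + 2nd best,
# best + 2nd best + 3rd best by averaging their predicted class probabilities.
order = np.argsort([-m.score(X_test, y_test) for m in models])
for k in range(1, len(models) + 1):
    avg_probs = np.mean([models[i].predict_proba(X_test) for i in order[:k]], axis=0)
    accuracy = np.mean(avg_probs.argmax(axis=1) == y_test)
    print(f"top {k} model(s) combined: accuracy = {accuracy:.2f}")
```
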

jaybee84 commented 3 years ago

Re: bootstrap

suggest changing "aggregate" to "harmonize" :)

jaybee84 commented 3 years ago

Re: ensemble image

jaybee84 commented 3 years ago

@dvenprasad I added a rough sketch re: ensemble learning in the above comment. Please let me know if you cannot view it or have questions.

dvenprasad commented 3 years ago

Regularization

regularization

jaybee84 commented 3 years ago

I like the above figure... just a few notes:

  1. We might not want to use circles as a dataset indicator since we are using circles as features.
  2. To be consistent with the dimension reduction fig (and the heatmap panel), we might want the datasets to be rectangles in portrait mode, i.e., many features as rows and few samples as columns.
  3. It may just be me, but in this figure it seems like the top two rectangles are making the unhealthy model and the bottom two rectangles are making the healthy model. It would be ideal to show that all 4 datasets together lead to the top model when not regularized, and to the bottom model when regularized.

dvenprasad commented 3 years ago

Changed yellow -> purple because the contrast with white was really bad.

Bootstrapping: Updated it based on feedback -> added in sad models per resampling of the dataset.

bootstrapping

Ensemble Learning: Re-did it based on @jaybee84's sketch.

ensemble-learning

Regularization: Made changes based on https://github.com/jaybee84/ml-in-rd/issues/106#issuecomment-744658574

regularization

One class at a time

one-class-at-a-time