jaybee84 / ml-in-rd

Manuscript for perspective on machine learning in rare disease

Figure: Feature selection and dimension reduction #116

Closed by jaclyn-taroni 3 years ago

jaclyn-taroni commented 4 years ago

Idea from #114 that we are not 100% sold on - figure that covers "there are too many features", feature selection and feature learning, and maybe the specific use case of visualizing batch effects.

jaybee84 commented 4 years ago

Assigned to @allaway since this refers to the heterogeneity and dimension reduction section.

allaway commented 3 years ago

The "curse of dimensionality" is a challenge for rare disease research. One way to overcome this is by aggregating datasets, specimens, etc., to boost the number of samples available for analysis. However, this can lead to additional issues, like non-biological variability. Dimensionality reduction methods (and subsequent correction of technical artifacts) and representation learning can help us get around these issues.

Dimensionality reduction is a powerful tool for the analysis of rare disease datasets.

- curse of dimensionality
- dimensionality reduction (and representation learning)
- batch correction

This section should give a pretty high-level intro to these concepts, so if it doesn't explain them adequately, it probably means we need to add more info!
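Not from the manuscript itself, but a minimal numpy sketch of the scenario described above (all numbers are made up for illustration): simulated samples pooled from two datasets, where a genome-wide batch shift dominates the variance, so PC1 of a PCA tracks the dataset of origin while PC2 still separates the biological classes.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_features = 25, 1000

# Two biological classes and two batches (datasets), fully crossed.
bio = np.repeat([0, 1], 2 * n_per_group)            # 0 = control, 1 = case
batch = np.tile(np.repeat([0, 1], n_per_group), 2)  # dataset of origin

X = rng.normal(size=(4 * n_per_group, n_features))
X[bio == 1, :100] += 1.5   # modest biological signal in 100 features
X[batch == 1, :] += 3.0    # larger, genome-wide technical shift

# PCA via SVD of the centered matrix; each PC ("new feature") is a
# weighted combination of all the original features.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U * S  # sample coordinates on the new features

# PC1 tracks the batch effect; PC2 still separates the biology.
print(abs(np.corrcoef(pcs[:, 0], batch)[0, 1]))
print(abs(np.corrcoef(pcs[:, 1], bio)[0, 1]))
```

This is roughly the situation the figure sketches below aim to convey: reducing 100,000 features to a handful of new dimensions makes both the technical artifact and the biological separation visible at once.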

dvenprasad commented 3 years ago

[image: dimension-reduction sketch]

jaclyn-taroni commented 3 years ago

I wanted to follow up on the not very good or informative drawing I showed over Meet yesterday with a slightly better drawing: a heatmap with annotation bars on the left (I didn't fill in the colors that represent the values) and a scatterplot on the right, where points are individual samples. My thought is that we could put some whitespace between the features that show different patterns to make the patterns easier to discern.

[image: heatmap with annotation bars alongside a sample scatterplot]

☝️ This doesn't yet address our concerns about not being particularly specific to rare diseases, or about getting from complex -> simple representations of concepts, but I figured I would put it out there before I spill coffee on the original drawing.

allaway commented 3 years ago

Thanks @dvenprasad and @jaclyn-taroni. I like this sketch (btw - how did you draw this - a tablet?). I like that PC1/Feature 1 clearly shows the technical variation while PC2/Feature 2 highlights that you can still get meaningful separation of biological classes in other dimensions.

To add some "rare disease" flavor to this - perhaps one or more of the datasets could be restricted to one class, or have a more uneven distribution of classes. That's a pretty common scenario that - in my experience at least - is extra-common in rare disease datasets.

Just food for thought, but I think the mini-heatmap, while helpful for showing the reader how features can be reduced to identify classes or batches, might not convey why dimensionality reduction is a useful strategy (in other words, in reality, the classes are difficult to identify when looking at the full feature set). Is this something we think is important to convey in this figure?

jaclyn-taroni commented 3 years ago

> To add some "rare disease" flavor to this - perhaps one or more of the datasets could be restricted to one class, or have a more uneven distribution of classes. That's a pretty common scenario that - in my experience at least - is extra-common in rare disease datasets.

Oops, I totally set out to communicate the part about classes not being represented in all datasets (gray dataset in heatmap) and then got carried away making dots! (I recently got an iPad 😄 ) If we like this overall concept, I think there's room to incorporate these ideas (and we should!).

> I think the mini-heatmap, while helpful for showing the reader how features can be reduced to identify classes or batches, might not convey why dimensionality reduction is a useful strategy (in other words, in reality, the classes are difficult to identify when looking at the full feature set). Is this something we think is important to convey in this figure?

I agree that that would be useful to convey in this figure. If we stuck with the mini-heatmap, would including one set of features that "looks" very noisy address this point? Another idea would be to borrow the "health bar" concept from the few-shot learning panel, where you have a "healthier" model when you reduce the feature space.

dvenprasad commented 3 years ago

@jaybee84 and I spoke a bit more on Monday and expanded on the figure @jaclyn-taroni posted above.

[screenshot: expanded figure draft, 2020-12-09]

allaway commented 3 years ago

I guess I don't understand the first box - isn't it conveying the same thing as the second set? In @jaclyn-taroni's sketch, the dimensionality reduction happens at the arrow between the heatmap and the scatterplot. I think that the final box is helpful, but it represents the output of the black arrow (i.e., all of the new dimensions), and the scatterplot would still be the final step (plotting new f1 and f2).

jaybee84 commented 3 years ago

I think the first box (sample-feature box) should recap the 4 dataset boxes with colored and shaped dots from #106. The first box, I think, is helpful in holding the reader's hand to make the connection that the heatmap is talking about the datasets.

Also, it may be helpful to use rows as features and columns as samples consistently across all boxes and heatmaps.

dvenprasad commented 3 years ago

Notes:

1) I am not 100% confident that I drew the heatmap correctly

2) I added the shapes next to the dataset indicators on the heatmap to help people make the association between shape and dataset

3) For the new feature graph, I'm not sure how to write the weighting of the features. Is it 10% from f1-f33000, 30% from f33001-f66000, and 60% from f66001-f100000?

[image: dimensionality-reduction figure draft]

allaway commented 3 years ago

Hi @dvenprasad - thanks!

Re 1: This looks correct to me! Maybe it would help to just have f1-f100000 (rather than f1-f33000, f33001, ... etc.) in the middle panel to simplify.

Re 3: I think reasonable percentages would be new feature 1 = 60%, new feature 2 = 30%. I think this is what you are saying already?

jaybee84 commented 3 years ago

Re 3: An option could be new feature 1 = (0.1·f1 + 0.001·f2 + 0.9·f3 + ... + 0.07·f100000)
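To make the notation concrete, here is a toy numpy sketch (the dimensions and weights are made up, and it uses PCA as the specific example): each "new feature" is exactly this kind of weighted sum of the original features, with the weights given by the component's loading vector.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # 20 samples x 5 original features

Xc = X - X.mean(axis=0)        # center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

weights = Vt[0]                # loadings: the w's in w1*f1 + w2*f2 + ...
new_feature_1 = Xc @ weights   # one weighted sum per sample

# This weighted sum is identical to the PC1 scores from the SVD.
print(np.allclose(new_feature_1, U[:, 0] * S[0]))
```

So the caption's "new feature 1 = (0.1·f1 + 0.001·f2 + ...)" form is accurate for any linear dimensionality reduction method; the weights just come from the method's learned loadings.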

dvenprasad commented 3 years ago

Okay thank you! @jaybee84 yes! That is what I wanted to add.

New version:

[image: dimensionality-reduction figure, new version]