Further workshop material #2

hugobowne commented 6 years ago

First draft of material is here: https://github.com/hugobowne/machine-learning-in-r/blob/master/ML_R_draft.Rmd

You can find the resulting html here: https://github.com/hugobowne/machine-learning-in-r/blob/master/ML_r_draft.html

I have decided to develop the material in an R markdown notebook, as it provides an easy workflow.

I will create another issue in this repo about converting it into a Carpentry lesson, which we can do after getting the material close to the final product.

Before this, I need to fill in some gaps in the motivation/explanation.

We also need to add more material.

I currently have unsupervised learning & supervised learning (for classification), both using the UCI breast cancer diagnosis dataset.

In this issue, I'm requesting everybody's thoughts on adding sections on:

[x] classification using a dataset more relevant to the attendees, ideally a genomics/bioinformatics dataset
[x] regression challenges in ML, i.e., predicting a continuously varying target variable instead of a category (I know several datasets we could use for this, but none are relevant to bioinformatics, so help is requested here)
[x] regularized regression and variable selection, e.g., using glmnet

I welcome volunteers to work on each of these. Thanks!

jorgepda commented 6 years ago

I found this dataset from this paper and have been running a couple of tests on it; it seems to work well. The dataset comprises 3005 single-cell RNA-seq mouse samples, where each cell's label ("level1class") belongs to one of nine classes.

I did KNN (n=5) with and without PCA on "level1class" and got the following results. I think it might be interesting to discuss how doing PCA affects execution time and precision of the model. Let me know your thoughts.

Without PCA

Execution time: 2,620 ms

	precision	recall	f1-score	support
astrocytes_ependymal	0.88	0.20	0.33	70
endothelial-mural	0.75	0.04	0.08	72
interneurons	1.00	0.34	0.51	85
microglia	1.00	0.11	0.21	35
oligodendrocytes	0.42	0.99	0.59	249
pyramidal CA1	0.70	0.67	0.68	261
pyramidal SS	1.00	0.06	0.12	130
avg/total	0.72	0.53	0.46	902

With PCA

Execution time: 17 ms

	precision	recall	f1-score	support
astrocytes_ependymal	0.88	0.76	0.82	70
endothelial-mural	0.90	0.49	0.63	72
interneurons	0.98	0.54	0.70	85
microglia	1.00	0.11	0.21	35
oligodendrocytes	0.73	0.98	0.84	249
pyramidal CA1	0.62	0.98	0.76	261
pyramidal SS	1.00	0.04	0.07	130
avg/total	0.80	0.71	0.65	902
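
As a quick sanity check on how to read these tables, here is a minimal Python sketch using the rows of the "With PCA" table above. It assumes (the thread does not say so) that f1-score is the harmonic mean of precision and recall and that the avg/total row is a support-weighted mean of the per-class values. The helper names `f1_score` and `weighted_mean` are just for illustration.

```python
# Rows copied from the "With PCA" table: (class, precision, recall, f1, support)
rows = [
    ("astrocytes_ependymal", 0.88, 0.76, 0.82, 70),
    ("endothelial-mural", 0.90, 0.49, 0.63, 72),
    ("interneurons", 0.98, 0.54, 0.70, 85),
    ("microglia", 1.00, 0.11, 0.21, 35),
    ("oligodendrocytes", 0.73, 0.98, 0.84, 249),
    ("pyramidal CA1", 0.62, 0.98, 0.76, 261),
    ("pyramidal SS", 1.00, 0.04, 0.07, 130),
]


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_mean(values, weights):
    """Mean of values, each weighted by the matching entry in weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[4] for row in rows]
total_support = sum(supports)
avg_precision = weighted_mean([row[1] for row in rows], supports)
avg_recall = weighted_mean([row[2] for row in rows], supports)
astro_f1 = f1_score(rows[0][1], rows[0][2])

print(total_support, round(avg_precision, 2), round(avg_recall, 2), round(astro_f1, 2))
```

Under those assumptions the script reproduces the support total (902) and the avg/total precision and recall (0.80 and 0.71). Because the inputs are already rounded to two decimals, recomputed f1 values can differ from the table in the last digit for some classes.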

mouse_data.csv.gz
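
For anyone who wants to see what the KNN step with 5 neighbours does before we settle on an R function for the workshop, here is a minimal pure-Python sketch. It is not the code that produced the tables above: the function name `knn_predict`, the toy two-feature points, and the labels attached to them are all made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=5):
    """Return the majority label among the k training points closest to query."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # Majority vote over the k nearest neighbours.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: two well-separated clusters of three points each.
train_X = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
train_y = ["microglia"] * 3 + ["interneurons"] * 3

print(knn_predict(train_X, train_y, (0.1, 0.1)))
print(knn_predict(train_X, train_y, (5.0, 5.0)))
```

Every prediction computes a distance to every training point across all features, which is one plausible reason that reducing the number of features with PCA first could cut the execution time so sharply.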

hugobowne commented 6 years ago

@jorgepda I like the idea.

It's also cool to introduce precision as I've only dealt with accuracy so far in the material.
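
To make the accuracy vs. precision distinction concrete, here is a small Python sketch on made-up labels (ten hypothetical cells, not the mouse data). The function names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical imbalanced toy labels: 8 cells of one class, 2 of a rarer one.
y_true = ["oligodendrocytes"] * 8 + ["microglia"] * 2
# The classifier misses one of the two rare cells.
y_pred = ["oligodendrocytes"] * 9 + ["microglia"]

print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, "microglia"))
```

Here overall accuracy is 0.9, while the rare class has precision 1.0 but recall only 0.5. That is the same pattern as the microglia rows above (precision 1.00, recall 0.11), which accuracy alone would hide.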

One question is how to run the workshop when we have several datasets, but I think @JasonJWilliamsNY can advise us on this when we walk him through the material developed.

@jorgepda, can you share reproducible code that produces the above and that you think would work well in the workshop (ideally it would use the same function/API that I use here so that we're not teaching too many different ways to do one thing)? Please do so in a branch in this repository (feel free to do so in a new file).

I've opened another issue for you all to confirm that my code works on your computer.

hugobowne commented 6 years ago

ok team, we had plans to have the next iteration of the material ready for @JasonJWilliamsNY to check out today. We may be a bit behind schedule.

@jorgepda, when can you have the first iteration of the "single-cell RNA-seq mouse samples" data material ready?

I can relatively easily write the regression material by mid next week but, as I mentioned above, I don't know of any datasets relevant to genomics. @jorgepda, @Kapeel , any suggestions here?

I can probably then create the glmnet material, which will be a nice addition, but this may need to come later, time permitting.

jorgepda commented 6 years ago

Sorry for the tardiness, but full disclosure: I ran the classification in Python to make sure the dataset yielded good results, and it took me a lot longer to do it in R because of some pesky errors I couldn't get rid of. The reproducible example is here.

@hugobowne For the genomics dataset with a continuously varying target variable, what ML method were you thinking of using?

hugobowne commented 6 years ago

no worries, @jorgepda.

I have made a first pass at the regression material in this branch.

I used linear regression and then regularized regression but feel free to use anything else also.
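
For intuition on what regularization changes, here is a minimal Python sketch with one feature and made-up numbers. It is not the code in the branch and not how glmnet works internally: it uses a plain ridge (L2) penalty as the simplest case, and glmnet standardizes features and scales its penalty differently, so `penalty` here is not comparable to a glmnet lambda. The function name `fit_slope` is hypothetical.

```python
def fit_slope(x, y, penalty=0.0):
    """Slope of y on x with an unpenalized intercept.

    penalty=0.0 gives ordinary least squares; penalty > 0 gives the ridge
    estimate, which minimizes sum((y - a - b*x)**2) + penalty * b**2.
    """
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # The penalty inflates the denominator, shrinking the slope toward zero.
    return sxy / (sxx + penalty)


# Hypothetical toy data lying close to the line y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ols_slope = fit_slope(x, y)
ridge_slope = fit_slope(x, y, penalty=10.0)
print(ols_slope, ridge_slope)
```

The ridge slope comes out at about half the least-squares slope on this toy data, which illustrates the shrinkage that regularized regression trades for lower variance.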

To decouple issues (the development of regression/glmnet material from the introduction of genomics data), I'm going to close this issue now and open one specifically to do with material using genomics data.

I'll also open another issue for me to take another pass at all the material that I've developed.
