hugobowne closed this issue 6 years ago
I found this dataset in this paper and have been running a couple of tests on it; it seems to work well. The dataset comprises 3005 single-cell RNA-seq mouse samples, where each cell's class label ("level1class") belongs to one of nine classes.
I did KNN (n=5) with and without PCA on "level1class" and got the following results. I think it might be interesting to discuss how doing PCA affects execution time and precision of the model. Let me know your thoughts.
Execution time: 2,620 ms
Execution time: 17 ms
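For discussion, here is a minimal sketch of this kind of comparison in plain Python. It is not the code that produced the timings above and it does not use the mouse dataset: the data are synthetic, and every name in it (`make_data`, `first_component`, `project`, `knn_predict`, `timed_accuracy`) is hypothetical. It assumes two well-separated classes, keeps a single principal component (found by power iteration), and reports accuracy rather than precision. The point it illustrates is only that KNN distance computations scale with the number of features, so reducing the features first tends to shorten prediction time.

```python
import math
import random
import time
from collections import Counter


def make_data(n_per_class, n_features=50, n_informative=5, seed=0):
    """Synthetic stand-in data (an assumption): two classes, few informative features."""
    rng = random.Random(seed)
    xs, ys = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            for j in range(n_informative):
                row[j] += 10.0 * label  # shift class 1 along the informative features
            xs.append(row)
            ys.append(label)
    return xs, ys


def first_component(rows, iterations=50):
    """Return (column means, unit vector) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # Power iteration: multiply by X^T X without forming the covariance matrix.
        scores = [sum(a * b for a, b in zip(row, v)) for row in centered]
        w = [sum(s * row[j] for s, row in zip(scores, centered)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def project(rows, means, v):
    """Center each row with the training means and project it onto v."""
    return [[sum((r[j] - means[j]) * v[j] for j in range(len(v)))] for r in rows]


def knn_predict(train_x, train_y, query, k=5):
    """Majority vote among the k training rows closest to `query`."""
    nearest = sorted(
        (math.dist(row, query), label) for row, label in zip(train_x, train_y)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def timed_accuracy(train_x, train_y, test_x, test_y, k=5):
    """Return (accuracy, elapsed milliseconds) for KNN predictions on the test rows."""
    start = time.perf_counter()
    predictions = [knn_predict(train_x, train_y, q, k) for q in test_x]
    elapsed_ms = (time.perf_counter() - start) * 1000
    accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
    return accuracy, elapsed_ms


train_x, train_y = make_data(100, seed=0)
test_x, test_y = make_data(25, seed=1)

acc_full, ms_full = timed_accuracy(train_x, train_y, test_x, test_y)

# Fit the component on the training rows only, then apply it to both sets.
means, v = first_component(train_x)
acc_pca, ms_pca = timed_accuracy(
    project(train_x, means, v), train_y, project(test_x, means, v), test_y
)

print(f"all 50 features: accuracy={acc_full:.2f}, time={ms_full:.1f} ms")
print(f"1 component:     accuracy={acc_pca:.2f}, time={ms_pca:.1f} ms")
```

The timing of the PCA fit itself is deliberately left out of `timed_accuracy`; whether to count it is one of the things worth discussing in the workshop.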
@jorgepda I like the idea.
It's also cool to introduce precision as I've only dealt with accuracy so far in the material.
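To make the distinction concrete, here is a small sketch in plain Python. The labels "A", "B" and "C" and the function names are made up for illustration; they are not the real level1class values or anything from the workshop material. Accuracy is the fraction of all predictions that are correct, while precision is computed per class: of the samples predicted as a class, the fraction that truly belong to it.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of all predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_per_class(y_true, y_pred):
    """For each predicted class: correct predictions / total predictions of that class."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / predicted[label] for label in predicted}


# Hypothetical labels, chosen only to show that the two metrics can differ.
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "A", "C"]

print(accuracy(y_true, y_pred))
print(precision_per_class(y_true, y_pred))
```

Here 6 of 8 predictions are correct, so accuracy is 0.75, while precision is 3/4 for "A", 2/3 for "B" and 1/1 for "C". With nine classes, a per-class view like this can show which classes the model confuses even when the overall accuracy looks fine.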
One question is how to run the workshop when we have several datasets but I think @JasonJWilliamsNY can advise us on this when we walk him through the material developed.
@jorgepda, can you share reproducible code that produces the above and that you think would work well in the workshop (ideally it would use the same function/API that I use here so that we're not teaching too many different ways to do one thing)? Please do so in a branch in this repository (feel free to do so in a new file).
I've opened another issue for you all to confirm that my code works on your computer.
OK team, we had planned to have the next iteration of the material ready for @JasonJWilliamsNY to check out today. We may be a bit behind schedule.
@jorgepda, when can you have a first iteration of the "single-cell RNA-seq mouse samples" data material ready?
I can write the regression material relatively easily by the middle of next week but, as I mentioned above, I don't know of any datasets relevant to genomics. @jorgepda, @Kapeel, any suggestions here?
I can probably then create the glmnet material, which will be a nice addition, but this may need to come later, time permitting.
Sorry for the tardiness. Full disclosure: I ran the classification in Python first to make sure the dataset yielded good results, and it took me a lot longer to do it in R because of some pesky errors I couldn't get rid of. The reproducible example is here.
@hugobowne For the genomics dataset with a continuously varying target variable, what ML method were you thinking of using?
no worries, @jorgepda.
I have made a first pass at the regression material in this branch.
I used linear regression and then regularized regression but feel free to use anything else also.
To decouple issues (the development of regression/glmnet material from the introduction of genomics data), I'm going to close this issue now and open one specifically to do with material using genomics data.
I'll also open another issue for me to take another pass at all the material that I've developed.
First draft of material is here: https://github.com/hugobowne/machine-learning-in-r/blob/master/ML_R_draft.Rmd
You can find the resulting html here: https://github.com/hugobowne/machine-learning-in-r/blob/master/ML_r_draft.html
I have decided to develop the material in an R markdown notebook, as it provides an easy workflow.
I will create another issue in this repo about converting it into a Carpentry lesson, which we can do after getting the material close to the final product.
Before this, I need to fill in some gaps in the motivation/explanation.
We also need to add more material.
I currently have unsupervised learning & supervised learning (for classification), both using the UCI breast cancer diagnosis dataset.
In this issue, I'm requesting everybody's thoughts on adding sections on:
glmnet
I welcome volunteers to work on each of these. Thanks!