Machine Learning using R and Python

radaniba commented 8 years ago

Hello folks,

I went through all the issues here and, although we keep talking about data science, I didn't find any proposals for machine learning using either R or Python.

Who is willing to co-start a series of machine learning sessions covering the most used algorithms in data science for classification, regression and clustering? I can come up with a plan if people are interested.

Let me know

bkatiemills commented 8 years ago

I am super interested in this! I can't commit to actually delivering any sessions until May(ish), but that's ok - we can line these up as a summer series. Let me know what you're imagining and we'll pull something together.

radaniba commented 8 years ago

Awesome! I am thinking about putting into practice everything you are all doing with data munging, public data extraction and cleaning, and applying it to real-world problems.

We can also use data from Kaggle competitions (I think they are public).

We can structure sessions around (very raw thoughts but I can come up with a very clean and detailed plan):

Data related

Data collection, feature selection and feature engineering
Processing data for modeling: categorical feature handling, dealing with missing data, etc.
Data visualization techniques

Models related

Classification :
- Building models for predictions
- Classification of linear data
- Classification of non linear data
- Binary and multi-class predictions
Regression
- Handling time series
- Predicting continuous outcome
- Different algorithms for regression

Evaluations and Metrics

Predictive accuracy for new data : generalization
Overfitting and Underfitting
The importance of cross validation
Metrics for classification and regression

Depending on demand and time I can create a long or a short plan. Let's see the feedback here first.
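To make the cross-validation item in the outline above a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are made up, and the toy 1-nearest-neighbour classifier is only a stand-in for whatever model a session would actually cover. The point is the pattern: every sample is held out exactly once, and the model is always scored on data it was not fitted on.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def nearest_neighbour_predict(train_x, train_y, x):
    """Toy 1-nearest-neighbour classifier on 1-D inputs (stand-in for any model)."""
    best = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[best]

def cross_val_accuracy(xs, ys, k=5, seed=0):
    """Return the accuracy measured on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        # Fit only on the samples that are NOT in the current fold.
        train_x = [xs[i] for i in range(len(xs)) if i not in held]
        train_y = [ys[i] for i in range(len(xs)) if i not in held]
        correct = sum(
            nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i]
            for i in held_out
        )
        scores.append(correct / len(held_out))
    return scores

# Made-up data: class 0 sits near 0, class 1 sits near 10.
xs = [0.1, 0.4, 0.9, 1.2, 1.5, 9.0, 9.3, 9.8, 10.1, 10.6]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

scores = cross_val_accuracy(xs, ys, k=5)
print("per-fold accuracy:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

A large gap between accuracy on the training data and the mean held-out accuracy is the usual sign of overfitting, which is why this pattern comes before any discussion of metrics.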

minisciencegirl commented 8 years ago

Sign me up please!!

I've been using WEKA in a very simplistic fashion. It would be good to learn why you would choose a certain machine learning model, and when it's appropriate to use a decision tree, random forest, SVM, ANN, etc.

Agree with @BillMills that summer is probably a good time to start these sessions!

SimonGoring commented 8 years ago

@radaniba I was thinking about using the Kaggle Ocean Ship Logbooks dataset for my own session on Rstudio/leaflet coming up.

radaniba commented 8 years ago

@minisciencegirl: definitely! That's exciting.

@SimonGoring: that's a very rich topic to study indeed. Why did you pick this one in particular? (Downloading the data to take a look at it.)

SimonGoring commented 8 years ago

I study past climate & work with biological records, so this data set has both :) I'm not entirely sure what we could pull out from a machine learning perspective, but it's a rich spatio-temporal data set that has the added bonus of pirates (possibly).


jstaf commented 8 years ago

Oooh, please webcast this one. I'm definitely interested in attending from afar... I've been meaning to learn this stuff for a while now.

Frikster commented 8 years ago

Where do I sign up?

Here's my wish list at any rate. I'd really like to nail down how to turn concepts into code:

PCA (obviously)
Gaussian models: propose that a phenomenon is built from modules, each a single Gaussian in variable space
Auto-regressive hidden Markov models: propose that a phenomenon is built from modules that are each autoregressive through variable space, and that transition from one to another with definable transition statistics

And all the ways of mixing... Auto-regressive mixture model, autoregressive model, Gaussian mixture model, Gaussian mixture model - hidden Markov model, Gaussian hidden Markov model... apparently all these are different. I'd like to learn how they differ and how you go about deciding which one to use for an analysis. And, most importantly, how to implement them effectively in Python.
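As a rough starting point for the "concept into code" question, here is a minimal sketch of the simplest item on that list: a Gaussian mixture model with two components in one dimension, fitted with expectation-maximisation (EM). The data, the function names and the crude initialisation are all assumptions made for illustration. In practice one would more likely reach for a library implementation such as scikit-learn's `GaussianMixture`.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, n_iter=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    # Crude initialisation (an assumption): means at the extremes,
    # both variances equal to the overall variance, equal weights.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: how responsible is each component for each point?
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsed component
    return weights, means, variances

# Made-up data drawn by hand from two "modules", one near 0 and one near 5.
data = [-0.5, 0.0, 0.2, 0.4, -0.1, 4.8, 5.1, 5.3, 4.9, 5.0]
weights, means, variances = fit_gmm_1d(data)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

The hidden Markov variants on the list keep the same per-module densities but add transition probabilities between modules, so the E-step has to account for the order of the observations instead of treating each point independently.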

radaniba commented 8 years ago

@Frikster thanks for the feedback. You obviously see these a lot in your field :) and you're not alone. To answer your request quickly: yes, we can add those as part of the data science pipeline a person should have for processing data. For example, PCA is commonly used for dimensionality reduction when several features describe the data at hand. I will take this request into consideration while preparing the course material (and everyone, feel free to ask for what you need to see in this course). Thanks again @Frikster
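To illustrate the PCA point, here is a minimal sketch in plain Python, not taken from any course material. It finds only the first principal component, using power iteration on the covariance matrix, and projects the data onto it. The data and function name are made up, and a real session would more likely use NumPy or scikit-learn; this version just shows the steps (centre, covariance, dominant direction, projection).

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, variance along it, projected scores) for the first PC."""
    n, d = len(rows), len(rows[0])
    # 1. Centre each feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # 2. Sample covariance matrix of the features.
    cov = [
        [sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    # 3. Power iteration: repeatedly apply cov and renormalise, which
    #    converges to the eigenvector with the largest eigenvalue.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    variance = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    # 4. Project every centred sample onto that direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    total_variance = sum(cov[j][j] for j in range(d))
    return v, variance, scores, variance / total_variance

# Made-up data: two features where the second is roughly twice the first.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, variance, scores, explained = first_principal_component(rows)
print("direction:", direction)
print("share of variance explained:", explained)
print("1-D scores:", scores)
```

Here two correlated features collapse onto a single score per sample with almost no loss of variance, which is the sense in which PCA reduces dimensionality.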

@kazi11 definitely ! We will come to this once we have something finalized :)

radaniba commented 8 years ago

I would love to hear about public data sets you would like to use for machine learning applications during these data science sessions. I am asking because I want the lectures and the exercises to make sense to everyone.

Let me know if you have suggestions

SimonGoring commented 8 years ago

@radaniba I'm classifying a whole bunch of listserv messages (repo here: https://github.com/SimonGoring/ESA_Shiny) and having trouble getting error rates down below ~20%. I just looked at the repo and realized it's a bit of a mess though :)

Basically, we're taking messages, classifying a subset by job type and then trying to predict the rest. We've hand-classified a bit less than 1% of all messages and have an error rate of around 20%. Just using randomForest though (https://github.com/SimonGoring/ESA_Shiny/blob/master/R/load_models.R). I didn't get any improvement over using a boosted regression tree. All in R.
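For readers following along, the workflow described above (hand-label a subset, fit a classifier, predict the rest, measure the error rate on labelled messages that were held back) can be sketched as below. This is not the code from the linked repo: it is in Python rather than R, it swaps in a tiny bag-of-words naive Bayes classifier instead of randomForest, and the messages and job-type labels are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages, labels):
    """Count words per label for a multinomial naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocab = set()
    for text, label in zip(messages, labels):
        words = text.lower().split()
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing so unseen pairs are not zero.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented hand-labelled subset.
train_messages = [
    "postdoc position in forest ecology",
    "postdoc opening in marine ecology lab",
    "phd studentship in plant genetics",
    "msc and phd student opportunities",
]
train_labels = ["postdoc", "postdoc", "student", "student"]
model = train_naive_bayes(train_messages, train_labels)

# Invented labelled messages held back to estimate the error rate.
held_out = [("postdoc in ecology", "postdoc"), ("phd student wanted", "student")]
errors = sum(predict(model, text) != label for text, label in held_out)
error_rate = errors / len(held_out)
print("held-out error rate:", error_rate)
```

With less than 1% of messages labelled, the error rate measured this way can be quite noisy, so it is worth checking how it varies across different held-out subsets before comparing models.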

radaniba commented 8 years ago

@SimonGoring that looks cool, but I'm not sure I understand what you're trying to do here (classification?). Any clarification with some examples would be great (maybe we can take the discussion over to your repo's issues).
