Open radaniba opened 8 years ago
I am super interested in this! I can't commit to actually delivering any sessions until May(ish), but that's ok - we can line these up as a summer series. Let me know what you're imagining and we'll pull something together.
Awesome! I am actually thinking about putting into practice everything you guys are doing with data munging techniques, public data extraction, and cleaning, and applying it to real-world problems.
We can also use data from Kaggle competitions (I think they are public).
We can structure sessions around the following (very raw thoughts, but I can come up with a clean and detailed plan):
Data related
Models related
Evaluations and Metrics
Depending on the demand and the time, I can create a long or a short plan. Let's see the feedback here first.
Sign me up please!!
I've been using WEKA in a very simplistic fashion. It would be good to learn why you would choose a certain machine learning model, and when it's appropriate to use a decision tree, random forest, SVM, ANN, etc.
Agree with @BillMills that summer is probably a good time to start these sessions!
@radaniba I was thinking about using the Kaggle Ocean Ship Logbooks dataset for my own upcoming session on RStudio/leaflet.
@minisciencegirl: definitely! That's exciting. @SimonGoring: that's a very rich topic to study indeed. Why did you pick this one in particular? (Downloading the data to take a look at it.)
I study past climate & work with biological records, so this data set has both :) I'm not entirely sure what we could pull out from a machine learning perspective, but it's a rich spatio-temporal data set that has the added bonus of pirates (possibly).
Oooh, please webcast this one, I'm definitely interested in attending from afar... been meaning to learn this stuff for a while now
Where do I sign up!
Here's my wish list at any rate, and I'd really like to nail down how to put concepts into code:
And all the ways of mixing... Auto-regressive mixture model, autoregressive model, Gaussian mixture model, Gaussian mixture model - hidden Markov model, Gaussian hidden Markov model... apparently all these are different. I'd like to learn how they differ and how you go about knowing which one to use for an analysis. And, most importantly, how to implement them effectively in Python.
@Frikster thanks for the feedback. Obviously you see that a lot in your field :) and you're not alone. To answer your request quickly: yes, we can add those as part of the data science pipeline a person should have for processing data. For example, PCA is commonly used for dimensionality reduction when several features describe the data at hand. I will take this request into consideration while preparing the course material (and guys, feel free to ask for what you need to see in this course). Thanks again @Frikster
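To make the PCA remark a bit more concrete, here is a minimal sketch in plain Python, not course material. It assumes a tiny made-up 2-feature dataset (`points` is hypothetical) and finds only the first principal component by power iteration on the covariance matrix; the function names are illustrative. A real session would more likely use a library implementation.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (feature means, unit vector along the direction of greatest variance)."""
    n = len(rows)
    dims = len(rows[0])
    # Center each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(dims)]
    centered = [[row[j] - means[j] for j in range(dims)] for row in rows]
    # Sample covariance matrix (dims x dims).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Power iteration: repeatedly multiply by cov and renormalize.
    direction = [1.0] * dims
    for _ in range(iterations):
        product = [
            sum(cov[i][j] * direction[j] for j in range(dims)) for i in range(dims)
        ]
        norm = math.sqrt(sum(x * x for x in product))
        direction = [x / norm for x in product]
    return means, direction


def project(rows, means, direction):
    """Reduce each row to a single score along the given direction."""
    return [
        sum((row[j] - means[j]) * direction[j] for j in range(len(direction)))
        for row in rows
    ]


# Hypothetical toy data: the second feature is roughly twice the first.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
means, direction = first_principal_component(points)
scores = project(points, means, direction)  # two features reduced to one
print(direction)
print(scores)
```

Because the two toy features are almost perfectly correlated, one score per point keeps nearly all of the variance, which is the sense in which PCA reduces dimensionality.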
@kazi11 definitely! We will come back to this once we have something finalized :)
Would love to hear from you about public data sets you would love to use for some machine learning application during these data science sessions. The reason I am asking you is that I want the lectures and the exercises to make sense to anyone.
Let me know if you have suggestions
@radaniba I'm classifying a whole bunch of listserv messages (repo here: https://github.com/SimonGoring/ESA_Shiny) and having trouble getting error rates down below ~20%. I just looked at the repo and realized it's a bit of a mess though :)
Basically, we're taking messages, classifying a subset by job type, and then trying to predict the rest. We've hand-classified a bit less than 1% of all messages and have an error rate of around 20%. I'm just using randomForest for now (https://github.com/SimonGoring/ESA_Shiny/blob/master/R/load_models.R), and it didn't give any improvement over a boosted regression tree. All in R.
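For readers following the workflow described above (label a subset, train, estimate the error rate), here is a rough sketch in plain Python rather than R. Everything in it is an assumption for illustration: the messages and labels in `labeled` are invented, and the "model" is only a majority-class baseline, not the randomForest setup from the repo. It shows how a held-out error rate is computed and gives a reference number to compare a figure like 20% against.

```python
import random
from collections import Counter


def error_rate(true_labels, predicted_labels):
    """Fraction of predictions that disagree with the true labels."""
    wrong = sum(t != p for t, p in zip(true_labels, predicted_labels))
    return wrong / len(true_labels)


def holdout_split(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def majority_label(labels):
    """Most common label; the simplest possible 'classifier'."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical hand-classified messages: (text, job type).
labeled = [
    ("Postdoc in community ecology", "postdoc"),
    ("PhD position, forest dynamics", "student"),
    ("Postdoctoral fellow, paleoecology", "postdoc"),
    ("MSc assistantship in limnology", "student"),
    ("Postdoc: climate modelling", "postdoc"),
    ("Tenure-track faculty, biology", "faculty"),
    ("Postdoc in data science", "postdoc"),
    ("PhD student wanted, soil carbon", "student"),
]

train, test = holdout_split(labeled)
guess = majority_label([label for _, label in train])
test_labels = [label for _, label in test]
baseline_error = error_rate(test_labels, [guess] * len(test_labels))
print(f"baseline error on held-out messages: {baseline_error:.2f}")
```

Any real classifier would be swapped in where `majority_label` is used; if its held-out error is not clearly below the baseline, the features are probably not carrying much signal.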
@SimonGoring that looks cool, but I'm not sure I understand what you're trying to do here (classification?). Any clarification with some examples would be great (maybe we can take the discussion over to your repo's issues).
Hello folks,
I went through all the issues here and, although we keep talking about data science, I didn't find any proposals for machine learning sessions using either R or Python.
Who is willing to co-start a series of machine learning sessions where we cover the most used algorithms in data science for classification, regression, or clustering? I can come up with a plan if people are interested.
Let me know