rsangole / capstone_project

Predict 498 Capstone Project

Modelling Task List - Rahul #23

Closed rsangole closed 5 years ago

rsangole commented 5 years ago

Maintaining a task list for myself here.

Data Processing

EDA and Hypothesis Testing

Feature Engineering

Reading

Modeling

Feature Reduction Activities

Reporting

rsangole commented 5 years ago

@andrew3cooper @kapelinskim6 @stephenhage - modellers: you may benefit from this brainstorming list I'm making for myself too...

rsangole commented 5 years ago

Notes from reading this:

The modelling approach was in line with what I had in mind: a regression model to predict the number of mosquitoes, plus a classification model to predict WNV presence.

A few ideas:

  1. Try log(number of mosquitoes)
  2. New features: species-month and trap-month combination variables
  3. 2012 data is weirdly high in the number of mosquitoes -- either adjust for this, or accept that it'll affect model performance
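
A rough sketch of how these three ideas could look in code. This is only an illustration: the records, field names, and the `engineer` helper are made up, `log1p` (log of 1 + count) is swapped in for a plain log so that zero counts stay defined, and the 2012 year flag is just one possible way to "adjust".

```python
import math
from datetime import date

# Hypothetical trap records: (date, trap id, species, mosquito count)
records = [
    (date(2011, 7, 15), "T002", "CULEX PIPIENS", 12),
    (date(2012, 8, 3), "T002", "CULEX RESTUANS", 0),
    (date(2012, 8, 3), "T015", "CULEX PIPIENS", 49),
]

def engineer(record):
    """Return a feature dict with a log target and month interaction keys."""
    day, trap, species, count = record
    return {
        # log1p(count) = log(1 + count), so a zero count maps to 0.0
        "log_count": math.log1p(count),
        # Combination-type categorical features (idea 2)
        "species_month": f"{species}_{day.month:02d}",
        "trap_month": f"{trap}_{day.month:02d}",
        # Assumed adjustment for idea 3: an indicator the model can use
        # to absorb the unusually high 2012 counts
        "is_2012": int(day.year == 2012),
    }

features = [engineer(r) for r in records]
```

Predictions made on the log scale can be mapped back to counts with `math.expm1`.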

rsangole commented 5 years ago

image

andrew3cooper commented 5 years ago

I agree with the approach. I'm working in parallel on some of these issues.

I just constructed weekly time series at the trap level without imputation (a multiple time series object). It's informative. I repeated this with weekly time series at the community area level, which is also informative. We may be able to use these data without going all the way up to the monthly level. We will benefit from some kind of clustering, as you previously suggested, @rsangole. I'd like to make a quick attempt at this by simply combining adjacent communities when they have sparse data.
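
For anyone following along, here is a minimal Python sketch of the idea (not the R code mentioned here). The observations, the `neighbours` map, the `min_rows` threshold, and both helper functions are hypothetical; a real adjacency map would have to come from the community area boundaries.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical observations: (date, trap id, community area, mosquito count)
observations = [
    (date(2013, 7, 1), "T001", "Area A", 5),
    (date(2013, 7, 3), "T001", "Area A", 7),
    (date(2013, 7, 10), "T001", "Area A", 2),
    (date(2013, 7, 2), "T009", "Area B", 4),
]

def weekly_series(rows, key_index):
    """Sum counts per (key, ISO year, ISO week).

    No imputation: weeks without observations are simply absent.
    """
    totals = defaultdict(int)
    for row in rows:
        iso = row[0].isocalendar()
        totals[(row[key_index], iso[0], iso[1])] += row[3]
    return dict(totals)

def merge_sparse(rows, neighbours, min_rows=2):
    """Relabel areas with fewer than min_rows observations to an adjacent area."""
    counts = Counter(row[2] for row in rows)
    mapping = {
        area: neighbours[area]
        for area, n in counts.items()
        if n < min_rows and area in neighbours
    }
    return [(d, t, mapping.get(a, a), c) for d, t, a, c in rows]

by_trap = weekly_series(observations, key_index=1)

# Assumed adjacency: "Area B" borders "Area A"
merged = merge_sparse(observations, neighbours={"Area B": "Area A"})
by_area = weekly_series(merged, key_index=2)
```

The merge here is a single pass, so it does not handle chains of sparse areas; a proper clustering step would need to.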

I'll see if I can push my R code and plots to GitHub on a work break later today.

andrew3cooper commented 5 years ago

Another quick note -- at a very high level -- I had assumed we'd be doing prediction using observed (future) weather data rather than forecasting without any future data. That's basically how Kaggle set up the competition as well, despite the fact that the train/test splits were in alternating years.

I talked about using lagged trap results in another post. I think that's something we can come back to at the end as a stretch goal, to demonstrate that the visualization platform can give within-season forecasts 1 to x weeks into the future. There's real business (public health) value in doing so.
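
A minimal sketch of what lagged trap features might look like, assuming one trap's weekly counts within a single season. The `add_lags` helper and the example numbers are hypothetical, and a lag is left as `None` when that earlier week was not observed.

```python
def add_lags(series, lags=(1, 2)):
    """Attach lagged counts to each (week, count) pair for one trap.

    A lag is None when the trap has no observation for that earlier week.
    """
    by_week = dict(series)
    rows = []
    for week, count in series:
        row = {"week": week, "count": count}
        for k in lags:
            row[f"lag_{k}"] = by_week.get(week - k)
        rows.append(row)
    return rows

# Hypothetical weekly counts for one trap; week 29 was not observed
series = [(27, 12), (28, 2), (30, 9)]
rows = add_lags(series)
```

Forecasting k weeks ahead would then only use lags of k weeks or more, since nearer results are not yet available at prediction time.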