Modelling Task List - Rahul

rsangole commented 5 years ago

Maintaining a task list for myself here.

Data Processing

[x] define groups of variables to use as predictors
[x] identify near-zero-variance and zero-variance predictors
[x] train-val-test splits
[ ] convert daily data into monthly averages
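
For the open daily-to-monthly item, a minimal sketch of the aggregation, written in plain Python for illustration rather than in the project's R code. The function name `monthly_averages` and the sample values are made up for the example; the real data layout is an assumption.

```python
from collections import defaultdict
from datetime import date


def monthly_averages(daily):
    """Average (date, value) pairs by calendar month.

    Returns a dict keyed by (year, month). Months with no
    observations are simply absent; nothing is imputed.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for day, value in daily:
        key = (day.year, day.month)
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / n for key, (total, n) in sorted(sums.items())}


# Made-up daily readings, for illustration only.
daily = [
    (date(2013, 6, 1), 20.0),
    (date(2013, 6, 2), 22.0),
    (date(2013, 7, 1), 30.0),
]
monthly = monthly_averages(daily)
```

Keying on `(year, month)` rather than month alone keeps the same month in different years separate.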

EDA and Hypothesis Testing

[x] what lag terms do we need? perform time series analysis with ACF/PACF plots
[x] for weather data - can we stick to just using O'Hare data?
[x] for just O'Hare data, how much weather information do we need? opportunity to reduce vars using PCA?
[ ] what does clustering show us? anything useful?
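
As a rough illustration of what an ACF plot is built from, the sample autocorrelation at each lag can be computed as below. This is a sketch in plain Python, not the analysis actually run for this issue; the function name `acf` is chosen for the example.

```python
def acf(series, max_lag):
    """Sample autocorrelation of `series` for lags 0..max_lag.

    Uses the usual biased estimator: the lag-k autocovariance
    divided by the lag-0 autocovariance.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    denom = sum(c * c for c in centered)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    return [
        sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]


r = acf([1, 2, 3, 4, 5], max_lag=2)
```

Lags whose autocorrelation stands clearly away from zero are the natural candidates for lag terms.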

Feature Engineering

[x] weight of evidence variables for large-level factor variables
[ ] introduce appropriate lag terms
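
For the lag-term item, one possible shape of the feature table, sketched in plain Python. The helper `add_lags` and the column names `y` and `lag_k` are hypothetical; which lags to use would come from the ACF/PACF analysis.

```python
def add_lags(series, lags):
    """Return one row per time step holding the value and its lagged copies.

    Lags that reach before the start of the series are left as None
    rather than filled in.
    """
    rows = []
    for t, value in enumerate(series):
        row = {"y": value}
        for k in lags:
            row[f"lag_{k}"] = series[t - k] if t - k >= 0 else None
        rows.append(row)
    return rows


rows = add_lags([5, 7, 9], lags=[1, 2])
```

Rows containing `None` would have to be dropped or handled before model fitting.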

Reading

[x] finish reading the antonUBC Kaggle code for inspiration

Modeling

[x] set up the code structure to use mlr correctly (especially to use the model comparison codes)

Feature Reduction Activities

[x] lasso regression based feature selection
[x] random forest variable-importance plot based feature selection
[x] information value based feature reduction
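
Information value is built on the weight of evidence of each factor level, so both can be sketched together. The version below is plain Python and assumes a binary label (1 = event), a small additive smoothing term so that empty cells do not produce infinite values, and the sign convention log(share of events / share of non-events); other sources use the opposite sign. The function name `woe_iv` and the `smooth` parameter are choices made for this example.

```python
import math
from collections import Counter


def woe_iv(levels, labels, smooth=0.5):
    """Weight of evidence per level and total information value.

    levels: factor level for each row
    labels: 1 for an event, 0 for a non-event
    smooth: pseudo-count added to every cell (an assumption of this sketch)
    """
    pos = Counter(lv for lv, y in zip(levels, labels) if y == 1)
    neg = Counter(lv for lv, y in zip(levels, labels) if y == 0)
    cats = sorted(set(levels))
    total_pos = sum(pos.values()) + smooth * len(cats)
    total_neg = sum(neg.values()) + smooth * len(cats)
    woe, iv = {}, 0.0
    for c in cats:
        p = (pos[c] + smooth) / total_pos  # share of events in this level
        q = (neg[c] + smooth) / total_neg  # share of non-events in this level
        woe[c] = math.log(p / q)
        iv += (p - q) * woe[c]
    return woe, iv


# Made-up factor and labels, for illustration only.
woe, iv = woe_iv(["a", "a", "a", "b", "b", "b"], [1, 1, 0, 0, 0, 1])
```

Each term of the sum is non-negative, so a larger information value means the factor separates events from non-events more strongly.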

Reporting

rsangole commented 5 years ago

@andrew3cooper @kapelinskim6 @stephenhage - modellers: you may benefit from this brainstorming list I'm making for myself too...

rsangole commented 5 years ago

Notes from reading this:

Modelling approach was in line with what I had in mind -- a regression model for predicting the number of mosquitoes + a classification model to predict WNV presence.

Few ideas:

Try log(# of mosq)
new features - species-month, trap-month combination type variables
2012 data is weirdly high in the num of mosq -- either adjust for this, or accept it'll affect model performance
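
A sketch of the first two ideas in plain Python. It swaps plain log for `log1p` (log of 1 + count) because a trap can record zero mosquitoes; that substitution, the function name `engineer`, and the field names `date`, `species`, `trap` and `num_mosq` are assumptions made for the example, not the project's actual schema.

```python
import math


def engineer(record):
    """Add a log-scaled count and species-month / trap-month keys to one record."""
    month = record["date"][5:7]  # assumes ISO 'YYYY-MM-DD' date strings
    return {
        **record,
        "log_mosq": math.log1p(record["num_mosq"]),  # log(1 + count), defined at zero
        "species_month": f"{record['species']}_{month}",
        "trap_month": f"{record['trap']}_{month}",
    }


# Placeholder record, for illustration only.
row = engineer(
    {"date": "2012-08-15", "species": "SPECIES_A", "trap": "TRAP_1", "num_mosq": 0}
)
```

The combined keys are high-cardinality factors, so they would likely need an encoding such as weight of evidence before modelling.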

andrew3cooper commented 5 years ago

I agree with the approach. I'm working in parallel on some of these issues.

I just constructed weekly time series at the trap level without imputation (a multiple time series object). It's informative. I repeated with weekly time series at the community area level. Also informative. We may be able to use these data without going all the way up to the monthly level. We will benefit from some kind of clustering as you previously suggested, @rsangole. I'd like to make a quick attempt to do this by simply combining adjacent communities when they have sparse data.
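
One way the "combine adjacent communities when sparse" idea could look, as a single-pass sketch in plain Python. The function `merge_sparse`, the `min_obs` threshold, the adjacency map and the counts are all hypothetical; the real adjacency would have to come from the community area boundaries.

```python
def merge_sparse(counts, neighbours, min_obs):
    """Map each community to a group label.

    Communities with at least `min_obs` observations keep their own label.
    A sparse community joins its best-observed adjacent community that
    meets the threshold, or stays on its own if there is none.
    """
    group = {}
    for area, n in counts.items():
        if n >= min_obs:
            group[area] = area
            continue
        dense = [
            nb for nb in neighbours.get(area, []) if counts.get(nb, 0) >= min_obs
        ]
        group[area] = max(dense, key=lambda nb: counts[nb]) if dense else area
    return group


# Made-up observation counts and adjacency, for illustration only.
counts = {"A": 50, "B": 3, "C": 40, "D": 2}
neighbours = {"B": ["A", "C"], "D": []}
groups = merge_sparse(counts, neighbours, min_obs=10)
```

A sparse area with no well-observed neighbour is left alone here; a fuller version might merge sparse neighbours with each other until the threshold is met.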

I'll see if I can push my R code and plots to GitHub on a work break later today.

andrew3cooper commented 5 years ago

Another quick note -- at a very high level -- I had assumed we'd be doing prediction using observed (future) weather data rather than forecasting without any future data. That's basically how Kaggle had the competition set up as well, despite the fact that train/test splits were in alternating years.

I talked about using lagged trap results in another post. I think that's something we can come back to at the end as a stretch goal to demonstrate that the visualization platform can give within-season forecasts 1 to x many weeks in the future. There's real business (public health) value in doing so.

rsangole / capstone_project

Modelling Task List - Rahul #23