mlandry22 / rain-part2

How Much Did it Rain Pt 2 - Kaggle Competition

Finding Outliers #8

Open JohnM-TX opened 9 years ago

JohnM-TX commented 9 years ago

So I've started looking at the 'problem within a problem', which is to identify IDs in the test set likely to have unreasonably high Expected(rainfall) values. As pointed out in the forums, these outliers are largely noise and are likely responsible for most of the MAE. Here is the contribution from a validation set using our best XGB model.

[image: MAE contribution on the validation set, best XGB model]

Identifying even a subset of the outliers could significantly reduce MAE! Here's how I've approached the problem so far (a rough sketch follows the list):

  1. From validation results and the training data, identify a subset of outliers for further study
  2. Train a binary classifier to predict whether an ID is an outlier or not based on common characteristics hidden in the data
  3. Apply the model to a validation set and for anything classified as an outlier, provide an alternative prediction for Expected(rainfall)
  4. Apply this composite model to a holdout set and compare MAE before and after
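
For concreteness, here is a minimal sketch of what steps 1-4 could look like in Python. The input names, the classifier settings, and the cutoff/alternative-value constants are all placeholders rather than the actual code behind these results.

```python
import numpy as np
from xgboost import XGBClassifier

# Assumed inputs (placeholders, not code from this repo):
#   X_train, X_valid               - per-ID feature matrices (numpy arrays)
#   expected_train, expected_valid - Expected rainfall per ID
#   base_pred_valid                - validation predictions from the existing XGB regression model

CUTOFF = 0.5        # classifier probability cutoff (tuning discussed later in the thread)
ALT_VALUE = 1000.0  # alternative Expected prediction for flagged IDs

# 1-2. label the suspect IDs in the training data and fit a binary classifier
y_outlier = ((expected_train >= 2500) & (expected_train <= 3500)).astype(int)
clf = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
clf.fit(X_train, y_outlier)

# 3. override the base regression prediction for anything flagged as an outlier
p_outlier = clf.predict_proba(X_valid)[:, 1]
composite_pred = np.where(p_outlier > CUTOFF, ALT_VALUE, base_pred_valid)

# 4. compare MAE before and after on the validation set
print(np.mean(np.abs(base_pred_valid - expected_valid)),
      np.mean(np.abs(composite_pred - expected_valid)))
```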

I've had good results in the lab, but no progress at all on the public LB, and can't say why. I'll include details of work to date in separate comments.

JohnM-TX commented 9 years ago

So here's some more detail on my approach to identifying outliers. Looking at the training set grouped by ID, there is a nice cohort of around 3000 IDs that have an Expected rainfall value in the neighborhood of 2500 to 3500.

[image: distribution of Expected values by ID, showing the 2500-3500 cohort]

Rather than go after all the outliers, I chose to start with this range. There are certainly other choices.
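
Pulling that cohort out of the raw data is just a group-by on Id; here is a small pandas sketch, assuming the competition's train.csv is in the working directory.

```python
import pandas as pd

# one row per radar observation; Expected (gauge rainfall) is constant within each Id
train = pd.read_csv("train.csv")
expected_by_id = train.groupby("Id")["Expected"].first()

# the cohort described above
cohort = expected_by_id[expected_by_id.between(2500, 3500)]
print(len(cohort), "Ids with Expected between 2500 and 3500")
```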

To build the predictive model, I trained a binary classifier on features aggregated by ID (as in step 2 above).

Radar distance was the most important factor for the model. This made sense to me, because radar distance could be one way to identify a specific site, and that specific site could have a glitch that causes the high values.

Applying the model to a validation set (again, without any Expected(rainfall) values), it seems to work. Here's the ROC curve, with an AUC of around 0.95.

[image: ROC curve, AUC ≈ 0.95]

I checked the validation set to see if there was visible separation, and it seems like there is. [image: classifier scores for outlier vs. non-outlier IDs]

My interpretation here is that above 0.5 there are very few false positives, so we should be able to treat those IDs as outliers. Below 0.4, the positives get lost among the negatives and we would be in danger of pulling in too many false positives. So if we stay toward the right side of the range, we should be able to ID some outliers and treat them differently.
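
Here is a rough sketch of this classifier-plus-cutoff check, using a deliberately simplified per-Id feature set (means of a few columns, including radar distance); the feature list, model parameters, and cutoffs are assumptions, not the exact setup behind the plots above.

```python
import pandas as pd
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

train = pd.read_csv("train.csv")
feats = train.groupby("Id").agg(
    radardist_km=("radardist_km", "mean"),
    ref_mean=("Ref", "mean"),
    n_obs=("Ref", "size"),
)
expected = train.groupby("Id")["Expected"].first()
y = expected.between(2500, 3500).astype(int)  # the 2500-3500 cohort label

X_tr, X_va, y_tr, y_va = train_test_split(feats, y, test_size=0.2, random_state=1)
clf = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
clf.fit(X_tr, y_tr)

p = clf.predict_proba(X_va)[:, 1]
print("AUC:", roc_auc_score(y_va, p))

# how clean are the flags at different cutoffs?
for cutoff in (0.4, 0.5, 0.6):
    flagged = p > cutoff
    precision = (y_va[flagged] == 1).mean() if flagged.any() else float("nan")
    print(cutoff, int(flagged.sum()), precision)
```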

JohnM-TX commented 9 years ago

Wrapping up, I used something like stacking on the validation set to get a lower MAE.

I used Excel Solver the first time to iterate through cutoff values and rainfall estimates, which gave a cutoff of around 0.3 and an alternative value of 2800. This lowered MAE on the validation set from 23.5 to around 18. However, on the LB, MAE went up quite a bit :confounded:

The second time around, I played it "safe" and used 0.5 as a cutoff with 1000 as the alternative value. I even applied the model to a double-secret holdout set and got good results (around 21). But again, the LB did not like it. Double :confounded: !
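
For what it's worth, the Excel Solver step could also be done as a brute-force grid search; here p, base_pred, and expected_true are placeholder names for the out-of-sample classifier scores, the base XGB predictions, and the true Expected values on the validation set.

```python
import numpy as np

def best_cutoff_and_value(p, base_pred, expected_true,
                          cutoffs=np.arange(0.2, 0.95, 0.05),
                          alt_values=np.arange(500, 3600, 100)):
    """Scan (cutoff, alternative value) pairs and return the pair with the lowest MAE."""
    best = (None, None, np.inf)
    for c in cutoffs:
        flagged = p > c
        for v in alt_values:
            pred = np.where(flagged, v, base_pred)
            mae = np.mean(np.abs(pred - expected_true))
            if mae < best[2]:
                best = (c, v, mae)
    return best  # (cutoff, alternative value, MAE)
```

The obvious caveat is that if p comes from a classifier that has already seen the rows it is scoring, this search will tune itself to overfit, which ties into the stacking discussion below.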

So anyway, that was my experience yesterday. I don't know if it's a simple mistake, an error in approach/logic/assumptions, or data peculiarity from the LB. In any case there is a lot of potential here so I'm hoping we can make headway with it!

Signing off for now, will probably resurface in a couple days. Glad to respond, etc, in the meantime.

mlandry22 commented 9 years ago

Yup, you're right on the approach. The best finish Thakur and I had together was one where we handled the problem like this. In that case, he built a classifier that returned amazingly accurate predictions of whether the value was other than 0, and I had a regression to figure out the number between 0 and 100. Knowing that he was going to send me only the records predicted above 0 (with an F1 of 0.99 or so, I think), I trained on only the subset of the original data that was above 0. It worked out great because his classifier was extremely precise, and the distribution it was built on carried over to the other days.

So what is going on here? First, I'm not positive. Everything seems good, especially the double-blind. It's possible that 0.95 isn't high enough. After all, our initial GBM tried to solve these as well, so it is jointly guessing on these and on the regulars, and deep XGBoost models are quite good at naturally separating the problem.

One thing is to ensure the blinds are working. As you say, it's just like stacking, so hopefully you're doing the stacking correctly. I imagine so, but just in case: we want to apply N-fold CV to the first model, and 10 is a good number. From each fold's model, we gather the predictions on the fold that wasn't used in training. So 10 folds give us ten 1/10 chunks that together provide predictions on the entire data set, but where the model used to make each point had never seen the point in question. That data set is the one to use for the predictions at the second level.

When doing it for the leaderboard, you originally had 10 models, but you don't want 10 predictions for each test record. So you fit the same "architecture" on the full data to get an 11th model; that is the one you use for the leaderboard. In noisy scenarios, I've run 30 bags for each of those, and this might warrant that. What that means is that we have 330 models: 30 models per fold, where each of the 30 is on a differently bagged version of that fold (but predicting on the same holdout 10%), and then 30 for the entire data set. That seems (and is) extreme, but it might be necessary to ensure this thing doesn't fit the overfit parts from the first-level models.
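
If it helps, a bare-bones version of that out-of-fold scheme (10 folds plus the 11th full-data model, with the 30-bag refinement left out for brevity) might look like this; X, y, and X_test are assumed to be numpy arrays of first-level features and labels.

```python
import numpy as np
from sklearn.model_selection import KFold
from xgboost import XGBClassifier

def oof_predictions(X, y, X_test, n_folds=10, seed=42):
    oof = np.zeros(len(X))
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for tr_idx, ho_idx in kf.split(X):
        m = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
        m.fit(X[tr_idx], y[tr_idx])
        # each point is predicted by a model that never saw it
        oof[ho_idx] = m.predict_proba(X[ho_idx])[:, 1]
    # the "11th" model: same architecture refit on all the data, used only for the test set
    full = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
    full.fit(X, y)
    return oof, full.predict_proba(X_test)[:, 1]

# the second-level cutoff/value search should be trained on `oof` only
```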

If we want to reverse course and go with something far simpler, then the data size is big enough that we can probably get away with a single model. What that means is that we just do a 90/10 or 80/20 split for that first level, and we use that model to predict on the 10 or 20 and also on the full set. Then we only have 10% or 20% to "train" the model selector, but for such a simple task, that certainly might be enough.
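
And the simpler single-split version, under the same assumptions about X, y, and X_test:

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# fit on 80%, tune the cutoff/alternative value on the held-out 20%,
# and reuse the same single model to score the test set
X_fit, X_sel, y_fit, y_sel = train_test_split(X, y, test_size=0.2, random_state=7)
clf = XGBClassifier(n_estimators=300, max_depth=5, learning_rate=0.05)
clf.fit(X_fit, y_fit)
p_sel = clf.predict_proba(X_sel)[:, 1]    # "train" the model selector on these only
p_test = clf.predict_proba(X_test)[:, 1]
```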

Nice work. Hopefully the infrastructure is good and we can spot some simple tweaks to get a good submission out of it.

JohnM-TX commented 9 years ago

I tried some of Thakur's ideas to find outliers, but possibly the features aren't strong enough or I'm not skilled enough with the tools. I ran k-means on top of t-SNE and t-SNE on top of k-means, but still no meaningful separation from what I can tell. [image: t-SNE projection (tsne2)]
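
For reference, a generic scikit-learn version of that pipeline (t-SNE embedding, then k-means on the 2-D coordinates) would look roughly like this; feats is assumed to be a per-Id feature table like the one above, and the subsample size, perplexity, and cluster count are arbitrary.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# scale and de-NaN the per-Id features; subsample because t-SNE is slow on large sets
X = StandardScaler().fit_transform(np.nan_to_num(feats.to_numpy()))[:20000]
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(emb)
# then compare cluster membership against the outlier labels to see whether any
# cluster concentrates the 2500-3500 cohort
```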