Open pflashgary opened 2 years ago
No, it does not at all. I looked at the 2009 and 2020 datasets and calculated the minority-to-majority class ratio for them. Whatever the visibility unit is, it has been categorized into five classes (0-4). I assumed class 0 is fog (the minority class) and 4 is non-fog (the majority class). The imbalance ratio is about 2% for the 2009 dataset and 5% for the 2020 one. As you said before, it is a SEVERE intrinsic imbalance, since fog is a rare phenomenon. We will have to employ imbalanced-learning techniques such as oversampling, cost-sensitive learning, threshold moving, etc. I read a bit about this last week; there are some simple solutions for handling such datasets, and I hope they work for our problem too.
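Just to make the ratio concrete, here is a minimal sketch of the calculation, with made-up class counts (the real counts come from the datasets, the numbers below are only illustrative):

```python
# Hypothetical class counts for classes 0-4; class 0 (fog) is assumed
# to be the minority, class 4 (non-fog) the majority.
counts = {0: 20, 1: 40, 2: 60, 3: 80, 4: 1000}

fog = counts[0]       # minority class count
non_fog = counts[4]   # majority class count

# Imbalance ratio = minority count / majority count
ratio = fog / non_fog
print(f"imbalance ratio: {ratio:.1%}")
```

With these illustrative counts the ratio comes out at 2%, in the same ballpark as the 2009 dataset.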
You are right, they haven't done any resampling. I can see they have stratified based on the prognosis period.
Imagine you are looking at day 20100102
at 00:00, 06:00, 12:00, 18:00. The forecasts of visibility come from:
| Datetime | Basetime | Prognosis |
|---|---|---|
| 2010010200 | 2010010118 | 6 |
| 2010010206 | 2010010200 | 6 |
| 2010010212 | 2010010206 | 6 |
| 2010010218 | 2010010212 | 6 |
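The relationship in the table is just `basetime + prognosis = datetime`, which is easy to sanity-check (assuming the timestamps are laid out as `YYYYMMDDHH`):

```python
from datetime import datetime, timedelta

# Rows from the table above: (Datetime, Basetime, prognosis in hours)
rows = [
    ("2010010200", "2010010118", 6),
    ("2010010206", "2010010200", 6),
    ("2010010212", "2010010206", 6),
    ("2010010218", "2010010212", 6),
]

fmt = "%Y%m%d%H"  # assumed timestamp layout: YYYYMMDDHH
consistent = all(
    datetime.strptime(base, fmt) + timedelta(hours=prog)
    == datetime.strptime(valid, fmt)
    for valid, base, prog in rows
)
print("basetime + prognosis == datetime for every row:", consistent)
```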
They have an input and a target for each prognosis (6, 12, 24). That means we are verifying the forecasts for the next 6 (or 12, or 24) hours.
What I'm alluding to is the dataset has 4 entries for each day of a year with no down/up sampling.
One could ask why stratify by the prognosis period? The reason is that we can loosely say the physics within each prognosis is similar. This avoids mixing forecasts for the next 6 hours with ones for the next 12 hours.
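A stratified split along those lines could be sketched like this; the sample list and 80/20 ratio are hypothetical, the point is only that each prognosis group is split separately so no group leaks into another:

```python
import random
from collections import Counter

# Hypothetical samples: (features_placeholder, prognosis_hours)
samples = [("x", p) for p in [6] * 100 + [12] * 100 + [24] * 100]

random.seed(0)
train, test = [], []
for prog in (6, 12, 24):
    # Split within each prognosis group, so train/test keep the same mix
    group = [s for s in samples if s[1] == prog]
    random.shuffle(group)
    cut = int(0.8 * len(group))  # 80/20 split per group
    train += group[:cut]
    test += group[cut:]

print("train:", Counter(p for _, p in train))
print("test:", Counter(p for _, p in test))
```

Every prognosis ends up with the same proportion in train and test, which is what stratification buys you.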
This is an awesome reference for dealing with imbalanced datasets.
Have a look at this link. Does it look like a balanced training dataset to you? I'm not sure what the VIS (visibility) unit is, but in most cases it is 10.0.