Open pflashgary opened 2 years ago
No, it does not at all. I looked at the 2009 and 2020 datasets and calculated the minority-to-majority class ratio for them. Whatever the visibility unit is, it has been categorized into five classes (0-4). I assumed class 0 is fog (the minority class) and 4 is non-fog (the majority class). The imbalance ratio is about 2% for the 2009 dataset and 5% for the 2020 one. As you said before, it is a SEVERE intrinsic imbalance, since fog is a rare phenomenon. We will have to employ imbalanced-learning techniques such as oversampling, cost-sensitive learning, threshold moving, etc. I read a bit about this last week; there are some simple solutions for handling such datasets, and I hope they work for our problem too.
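Just to make the ratio concrete, here is a minimal sketch of the calculation, with made-up class counts (the real counts come from the datasets, the numbers below are only illustrative):

```python
# Hypothetical class counts for classes 0-4; class 0 (fog) is assumed
# to be the minority, class 4 (non-fog) the majority.
counts = {0: 20, 1: 40, 2: 60, 3: 80, 4: 1000}

fog = counts[0]       # minority class count
non_fog = counts[4]   # majority class count

# Imbalance ratio = minority count / majority count
ratio = fog / non_fog
print(f"imbalance ratio: {ratio:.1%}")
```

With these illustrative counts the ratio comes out at 2%, in the same ballpark as the 2009 dataset.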
You are right, they haven't done any resampling. I can see they have stratified based on the prognosis period.
Imagine you are looking at day 20100102
at 00:00, 06:00, 12:00, 18:00. The forecasts of visibility come from:
| Datetime | Basetime | Prognosis |
|---|---|---|
| 2010010200 | 2010010118 | 6 |
| 2010010206 | 2010010200 | 6 |
| 2010010212 | 2010010206 | 6 |
| 2010010218 | 2010010212 | 6 |
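The relationship in the table is just `basetime + prognosis = datetime`, which is easy to sanity-check (assuming the timestamps are laid out as `YYYYMMDDHH`):

```python
from datetime import datetime, timedelta

# Rows from the table above: (Datetime, Basetime, prognosis in hours)
rows = [
    ("2010010200", "2010010118", 6),
    ("2010010206", "2010010200", 6),
    ("2010010212", "2010010206", 6),
    ("2010010218", "2010010212", 6),
]

fmt = "%Y%m%d%H"  # assumed timestamp layout: YYYYMMDDHH
consistent = all(
    datetime.strptime(base, fmt) + timedelta(hours=prog)
    == datetime.strptime(valid, fmt)
    for valid, base, prog in rows
)
print("basetime + prognosis == datetime for every row:", consistent)
```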
They have an input and a target for each prognosis (6, 12, 24). That means we are verifying the forecasts for the next 6 (or 12, or 24) hours.
What I'm alluding to is the dataset has 4 entries for each day of a year with no down/up sampling.
One could ask why stratify by the prognosis period? The reason is that we can loosely say the physics within each prognosis is similar. This avoids mixing forecasts for the next 6 hours with ones for the next 12 hours.
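A stratified split along those lines could be sketched like this; the sample list and 80/20 ratio are hypothetical, the point is only that each prognosis group is split separately so no group leaks into another:

```python
import random
from collections import Counter

# Hypothetical samples: (features_placeholder, prognosis_hours)
samples = [("x", p) for p in [6] * 100 + [12] * 100 + [24] * 100]

random.seed(0)
train, test = [], []
for prog in (6, 12, 24):
    # Split within each prognosis group, so train/test keep the same mix
    group = [s for s in samples if s[1] == prog]
    random.shuffle(group)
    cut = int(0.8 * len(group))  # 80/20 split per group
    train += group[:cut]
    test += group[cut:]

print("train:", Counter(p for _, p in train))
print("test:", Counter(p for _, p in test))
```

Every prognosis ends up with the same proportion in train and test, which is what stratification buys you.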
This is an awesome reference for dealing with imbalanced datasets.
Have a look at this link. Does it look like a balanced training dataset to you? I'm not sure what the VIS (visibility) unit is, but in most cases it is 10.0.