vc1492a / tidd

An approach for detecting tsunamis using anomaly detection on sTec d/dt data from orbiting GPS satellites.

Address class imbalance #72

Closed vc1492a closed 3 years ago

vc1492a commented 3 years ago

One of the classic problems in machine learning is class imbalance, where one class has many more observations than the other. This is common in anomaly detection problems.

We see that issue in our dataset. Some strategies are to oversample the minority class (anomalies), undersample the majority class (normal), or to address the class imbalance in some way during model training (through class weights or some other means).
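To make the class-weight option concrete, here is a minimal sketch of the common inverse-frequency heuristic, `n_samples / (n_classes * class_count)`. The label names and the 99:1 split are made up for illustration and are not our actual dataset totals.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 99 normal observations for every anomalous one.
labels = ["normal"] * 99 + ["anomalous"] * 1
weights = balanced_class_weights(labels)

for cls, weight in sorted(weights.items()):
    print(f"{cls}: {weight:.3f}")
```

With a split like this the rare class gets a weight roughly 100 times larger than the common one, which is part of why training can become sensitive to how the weights are set.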

One of the benefits of balancing the data via under- and/or over-sampling is that it improves the interpretability of the accuracy metric - no one cares if your accuracy is 99% if 99/100 of the observations are normal (e.g. you could always predict the majority class and have a high accuracy). I also imagine that adjusting the weights can introduce a ton of variability in the model training process and results, and in more unpredictable ways than changing the data itself.
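A quick sketch of the accuracy problem, using made-up labels (0 = normal, 1 = anomaly) rather than our data: a "model" that always predicts the majority class scores 99% accuracy while catching none of the anomalies.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    actual = sum(t == positive for t in y_true)
    hits = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    return hits / actual if actual else 0.0


# Hypothetical data: 99 normal observations and 1 anomaly.
y_true = [0] * 99 + [1]
# Baseline that always predicts the majority (normal) class.
y_pred = [0] * 100

baseline_accuracy = accuracy(y_true, y_pred)
baseline_recall = recall(y_true, y_pred)
print(f"accuracy={baseline_accuracy:.2f}, anomaly recall={baseline_recall:.2f}")
```

Looking at recall on the anomaly class alongside accuracy makes this failure mode visible even before the data is rebalanced.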

vc1492a commented 3 years ago

@hamlinliu17 let's work on this one together. After I upload the data we will use in the experiments to AWS S3 in #65, we can figure out how to best handle the class imbalance. Let me know if you have any ideas that may come up in any reading you do etc.

vc1492a commented 3 years ago

@hamlinliu17 based on our conversations and some I have had with a few others, I believe under-sampling the data - or reducing the size of the majority class - would be the best approach.

I'll create a notebook that captures all of the file paths and totals the number of images we have for each class to use in training (and testing). Based on what those totals look like, we can decide how aggressive our undersampling should be. After that, we'll mark a certain percentage of file paths for deletion and delete those files from a duplicated dataset to create the new one.
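Roughly, the steps could look like the sketch below. It is not the notebook itself: the directory layout (class name as the parent folder), the file names, the helper names, and the 50% drop fraction are all assumptions for illustration.

```python
import random
from collections import defaultdict
from pathlib import PurePosixPath


def group_paths_by_class(paths):
    """Group file paths by class, assuming the parent folder name is the class."""
    groups = defaultdict(list)
    for path in paths:
        groups[PurePosixPath(path).parent.name].append(path)
    return dict(groups)


def select_paths_to_delete(groups, majority_class, drop_fraction, seed=0):
    """Randomly pick a fraction of the majority-class paths to remove."""
    candidates = sorted(groups[majority_class])
    n_drop = int(len(candidates) * drop_fraction)
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    return sorted(rng.sample(candidates, n_drop))


# Hypothetical file listing: 90 normal images and 10 anomalous images.
paths = [f"data/train/normal/img_{i}.jpg" for i in range(90)]
paths += [f"data/train/anomalous/img_{i}.jpg" for i in range(10)]

groups = group_paths_by_class(paths)
totals = {cls: len(items) for cls, items in groups.items()}
print(totals)

to_delete = select_paths_to_delete(groups, "normal", drop_fraction=0.5)
kept = sorted(set(paths) - set(to_delete))
print(f"deleting {len(to_delete)} paths, keeping {len(kept)}")
```

The sketch only selects paths. The actual deletion would be run against the duplicated copy of the dataset (for example with `pathlib.Path.unlink`) so the original stays intact.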

I'll get started by creating a local copy of the modeling data to be balanced and will keep you updated on progress.

vc1492a commented 3 years ago

This will be covered in the feature/validate_data branch and later pulled into dev.

vc1492a commented 3 years ago

I created a balanced dataset! The performance is already better, and while we won't be able to do a 50% split between each of the classes (since anomalies are exceedingly rare), this puts us on a path towards better performance.

Once I push the large dataset to S3, I'll do the same with the balanced dataset and will commit my updates.

vc1492a commented 3 years ago

Balanced dataset and unbalanced dataset tar files pushed to AWS S3. Updated instructions are now in the feature/validate_data branch. @hamlinliu17, once you validate the data download works in #79 feel free to close this issue. Thanks!

vc1492a commented 3 years ago

Hey @hamlinliu17, going to close this issue since we have both achieved pretty good performance with the balanced dataset. Just another note: make sure to use only the training data in model training (and not the validation data; I had it incorrectly specified in the notebook at first).