nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0

Resampling with label uniformity and user uniformity #39

Open qa-aleksejs-fomins opened 1 year ago

qa-aleksejs-fomins commented 1 year ago

Hi,

I have a regression problem in which the label is a single floating-point number within a well-defined range (e.g. [0, 1]). The label distribution is non-uniform: there is markedly less data at the edges, and also in the very middle of the range. So far, this is a classic use case for SMOGN. However, my data comes from multiple users, and the amount of data per user is also heavily imbalanced. I would like all users to be well represented in the training set, in addition to balancing the label distribution. In other words, I would like the algorithm to know which user each sample belongs to, so that it undersamples users with a lot of data and preserves or oversamples users with little data. Is this currently possible? Do you have any suggestions?
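
To make the request concrete, here is a minimal sketch of the kind of joint balancing described above. It does not use smogn and generates no synthetic samples: it only resamples existing rows with replacement, using pandas. The column names `user` and `y`, the helper names `user_label_weights` and `resample_balanced`, and the choice of equal-width label bins are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def user_label_weights(df, label_col="y", user_col="user", n_bins=10):
    """Per-row sampling weights that give every non-empty
    (user, label-bin) cell the same total weight."""
    # Equal-width bins over the observed label range (an assumption;
    # quantile bins or a relevance function could be used instead).
    tmp = df.assign(_bin=pd.cut(df[label_col], bins=n_bins, labels=False))
    # Number of rows that share this row's user and label bin.
    cell_size = tmp.groupby([user_col, "_bin"])[label_col].transform("size")
    w = 1.0 / cell_size
    return w / w.sum()


def resample_balanced(df, n_samples, seed=0, **kwargs):
    """Draw n_samples rows with replacement using the weights above."""
    w = user_label_weights(df, **kwargs)
    # Pass a plain array so sampling does not depend on index alignment.
    return df.sample(n=n_samples, replace=True,
                     weights=w.to_numpy(), random_state=seed)


if __name__ == "__main__":
    # Hypothetical toy data: user "a" has nine times more rows than "b".
    rng = np.random.default_rng(0)
    toy = pd.DataFrame({
        "user": ["a"] * 900 + ["b"] * 100,
        "y": rng.beta(2.0, 2.0, size=1000),
    })
    balanced = resample_balanced(toy, n_samples=1000, n_bins=5)
    print(balanced["user"].value_counts())
```

Because rows in crowded cells receive small weights and rows in sparse cells receive large ones, users with a lot of data are undersampled in expectation and users with little data are oversampled by duplication. Interpolating new samples, as SMOGN does, would need an extra step, for example restricting neighbours to the same user.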