Hi,
I have a regression problem, so the label is a single floating-point number within a well-defined range (e.g. [0, 1]). The label distribution is non-uniform: there is markedly less data at the edges, and also in the very middle of the range. So far, this is a classic use case for SMOGN. However, my data comes from multiple users, and the amount of data per user is also hugely imbalanced. In addition to balancing the label distribution, I would like all users to be well represented in the training set. Ideally, the algorithm would know which user each sample comes from, undersampling users with a lot of data and preserving or oversampling users with little data. Is this currently possible? Do you have suggestions?
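To make the request concrete, here is a minimal sketch of the kind of user-aware resampling I have in mind. It is not SMOGN and does not use the smogn package: it is plain inverse-frequency weighted resampling with replacement, which could in principle be run as a pre-processing step. The function name `user_aware_resample`, the row layout (user id first, label last) and the choice of 10 equal-width label bins on [0, 1] are all my own assumptions for illustration.

```python
import random
from collections import Counter


def user_aware_resample(rows, n_samples, n_bins=10, seed=0):
    """Resample rows so that rare users and rare label bins are drawn more often.

    Assumed layout (hypothetical): each row is a tuple whose first element
    is a user id and whose last element is a label in [0, 1].
    """
    def bin_of(label):
        # Equal-width bins on [0, 1]; a label of exactly 1.0 goes in the last bin.
        return min(int(label * n_bins), n_bins - 1)

    user_counts = Counter(row[0] for row in rows)
    bin_counts = Counter(bin_of(row[-1]) for row in rows)

    # Weight each row by the inverse of (rows from its user) * (rows in its label bin).
    # This is a heuristic: it balances both marginals only approximately,
    # and exactly only when labels are distributed the same way for every user.
    weights = [
        1.0 / (user_counts[row[0]] * bin_counts[bin_of(row[-1])])
        for row in rows
    ]

    rng = random.Random(seed)
    # Sampling with replacement: heavy users are undersampled,
    # light users and sparse label regions are oversampled by duplication.
    return rng.choices(rows, weights=weights, k=n_samples)
```

Calling `user_aware_resample(rows, n_samples=len(rows))` returns a list of the same size as the input. Unlike SMOGN, this only duplicates existing rows and does not synthesize new ones, so what I am really asking is whether something similar could be combined with SMOGN's interpolation and noise steps.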