nickkunz / smogn

Synthetic Minority Over-Sampling Technique for Regression
https://pypi.org/project/smogn
GNU General Public License v3.0
319 stars 78 forks

Additional variables #4

Open neverfox opened 4 years ago

neverfox commented 4 years ago

I was thinking it would be useful to be able to specify variables in the data that are neither the target variable nor variables used to perform the resampling, and that would simply be passed through. Practical use cases include models that use offsets, or data that contains IDs etc. that might be useful for building cross-validation folds with matching unsampled data. Thoughts?

nickkunz commented 4 years ago

Hello! Thank you for the feedback. I think that is a great idea. However, what could be done in this case is to simply specify a feature which contains IDs as a categorical data type. In that case, a random ID would be selected from the existing ones and used for generating synthetic observations. Am I interpreting your suggestion correctly?

shaddyab commented 4 years ago

If a random ID is going to be generated anyway, you can pass the algorithm a subset of the data that excludes the IDs, then add random IDs to the output. This is a simple workaround.
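A minimal sketch of this workaround with pandas follows. The helper name `resample_without_ids`, the column names, and the placeholder resampler are all hypothetical and only there for illustration. The `resample` argument stands in for the real over-sampler, so the snippet runs without smogn installed.

```python
import numpy as np
import pandas as pd


def resample_without_ids(data, id_col, resample, seed=0):
    """Resample every column except `id_col`, then attach randomly drawn IDs.

    `resample` is any callable that maps a DataFrame to a resampled
    DataFrame; it is a stand-in for the real over-sampler (an assumption).
    """
    # Hide the ID column from the resampler.
    features = data.drop(columns=[id_col])
    resampled = resample(features).reset_index(drop=True)

    # Draw one existing ID per output row, with replacement.
    rng = np.random.default_rng(seed)
    resampled[id_col] = rng.choice(data[id_col].to_numpy(), size=len(resampled))
    return resampled


# Toy data and a placeholder resampler that just repeats every row twice.
df = pd.DataFrame(
    {"id": [101, 102, 103], "x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]}
)


def duplicate_rows(frame):
    return pd.concat([frame, frame], ignore_index=True)


out = resample_without_ids(df, id_col="id", resample=duplicate_rows)
```

With smogn, `resample` could be something like `lambda frame: smogn.smoter(data=frame, y="y")`, assuming that call signature matches the installed version. Note that the IDs attached this way are random, so they do not link a synthetic row back to the record it was generated from.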

neverfox commented 4 years ago

@nickkunz Actually, what I'm thinking of here is unmodified data, not random selection, i.e. keeping exactly the values of the seed example on all of its synthetics. To take the ID example, I don't want just any ID but rather the seed ID associated with each synthetic, so that I can, say, link it back to other data. Perhaps I don't fully grasp the subtleties of the algorithm, but I believe every synthetic record comes from no more than one real record at the end of the day, right? In short, I'd like to specify which columns should not be modified when generating synthetic records.
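To make the request concrete, here is a rough sketch of what "pass-through" columns could look like. It is not SMOGN and not smogn's API: it is a toy interpolation over-sampler in which the partner record is picked at random rather than among nearest neighbours, and there is no relevance function or Gaussian-noise step. The function name and the `passthrough_cols` parameter are hypothetical.

```python
import random


def oversample_with_passthrough(rows, feature_cols, passthrough_cols, n_new, seed=0):
    """Create `n_new` synthetic rows from `rows` (a list of dicts).

    Feature columns are interpolated between a seed row and a partner row.
    Pass-through columns are copied unchanged from the seed row.
    Requires at least two input rows.
    """
    rng = random.Random(seed)
    synthetics = []
    for _ in range(n_new):
        # Toy simplification: the partner is a random other row, not a neighbour.
        seed_row, partner = rng.sample(rows, 2)
        t = rng.random()  # interpolation weight in [0, 1)

        new_row = {}
        for col in feature_cols:
            new_row[col] = seed_row[col] + t * (partner[col] - seed_row[col])
        for col in passthrough_cols:
            new_row[col] = seed_row[col]  # kept exactly as in the seed row
        synthetics.append(new_row)
    return synthetics


rows = [
    {"id": 101, "x": 1.0, "y": 10.0},
    {"id": 102, "x": 2.0, "y": 20.0},
    {"id": 103, "x": 3.0, "y": 30.0},
]
synthetic = oversample_with_passthrough(
    rows, feature_cols=["x", "y"], passthrough_cols=["id"], n_new=5
)
```

Every synthetic row then carries the `id` of the seed it was built from, so it can be joined back to other tables or kept in the same cross-validation fold as its seed.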

nickkunz commented 4 years ago

@neverfox Oh! I understand what you mean now. As of now, every synthetic ID would be selected at random from the existing (actual) IDs. However, what you're suggesting is to map each synthetic ID to its corresponding interpolated value. Am I understanding you correctly?