Open · neverfox opened this issue 4 years ago

I was thinking it would be useful to be able to specify variables in the data that are neither the target variable nor the variables used to perform the resampling, and that would simply be passed through unmodified. Practical use cases include models that use offsets, or data that contains IDs and similar fields that are useful for building cross-validation folds with matching unsampled data. Thoughts?
Hello! Thank you for the feedback. I think that is a great idea. However, what could be done in this case is to simply specify a feature which contains IDs as a categorical data type. In that case, a random ID would be selected from the existing ones and used for generating synthetic observations. Am I interpreting your suggestion correctly?
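For example, something along these lines might work (a rough sketch only; `policy_id`, `target`, and `data.csv` are placeholder names, and it assumes non-numeric columns get picked up as nominal features):

```python
import pandas as pd
import smogn

## hypothetical data: 'policy_id' is an identifier, 'target' is the response
df = pd.read_csv('data.csv')

## cast the ID column to a string dtype so that it is treated as a
## nominal / categorical feature rather than a numeric one
df['policy_id'] = df['policy_id'].astype(str)

## synthetic rows then receive one of the existing IDs, selected at random
## when the nominal features are generated
df_smogn = smogn.smoter(data=df, y='target')
```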
If a random ID is going to be generated anyway, you can pass a subset of the data that excludes the ID column to the algorithm and then add random IDs to the output. This is a simple workaround.
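Concretely, the workaround could look something like this (a sketch only; `policy_id`, `target`, and `data.csv` are placeholder names):

```python
import numpy as np
import pandas as pd
import smogn

df = pd.read_csv('data.csv')

## resample on everything except the ID column
df_smogn = smogn.smoter(data=df.drop(columns=['policy_id']), y='target')

## re-attach IDs drawn at random (with replacement) from the original ones
rng = np.random.default_rng(0)
df_smogn['policy_id'] = rng.choice(df['policy_id'].to_numpy(), size=len(df_smogn))
```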
@nickkunz Actually, what I'm thinking of here is unmodified data, not random selection, i.e. keeping exactly the values of the seed example on all of its synthetics. To take the ID example, I don't want just any ID, but rather the seed ID associated with each synthetic, so that I can, say, link it back to other data. Perhaps I don't fully grasp the subtleties of the algorithm, but I believe every synthetic record comes from no more than one real record at the end of the day, right? In short, I'd like to specify which columns should not be modified when generating synthetic records.
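To make the distinction concrete, here is a toy illustration (not SMOGN's actual interpolation step, just a stand-in) of what "pass-through" would mean: the numeric features are interpolated between the seed record and a neighbor, while the pass-through column keeps the seed's value verbatim:

```python
import numpy as np
import pandas as pd

## two "real" records; 'policy_id' is the column that should pass through
seed     = pd.Series({'policy_id': 'A-17', 'x1': 1.0, 'x2': 10.0, 'y': 0.5})
neighbor = pd.Series({'policy_id': 'B-03', 'x1': 3.0, 'x2': 30.0, 'y': 0.9})

## crude linear interpolation for the numeric columns (stand-in for the real
## interpolation that generates a synthetic observation)
gap = np.random.default_rng(1).uniform()
synthetic = seed.copy()
for col in ['x1', 'x2', 'y']:
    synthetic[col] = seed[col] + gap * (neighbor[col] - seed[col])

## requested behavior: the pass-through column keeps the seed's exact value,
## so the synthetic record can be linked back to its seed record
synthetic['policy_id'] = seed['policy_id']
print(synthetic)
```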
@neverfox Oh! I understand what you mean now. As of now, every synthetic ID would be selected at random from the existing / actual IDs. What you're suggesting instead is to carry over, for each synthetic observation, the ID of the seed record it was interpolated from. Am I understanding you correctly?