question about new/replicated UBL data and range of creation area

AndreMikulec commented 5 years ago

Hi,

I have a question. When I generate new or replicated data using UBL functions, is any of that data ever created outside the range of the original data?

For example, if my input data consists of two values, 1 and 10, and I use a UBL function such as SmoteRegress, can any new data be created outside the range 1 to 10, say at 0.5 or 10.5?

This is important because in a time series, for example, the train, test, and validation data must not leak into each other's ranges. If the train data ends at 10 and the test data begins after it, a newly generated train case at 10.5 would have leaked into the test range.

If new data is created at 10.5, is there an easy, early way to detect (or better, prevent) this?

paobranco commented 5 years ago

Hi,

When you use the UBL functions in regression problems, there is one that can produce new examples outside the target variable range: GaussNoiseRegress. I currently have no mechanism for preventing this. All other functions for regression will not generate new cases outside the target variable range. This includes the functions that generate synthetic cases, such as SmoteRegress.
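Until such a mechanism exists, one way to detect the problem is to compare the generated target values against the original minimum and maximum after resampling. The following is only a minimal sketch of that check, written in Python for illustration. The helper names `out_of_range` and `keep_in_range` are hypothetical and are not part of UBL; the same comparison can be done in R using `range()` on the original target column.

```python
def out_of_range(original, generated):
    """Return the generated values that fall outside [min(original), max(original)]."""
    lo, hi = min(original), max(original)
    return [v for v in generated if v < lo or v > hi]


def keep_in_range(original, generated):
    """Return only the generated values that lie inside the original range."""
    lo, hi = min(original), max(original)
    return [v for v in generated if lo <= v <= hi]


# Original target values from the example in this thread: 1 and 10.
original = [1.0, 10.0]
# Hypothetical output of a noise-based resampler (illustrative values only).
generated = [0.5, 3.2, 10.5, 7.0]

print(out_of_range(original, generated))   # [0.5, 10.5]
print(keep_in_range(original, generated))  # [3.2, 7.0]
```

Filtering like this discards the offending cases rather than preventing them, so the resampled set may end up slightly smaller than requested.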

Thank you for mentioning this. This can be an important issue for certain applications and I will try to include such a mechanism in the new UBL version.

AndreMikulec commented 5 years ago

@paobranco,

Thanks.

paobranco / UBL

question about new/replicated UBL data and range of creation area #3