scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add missing values and categorical features when generating datasets #28952

Open lcrmorin opened 6 months ago

lcrmorin commented 6 months ago

Describe the workflow you want to enable

I often use random datasets (typically generated with make_classification). However, I often find myself having to add more realistic features to the dataset, such as missing values and categorical features.

Describe your proposed solution

Introduce parameters to generate missing data (proportion of missingness; type of missingness: at random or not at random). Introduce parameters to generate categorical features (number of features; distribution across categories: even, uneven, or Pareto).
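To make the request concrete, here is a minimal sketch of what I currently do by hand. The helper `make_messy_classification` and all of its parameter names are hypothetical, not an existing or proposed scikit-learn API. It only covers values missing completely at random, and the "pareto" option is approximated with power-law category weights, which is an assumption on my side.

```python
import numpy as np
from sklearn.datasets import make_classification


def make_messy_classification(n_samples=200, n_features=5, missing_rate=0.1,
                              n_categorical=2, n_categories=4,
                              category_dist="even", random_state=0):
    """Hypothetical helper (not part of scikit-learn)."""
    rng = np.random.default_rng(random_state)
    X, y = make_classification(n_samples=n_samples, n_features=n_features,
                               random_state=random_state)

    # Missing completely at random: each cell is dropped independently.
    mask = rng.random(X.shape) < missing_rate
    X = X.copy()
    X[mask] = np.nan

    # Category probabilities: uniform, or heavy-tailed power-law weights
    # used here as a stand-in for a Pareto-like distribution (assumption).
    if category_dist == "even":
        p = np.full(n_categories, 1.0 / n_categories)
    elif category_dist == "pareto":
        w = 1.0 / np.arange(1, n_categories + 1) ** 1.5
        p = w / w.sum()
    else:
        raise ValueError(f"unknown category_dist: {category_dist!r}")

    # Integer-coded categorical columns, independent of the target.
    X_cat = rng.choice(n_categories, size=(n_samples, n_categorical), p=p)
    return X, X_cat, y
```

Calling `make_messy_classification(category_dist="pareto")` returns the numeric matrix with NaNs, the integer-coded categorical columns, and the target. A "not at random" mechanism would need the mask to depend on the feature values or on the target, which this sketch does not attempt.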

Describe alternatives you've considered, if relevant

I usually handle this by hand.

Additional context

Could be used to illustrate imputing techniques, encoding techniques.

oasidorshin commented 6 months ago

@lcrmorin This would be great for testing! I would also suggest adding infinities as possible values, because they also break things quite often. Also, if the values are randomly generated, it would be good to make sure that at least one NaN and one inf value are always included.
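Something along these lines is what I have in mind. This is only a rough sketch: `inject_nan_inf` and its parameters are made-up names, not an existing scikit-learn function. The point is that positions are sampled without replacement and the counts are floored at one, so both a NaN and an inf are always present.

```python
import numpy as np


def inject_nan_inf(X, nan_rate=0.05, inf_rate=0.01, random_state=0):
    """Hypothetical helper: return a copy of X with NaN and +/-inf cells."""
    rng = np.random.default_rng(random_state)
    out = np.array(X, dtype=float, order="C")  # C-contiguous copy
    flat = out.reshape(-1)                     # view into the copy
    n = flat.size
    if n < 2:
        raise ValueError("need at least two cells to place one NaN and one inf")

    # At least one of each, and never more cells than are available.
    n_nan = min(max(1, int(round(nan_rate * n))), n - 1)
    n_inf = min(max(1, int(round(inf_rate * n))), n - n_nan)

    # Distinct positions, so an inf never overwrites a NaN.
    idx = rng.choice(n, size=n_nan + n_inf, replace=False)
    flat[idx[:n_nan]] = np.nan
    flat[idx[n_nan:]] = rng.choice([-1.0, 1.0], size=n_inf) * np.inf
    return out
```

Even with `nan_rate=0` and `inf_rate=0`, the output contains exactly one NaN and one infinite value, which is the guarantee that matters for tests.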

AK3847 commented 6 months ago

@lcrmorin I suggest adding a noise function or something similar that generates structured randomness, so that the data has some coherent structure rather than being purely pseudo-random. Perhaps something like Perlin noise?
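To illustrate the idea, here is a small sketch of 1D value noise, which is a simpler relative of Perlin noise and not Perlin noise itself. The function name `value_noise_1d` and its parameters are hypothetical. Random values are drawn at a few lattice points and blended with a smoothstep curve, so neighbouring samples are correlated instead of independent.

```python
import numpy as np


def value_noise_1d(n_samples, n_knots=8, random_state=0):
    """Hypothetical helper: smooth 1D value noise (not true Perlin noise)."""
    rng = np.random.default_rng(random_state)
    knots = rng.standard_normal(n_knots + 1)   # random values on a lattice

    # Sample positions in [0, n_knots), so i + 1 is always a valid index.
    t = np.linspace(0, n_knots, n_samples, endpoint=False)
    i = t.astype(int)                          # left lattice point
    f = t - i                                  # fractional position in the cell
    s = f * f * (3 - 2 * f)                    # smoothstep blending weight

    return knots[i] * (1 - s) + knots[i + 1] * s
```

A feature built this way varies smoothly along the sample index, which may be closer to what real measurements look like than independent Gaussian draws.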

glemaitre commented 5 months ago

Regarding the missing values I recall the following issues/PRs: #6284 / #7084. It seems that the consensus was to have something similar to the ampute R package.

I recall an almost similar discussion for categorical features, but I could not find it. For sure, it would be handy to have those two parameters, even though we could limit the complexity (e.g. only have a single missingness pattern).

glemaitre commented 5 months ago

Regarding the categorical features, we have the following related issue: #12433

IppotisTheKing commented 1 week ago

I will take a look at it and try to implement this feature on my side, exploring different ways to incorporate NaN and inf values.