soma-smart / Fakelake

Generate massive fake datasets for your datalake, fast. By SOMA
https://soma-smart.github.io/Fakelake/
MIT License

[Feature] Idempotence - Generate random dataset deterministically #47

Open bhagenbourger opened 3 months ago

bhagenbourger commented 3 months ago

I propose adding a feature that makes it possible to generate exactly the same output across multiple runs.

To make generation deterministic, I suggest adding a seed and deriving all values from it. When the same seed is passed as a parameter to fakelake, the output will be the same.

For example:

fakelake generate --seed xxxxxx path/to/schema.yaml

If no seed is passed as a parameter, a random seed is generated. After the file is generated, the seed that was used is printed.

This feature makes it possible to generate the same dataset in different formats (CSV and Parquet, for example). It is also easier to share a seed than a full dataset when you want to reproduce something in another environment.
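To illustrate the idea, here is a minimal sketch in Python. It is not Fakelake code. The names resolve_seed, generate_rows, to_csv and to_json are hypothetical, and JSON stands in for Parquet so the example needs only the standard library. It shows three points from the proposal: a seed is drawn when none is given, all values come from a generator seeded with it, and the seed is printed at the end.

```python
import csv
import io
import json
import random
import secrets


def resolve_seed(seed=None):
    """Return the given seed, or draw a fresh random one when none is passed."""
    return seed if seed is not None else secrets.randbits(64)


def generate_rows(seed, count):
    """Produce `count` rows using only a generator seeded with `seed`."""
    rng = random.Random(seed)  # private generator, no shared global state
    return [
        {"id": i, "score": rng.randint(0, 100), "code": rng.choice("ABCD")}
        for i in range(count)
    ]


def to_csv(rows):
    """Render rows as CSV text (hypothetical output format #1)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "score", "code"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Render rows as JSON text (stand-in for a second format such as Parquet)."""
    return json.dumps(rows)


seed = resolve_seed()          # no seed passed, so one is generated
rows = generate_rows(seed, 5)
csv_text = to_csv(rows)
json_text = to_json(rows)
print(f"seed used: {seed}")    # report the seed so the run can be reproduced
```

Calling generate_rows again with the printed seed returns the same rows, so both formats can be regenerated from the seed alone. This sketch is single-threaded, so it does not deal with generating values in parallel.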

vianneybacoup commented 3 months ago

The only concern I have for now is that Parquet is generated with threads. So either:

bhagenbourger commented 3 months ago

Thank you for this information, I will think about it.