jqnatividad / qsv

CSVs sliced, diced & analyzed.
The Unlicense
2.34k stars 66 forks

`synthesize` command: schema-informed synthetic data generator #235

Open jqnatividad opened 2 years ago

jqnatividad commented 2 years ago

Using fake, scan the description fields of a schema-generated JSON Schema for faker keywords, and use them to create more realistic fake test data.
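A minimal sketch of the keyword scan, in Python with only the standard library so it stays self-contained. Everything named here is an assumption for illustration: the `GENERATORS` table, the keywords in it, the sample values, and the convention that a keyword simply appears somewhere in a property's `description`. None of it is qsv's or fake's actual API.

```python
import random

# Hypothetical keyword -> generator table. The real keyword set would come
# from whichever faker library the command ends up using.
GENERATORS = {
    "first_name": lambda rng: rng.choice(["Ana", "Ben", "Chloe"]),
    "city": lambda rng: rng.choice(["Manila", "Austin", "Oslo"]),
    "email": lambda rng: f"user{rng.randint(1, 9999)}@example.com",
}

def pick_generator(description):
    """Return the generator for the first known keyword in a description, else None."""
    text = description.lower()
    for keyword, generator in GENERATORS.items():
        if keyword in text:
            return generator
    return None

def synthesize_row(schema, rng):
    """Build one synthetic row from the 'properties' of a JSON Schema dict."""
    row = {}
    for name, prop in schema["properties"].items():
        generator = pick_generator(prop.get("description", ""))
        # Fields without a recognized keyword are left empty in this sketch.
        row[name] = generator(rng) if generator else ""
    return row
```

Calling `synthesize_row(schema, random.Random(42))` on a schema whose `town` property has the description "City where the customer lives" would fill `town` from the hypothetical city list.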

qsv already has the generate command, but it doesn't really generate realistic test data. It generates random data informed by profiling a training CSV, and because each value is randomly generated from the training profile, it's also not very performant.

When enums are specified for a field, use the enums instead.
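The enum rule could look something like the sketch below, again in stdlib Python. The `value_for` helper and its `fallback` parameter are hypothetical names; the only thing taken from JSON Schema itself is the `enum` keyword, which lists a field's allowed values.

```python
import random

def value_for(prop, rng, fallback):
    """Pick a value for one schema property.

    If the property declares a non-empty `enum`, draw from it; otherwise
    defer to `fallback`, a hypothetical generator taking the RNG.
    """
    allowed = prop.get("enum")
    if allowed:
        return rng.choice(allowed)
    return fallback(rng)
```

With this ordering, a field such as `{"enum": ["open", "closed"]}` would only ever receive one of its declared values, regardless of what its description says.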

github-actions[bot] commented 2 years ago

Stale issue message

jqnatividad commented 1 year ago

#902 sets the stage to revisit this. Once fake is done, we can then remove the generate command, which is just not performant enough and tends to generate gobbledygook test data anyway.

jqnatividad commented 1 year ago

We should also use frequency when generating "fake" data so that it mirrors the training data more closely.
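One way to read this is frequency-weighted sampling: draw each synthetic value with probability proportional to how often it appeared in the training column. The sketch below assumes the counts are computed in memory with `collections.Counter` as a stand-in for a frequency table; `sample_like` is a hypothetical name, not an existing qsv function.

```python
import random
from collections import Counter

def sample_like(training_values, n, rng):
    """Draw n values whose distribution mirrors the training column."""
    counts = Counter(training_values)          # value -> number of occurrences
    values = list(counts)
    weights = [counts[v] for v in values]      # weight each value by its count
    return rng.choices(values, weights=weights, k=n)
```

For a training column that is 90% "a" and 10% "b", a large sample from `sample_like` should show roughly the same split.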

jqnatividad commented 9 months ago

scheduling for 0.117.0 release

jqnatividad commented 6 months ago

Removing generate command even before fake is done, as generate is unmaintained and has old dependencies weighing down qsv.

jqnatividad commented 6 months ago

Rather than fake, with all its negative connotations, name the command synthesize...