jqnatividad / qsv

CSVs sliced, diced & analyzed.

The Unlicense

`synthesize` command: schema-informed synthetic data generator #235

Open jqnatividad opened 2 years ago

jqnatividad commented 2 years ago

Using `fake`, and scanning for faker keywords in the descriptions of a `schema`-generated JSON Schema, create more realistic fake test data.

qsv already has the `generate` command, but it doesn't really generate realistic test data. It generates random data informed by profiling a training CSV, and because each value is randomly generated from the training profile, it's not as performant.

When enums are specified for a field, use the enums instead.
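To make the proposal concrete, here is a minimal sketch of the idea in Python, not qsv's implementation. It assumes each JSON Schema property may carry a keyword somewhere in its `description`; the keyword names (`first_name`, `city`, `integer`), the `GENERATORS` table, and the function names are all hypothetical stand-ins for whatever a real faker library would provide. Enums, when present, take precedence.

```python
import random

# Hypothetical keyword-to-generator table. A real implementation would
# delegate to a faker library instead of these tiny stand-ins.
GENERATORS = {
    "first_name": lambda rng: rng.choice(["Ana", "Ben", "Chloe", "Dev"]),
    "city": lambda rng: rng.choice(["Manila", "Lagos", "Lima", "Oslo"]),
    "integer": lambda rng: str(rng.randint(0, 999)),
}


def synthesize_value(field_schema, rng):
    """Return one synthetic value for a single JSON Schema property."""
    # Enumerated values take precedence over any keyword.
    if "enum" in field_schema:
        return rng.choice(field_schema["enum"])
    description = field_schema.get("description", "").lower()
    for keyword, generator in GENERATORS.items():
        if keyword in description:
            return generator(rng)
    # No enum and no recognized keyword: emit a generic placeholder.
    return "unknown"


def synthesize_rows(schema, count, seed=None):
    """Build `count` synthetic rows, one value per schema property."""
    rng = random.Random(seed)
    fields = schema["properties"]
    return [
        {name: synthesize_value(spec, rng) for name, spec in fields.items()}
        for _ in range(count)
    ]


# Example schema; the descriptions and their keywords are assumptions.
EXAMPLE_SCHEMA = {
    "properties": {
        "name": {"type": "string", "description": "Customer first_name"},
        "town": {"type": "string", "description": "Home city"},
        "status": {"type": "string", "enum": ["active", "inactive"]},
        "notes": {"type": "string", "description": "Free text"},
    }
}
```

Calling `synthesize_rows(EXAMPLE_SCHEMA, 3, seed=42)` returns three dictionaries whose `status` is always one of the enum values, while `notes` falls back to the placeholder because its description has no recognized keyword.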

github-actions[bot] commented 2 years ago

Stale issue message

jqnatividad commented 1 year ago

902 sets the stage to revisit this. Once `fake` is done, we can then remove the `generate` command, which is just not performant enough and tends to generate gobbledygook test data anyway.

github-actions[bot] commented 1 year ago

Stale issue message

jqnatividad commented 1 year ago

We should also use frequency when generating "fake" data so that it mirrors the training data more closely.
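One way to read this suggestion is frequency-weighted sampling: draw each synthetic value with a probability proportional to how often it appears in the training column. The sketch below is a Python illustration under that assumption; `frequency_sampler` is a hypothetical helper, not part of qsv.

```python
import random
from collections import Counter


def frequency_sampler(training_values, seed=None):
    """Return a function that samples values in proportion to their
    observed frequency in `training_values`."""
    counts = Counter(training_values)
    values = list(counts)
    weights = [counts[v] for v in values]
    rng = random.Random(seed)

    def sample(n):
        # random.choices draws with replacement using relative weights.
        return rng.choices(values, weights=weights, k=n)

    return sample
```

For example, a sampler built from a column that is 90% `"a"` and 10% `"b"` will produce mostly `"a"`, so the synthetic column roughly mirrors the training distribution without copying its rows.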

github-actions[bot] commented 10 months ago

Stale issue message

jqnatividad commented 9 months ago

scheduling for 0.117.0 release

github-actions[bot] commented 7 months ago

Stale issue message

jqnatividad commented 6 months ago

Removing the `generate` command even before `fake` is done, as `generate` is unmaintained and has old dependencies weighing down qsv.

jqnatividad commented 6 months ago

Instead of `fake`, with all its negative connotations, name the command `synthesize`.
