shuttle-hq / synth

The Declarative Data Generator
https://www.getsynth.com/
Apache License 2.0
1.37k stars 108 forks source link

Feature: Conditional sampling #113

Open archnode opened 3 years ago

archnode commented 3 years ago

Required Functionality It would be great to be able to specify a subset of the schema using conditions while calling generate.

Example: When using the bank example, I would like to be able to specify that I only want samples for transactions with currency USD and GBP.

Proposed Solution I could imagine two solutions:

Use case

brokad commented 3 years ago

Thank you @archnode, this is a great idea!

The templating/patching approach we could make it work by following kustomize's UX (ish). For example, we could add a --customize flag to synth generate that takes a kind of customization.json for the schema:

{
  "transactions": {
    "content": {
      "currency": {
        "categorical": {
          "USD": 1,
          "GBP": 1
        }
      }
    }
  }
}

Then running

synth generate --customize customization.json examples/bank/bank_db

works as you want and only generates transactions in USD or GBP.

In terms of merge strategy, for subsetting for primitives (string, number, bool), we'll probably be fine with replacing the leaf schema node and recursing otherwise. So for the example above, we'll just replace the "currency" string node with the new categorical value.

There's code for this already that we currently use for merging in values on import into the schema. It should be fine using it for merging in schema instances.

I am not sure yet how to cover one_of with this though, but we can address that later, in a second PR!

christos-h commented 3 years ago

Looping in @fretz12 who would like to take this one!

fretz12 commented 3 years ago

hey @archnode @brokad - this sounds like a really useful feature. I wanted to run some proposed solutions by you to see if they make sense. I can see the CLI option being helpful in adhoc overrides, and the schema merging sounds useful if you're "permanently" overriding parts of the schema. What if we supported both?

CLI:

Basically a direct string replacement-

synth generate bank_db --override transactions.currency="{\"transactions\": {\"content\": {\"currency\": {\"categorical\": {\"USD\": 1,\"GBP\": 1}}}}}"

Schema Merge:

Basically brokad's way:

synth generate --override customization.json examples/bank/bank_db

I chose the keyword override cuz I've dealt with schema overrides before and that seemed to resonate well.

I'd love to hear your thoughts

brokad commented 3 years ago

@fretz12, why not both indeed, that sounds great! :smile:

For the CLI proposal, we could use Find<Content> which is implemented for each variant node of the schema tree. This is exposed by Namespace::get_s_node. You'll need the string the user submits to be deserializable into a Content though. So your example would rather look like:

synth generate bank_db \
      --override transactions.content.currency="{\"type\": \"string\", \"categorical\": {\"USD\": 1,\"GBP\": 1}}"

It wouldn't be hard to remove the need to have the "type" in there by assuming that if it's not specified, then the existing node's type prevails.

We could do this in two PRs or just one, I don't think this will get too big.

+1 for override

fretz12 commented 3 years ago

@brokad- thanks for the tip! I'll take that into account

archnode commented 2 years ago

Thank you very much for the feedback and the work on this!

Just for completeness: One aspect hat differs for my usecase (and something maybe I could work on in another issue) is that the customization should be more of a verified constraint than an override - I'd like to make sure that e.g. the defined currency is within the bounds of the original definition.