google / sre_yield

Python module to generate all regular expression matches
Apache License 2.0

Random values #29

Open jayvdb opened 4 years ago

jayvdb commented 4 years ago

There are lots of regex expanders which provide only one feature, and it is one that this library lacks: random values.

The result is that other similar codebases, typically not as well built (often broken or incomplete sre handling that is "good enough" for MVP), are getting more brain power invested in them.

No doubt this library can be adapted to do this easily: it already provides rather efficient slicing, so it would be simple to index into the sequence at a random position to get a random value.
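A minimal sketch of that idea, assuming only that the result space behaves like a sized, indexable sequence. The helper name `random_value` is hypothetical and not part of sre_yield, and a plain list stands in for the expanded matches so the snippet runs without the library installed:

```python
import random


def random_value(seq, rng=None):
    """Return one uniformly chosen element of a sized, indexable sequence.

    Hypothetical helper, not an sre_yield API. Raises ValueError if the
    sequence is empty, because randrange(0) has nothing to choose from.
    """
    rng = rng or random.Random()
    # Pick a random position, then rely on the sequence's own indexing.
    index = rng.randrange(len(seq))
    return seq[index]


# Stand-in for the expanded result space of a pattern such as [0-9]{2}.
matches = [f"{n:02d}" for n in range(100)]
value = random_value(matches, random.Random(42))
```

With the library itself, `seq` would presumably be the object returned by `sre_yield.AllStrings(pattern)`. For very large result spaces `len()` may overflow, so this is a starting point rather than a finished implementation.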

IMO that is worth building into this library, heralding it, and over time improving the performance by providing additional slicers that obtain a less-random value that is known to be easier to obtain.

If the random slicer can be used repeatedly, it can also serve as a mechanism for thinning a large result space; see https://github.com/google/sre_yield/issues/2

fwiw, I am not suggesting that the larger use case of "fake data" be included in this library. I think there should be many libraries which approach that type of problem. I see the objective as adding to this library the tools they would need to generate fake data values with high performance using an almost complete regex syntax.

thatch commented 4 years ago

Go for it; I don't think the data model is robust enough to do a tree-to-tree transform reliably, but if there was something like a get_random_item(rand=Random(), bias=lambda length: (1 / length)**2) call that did a recursive walk akin to get_item, that would come in useful.

jayvdb commented 4 years ago

Great. I'll put the related helpers in a module random.py. I don't foresee making any changes in the main modules. I would prefer the random value code to be sub-optimal rather than require additional complexity in the sequence classes. It doesn't need tree-to-tree. It just needs to pick values out of the result space, ideally as a generator, so multiple random values can be obtained from the same sequence without repeating until the sequence has been exhausted.
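One way such a non-repeating generator could look, as a sketch only. The name `iter_random_unique` is hypothetical, and the technique is a lazy Fisher-Yates shuffle over indexes, chosen here for illustration rather than taken from the library. It touches the sequence only through `len()` and indexing, and its bookkeeping never holds more entries than the number of values already drawn:

```python
import random


def iter_random_unique(seq, rng=None):
    """Yield every element of seq exactly once, in random order.

    Hypothetical helper, not an sre_yield API. Assumes len(seq) works,
    which may not hold for extremely large result spaces.
    """
    rng = rng or random.Random()
    n = len(seq)
    # Sparse view of a virtual index array where a[k] == k by default;
    # only positions that have been overwritten are stored.
    swapped = {}
    for i in range(n):
        j = rng.randrange(i, n)
        pick = swapped.get(j, j)
        # Move the index at position i into slot j, then forget slot i,
        # since positions below i + 1 are never visited again.
        swapped[j] = swapped.get(i, i)
        swapped.pop(i, None)
        yield seq[pick]


# A plain list stands in for the expanded matches of a pattern.
matches = ["aa", "ab", "ba", "bb"]
drawn = list(iter_random_unique(matches, random.Random(0)))
```

Because it is a generator, a caller can stop after any number of values, which also covers the thinning use case: take the first k items and discard the rest.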