Hi @misken,
Thank you for the feedback!
Yes, that is an excellent question & it's on our radar; we intend to further explore this in future work around this package. The workaround you mentioned would be what we currently recommend, but we are definitely thinking about functional (manually specified) constraints as a feature directly within metasyn.
As an example of a potential generic solution, you can check my exploratory gist on this topic using permutations (esp. the images at the bottom). This gist is programmed in Julia but gives a nice indication that many functional relationships may be possible here: https://gist.github.com/vankesteren/6c141f7cabcd3eb47292d78cfca1804d
To be more specific about the workaround, in your case I would
Yes, that makes sense as it retains the actual differences. Thanks.
I work with healthcare process data a lot and the need for synthetic data generation in this domain is huge. As part of testing `metasyn`, I grabbed one of my common test datasets which contains records of patients visiting a short stay unit. Here's what a little bit of it looks like:

If this was real patient data and I wanted to use `metasyn` to generate a synthetic version, I would have a few key requirements:

- `InRoomTS` and `OutRoomTS` would need to be valid datetimes that fall within some specified range,
- `OutRoomTS` must be greater than the `InRoomTS` for obvious reasons.

Can `metasyn` handle this type of use case? In the Generating MetaFrames section of the docs I see the use of `try_parse_dates` when reading the Polars dataframe. I tried that on my data and indeed it worked perfectly in terms of each column getting valid synthetic datetimes and the min and max values being enforced. Is there any way to enforce the `OutRoomTS` > `InRoomTS` constraint? I realize this is a "relationship" between variables and that `metasyn` explicitly wants to avoid retaining relational information. However, this type of relationship is more of a physical relationship and doesn't compromise confidentiality since the `InRoomTS` would be synthetic. This is certainly not a deal breaker regarding your paper, just a genuine question regarding the capabilities of `metasyn` for this type of use case.

I realize, of course, that I could simply use `metasyn` to create synthetic versions of `InRoomTS` and `LOS_hours` and then use those to compute `OutRoomTS`. In most real datasets, computed fields like `LOS_hours` would not exist and we'd just have the timestamps. Maybe I've just answered my own question. If I didn't have the `LOS_hours` field, I'd simply have `metasyn` generate one using one of the built in distributions and specifying that there was no underlying data for the field.

Kudos to the speed of your package. I imagine using `polars` contributes to this.
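
For readers following along, here is a minimal sketch of the workaround described above: synthesize `InRoomTS` and a length-of-stay column, then derive `OutRoomTS` from them. It uses only the Python standard library. The helper names `los_hours` and `derive_out_room` and the sample rows are hypothetical, and nothing here is part of the `metasyn` API.

```python
from datetime import datetime, timedelta


def los_hours(in_room: datetime, out_room: datetime) -> float:
    """Length of stay in hours, computed from a pair of timestamps."""
    return (out_room - in_room).total_seconds() / 3600


def derive_out_room(in_room: datetime, los: float) -> datetime:
    """Rebuild the exit timestamp from an entry timestamp and a stay length in hours."""
    # A strictly positive stay guarantees OutRoomTS > InRoomTS.
    if los <= 0:
        raise ValueError("length of stay must be positive")
    return in_room + timedelta(hours=los)


# Hypothetical rows standing in for synthetic InRoomTS / LOS_hours values.
synthetic_rows = [
    {"InRoomTS": datetime(2024, 1, 5, 8, 30), "LOS_hours": 2.5},
    {"InRoomTS": datetime(2024, 1, 5, 23, 15), "LOS_hours": 1.0},
]

# Derive the exit timestamp for each row after generation.
for row in synthetic_rows:
    row["OutRoomTS"] = derive_out_room(row["InRoomTS"], row["LOS_hours"])
```

Because `OutRoomTS` is computed after generation, it can land past the upper bound of the allowed datetime range when a late `InRoomTS` is paired with a long stay, so the range requirement would still need a separate check.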