sodascience / metasyn

Transparent and privacy-friendly synthetic data generation
https://metasyn.readthedocs.io
MIT License

Use case question involving timestamped data (JOSS review) #324

Closed by misken 1 month ago

misken commented 1 month ago

I work with healthcare process data a lot and the need for synthetic data generation in this domain is huge. As part of testing metasyn, I grabbed one of my common test datasets which contains records of patients visiting a short stay unit. Here's what a little bit of it looks like:

PatID,InRoomTS,OutRoomTS,PatType,LOS_hours
1,2024-01-01 07:44:00,2024-01-01 09:20:00,IVT,1.6
2,2024-01-01 08:28:00,2024-01-01 11:13:00,IVT,2.75
3,2024-01-01 11:44:00,2024-01-01 12:48:00,MYE,1.0666666666666667
4,2024-01-01 11:51:00,2024-01-01 21:10:00,CAT,9.316666666666666

If this was real patient data and I wanted to use metasyn to generate a synthetic version, I would have a few key requirements:

Can metasyn handle this type of use case? In the "Generating MetaFrames" section of the docs I see the use of try_parse_dates when reading the Polars dataframe. I tried that on my data and it worked perfectly: each column got valid synthetic datetimes, and the min and max values were enforced.

Is there any way to enforce the OutRoomTS > InRoomTS constraint? I realize this is a "relationship" between variables and that metasyn explicitly wants to avoid retaining relational information. However, this is more of a physical relationship, and it doesn't compromise confidentiality since the InRoomTS would be synthetic. This is certainly not a deal breaker regarding your paper, just a genuine question about the capabilities of metasyn for this type of use case.

I realize, of course, that I could simply use metasyn to create synthetic versions of InRoomTS and LOS_hours and then use those to compute OutRoomTS. In most real datasets, however, computed fields like LOS_hours would not exist and we'd just have the timestamps. Maybe I've just answered my own question: if I didn't have the LOS_hours field, I'd simply have metasyn generate one using one of the built-in distributions, specifying that there is no underlying data for the field.

Kudos on the speed of your package. I imagine using Polars contributes to this.

vankesteren commented 1 month ago

Hi @misken,

Thank you for the feedback!

Yes, that is an excellent question & it's on our radar; we intend to further explore this in future work around this package. The workaround you mentioned would be what we currently recommend, but we are definitely thinking about functional (manually specified) constraints as a feature directly within metasyn.

As an example of a potential generic solution, you can check my exploratory gist on this topic using permutations (esp. the images at the bottom). This gist is programmed in Julia but gives a nice indication that many functional relationships may be possible here: https://gist.github.com/vankesteren/6c141f7cabcd3eb47292d78cfca1804d

vankesteren commented 1 month ago

To be more specific about the workaround, in your case I would

misken commented 1 month ago

Yes, that makes sense as it retains the actual differences. Thanks.
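
For readers following this thread, the workaround discussed above (synthesize InRoomTS and a duration column, then compute OutRoomTS from the two) can be sketched in plain Python. This is a minimal illustration, not metasyn's API. The helper names `los_hours` and `derive_out_ts` are hypothetical, the metasyn fitting and synthesis step is deliberately not shown, and the "synthetic" values are hand-written stand-ins for what a synthesizer might return.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def los_hours(in_ts: str, out_ts: str) -> float:
    """Length of stay in hours between two timestamp strings."""
    delta = datetime.strptime(out_ts, TS_FORMAT) - datetime.strptime(in_ts, TS_FORMAT)
    return delta.total_seconds() / 3600


def derive_out_ts(in_ts: str, hours: float) -> str:
    """Compute OutRoomTS as InRoomTS plus a non-negative length of stay."""
    if hours < 0:
        raise ValueError("length of stay must be non-negative")
    out = datetime.strptime(in_ts, TS_FORMAT) + timedelta(hours=hours)
    return out.strftime(TS_FORMAT)


# Step 1 (real data): derive the duration column when only timestamps exist.
# These two rows are taken from the sample shown in the issue.
real_rows = [
    ("2024-01-01 07:44:00", "2024-01-01 09:20:00"),
    ("2024-01-01 08:28:00", "2024-01-01 11:13:00"),
]
real_los = [los_hours(in_ts, out_ts) for in_ts, out_ts in real_rows]

# Step 2 (not shown): synthesize InRoomTS and the duration column.
# The pairs below are hypothetical stand-ins for synthetic output.
synthetic = [
    ("2024-01-01 10:05:00", 2.5),
    ("2024-01-01 13:30:00", 0.75),
]

# Step 3 (synthetic data): rebuild OutRoomTS from the two synthetic columns.
synthetic_rows = [(in_ts, derive_out_ts(in_ts, hours)) for in_ts, hours in synthetic]
```

Because OutRoomTS is computed rather than sampled independently, it can never precede InRoomTS, and the ordering is strict whenever the synthetic duration is positive.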