sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Certain attributes are mapped as Unknown SDType and we have to change the dtype using custom script #2049

Closed ankurpuri1981 closed 1 week ago

ankurpuri1981 commented 4 weeks ago

Error Description

Certain attributes are detected as the Unknown sdtype, so we have to change their sdtype using a custom script. Other attributes are identified correctly. The generated schema JSON file is attached for reference. Also, for 2 tables, the relationship was not detected, so we had to add it within the custom script.

Steps to reproduce

Use the attached input dataset to generate metadata for the multi-table schema, then check the resulting metadata JSON file.

[multitable_metadata_motor_vehicle_theft.json](https://github.com/user-attachments/files/15739001/multitable_metadata_motor_vehicle_theft.json)
[Input Dataset.zip](https://github.com/user-attachments/files/15739026/Input.Dataset.zip)

srinify commented 3 weeks ago

Hi there @ankurpuri1981

SDV makes a best-guess effort at detecting sdtypes and table relationships during automatic metadata detection, and then provides convenience methods for updating the metadata so you can tweak and customize it. We've found this approach to be the best way to balance reducing friction (through best-guess automatic detection) with giving users transparency and control over their metadata, which helps ensure higher-quality synthetic data.

The sdtype is set to Unknown when SDV can't cleanly assign a better sdtype, and these fields are treated as PII (personally identifiable information) fields.
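
To see which columns fell back to this sdtype, the saved metadata JSON can be scanned directly. The following is a minimal sketch, assuming the multi-table metadata layout in which a top-level `tables` mapping holds a `columns` mapping per table, and the fallback is stored as the lowercase string `unknown`. The helper name `find_unknown_columns` and the sample tables and columns are hypothetical and are not taken from the attached file.

```python
def find_unknown_columns(metadata):
    """Return {table_name: [column_name, ...]} for columns whose sdtype is 'unknown'."""
    result = {}
    for table_name, table in metadata.get("tables", {}).items():
        unknown = [
            column_name
            for column_name, properties in table.get("columns", {}).items()
            if properties.get("sdtype") == "unknown"
        ]
        if unknown:
            result[table_name] = unknown
    return result


# Hypothetical metadata, shaped like a saved multi-table metadata JSON file.
sample_metadata = {
    "tables": {
        "stolen_vehicles": {
            "primary_key": "vehicle_id",
            "columns": {
                "vehicle_id": {"sdtype": "id"},
                "color": {"sdtype": "unknown", "pii": True},
            },
        },
        "locations": {
            "primary_key": "location_id",
            "columns": {
                "location_id": {"sdtype": "id"},
                "region": {"sdtype": "categorical"},
            },
        },
    },
    "relationships": [],
}

unknown_columns = find_unknown_columns(sample_metadata)
print(unknown_columns)
```

For a real file, the dictionary would come from `json.load` on the saved metadata rather than from an inline literal.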

It looks like you've already found the metadata updating methods, but I'm linking them here so you have them handy: https://docs.sdv.dev/sdv/multi-table-data/data-preparation/multi-table-metadata-api#update-api
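
As a rough illustration of what such a custom correction step amounts to, here is a dependency-free sketch that applies sdtype overrides and a missing relationship to a metadata dictionary. It does not call SDV at all; in practice the update methods linked above are the supported route and also validate the result. The function name `apply_overrides`, the table and column names, and the chosen sdtypes are all hypothetical. The relationship keys are assumed to follow the saved multi-table metadata JSON layout.

```python
import copy


def apply_overrides(metadata, sdtype_overrides, relationships):
    """Return a copy of metadata with sdtypes replaced and relationships appended."""
    updated = copy.deepcopy(metadata)
    for (table_name, column_name), sdtype in sdtype_overrides.items():
        column = updated["tables"][table_name]["columns"][column_name]
        column.clear()  # drop properties (such as 'pii') tied to the old sdtype
        column["sdtype"] = sdtype
    updated.setdefault("relationships", []).extend(copy.deepcopy(relationships))
    return updated


# Hypothetical detected metadata: one column left as 'unknown', no relationship found.
detected = {
    "tables": {
        "locations": {
            "primary_key": "location_id",
            "columns": {"location_id": {"sdtype": "id"}},
        },
        "stolen_vehicles": {
            "primary_key": "vehicle_id",
            "columns": {
                "vehicle_id": {"sdtype": "id"},
                "location_id": {"sdtype": "id"},
                "color": {"sdtype": "unknown", "pii": True},
            },
        },
    },
    "relationships": [],
}

updated = apply_overrides(
    detected,
    sdtype_overrides={("stolen_vehicles", "color"): "categorical"},
    relationships=[
        {
            "parent_table_name": "locations",
            "parent_primary_key": "location_id",
            "child_table_name": "stolen_vehicles",
            "child_foreign_key": "location_id",
        }
    ],
)
print(updated["tables"]["stolen_vehicles"]["columns"]["color"])
```

Working on a deep copy keeps the detected metadata intact, so the automatic guess and the corrected version can be compared afterwards.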

Out of curiosity, where does the source data that you're trying to feed into SDV live? A database? An API endpoint? Flat files in a file store?

srinify commented 1 week ago

Hi there @ankurpuri1981, I hope my response was useful! I haven't heard from you in 2 weeks, so I'm going to move forward with closing this issue out.

Feel free to open a new issue if you have more questions!