sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Does SDV support schemaless databases like MongoDB? If yes, could you please provide docs/examples? #1527

Closed amuthan-sakthivel closed 1 year ago

npatki commented 1 year ago

Hi @amuthan-sakthivel, nice to meet you. I am not an expert in MongoDB but I know that you can make the SDV work for most schemaless databases.

The basic requirement is that your data points have similar fields, and that those fields hold basic types such as numbers and strings. (SDV does not currently handle rich media such as audio or images.)

As long as those conditions are met, you should be able to express the data in a tabular format that the SDV accepts. As a simple example, consider that your records may be in a format like this:

{
    "user_id": "dc0x3",
    "age": 34,
    "credit_type": "VISA",
    ...
}, {
    "user_id": "d1cqa",
    "age": 23,
    ...
}

Then it can be converted to a tabular format such as:

user_id   age   credit_type   ...
dc0x3     34    VISA          ...
d1cqa     23    None          ...
...       ...   ...           ...

The SDV accepts this format and creates synthetic data for it. (Note that when a field is missing from a record, you mark it as None in the table.)
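
As a minimal sketch of that conversion, assuming the records have already been loaded into Python dictionaries, pandas can build the table directly. The `records` list below is hypothetical and simply mirrors the example above. Note that pandas fills a missing field with `NaN`, its own missing-value marker, rather than the literal `None`.

```python
import pandas as pd

# Hypothetical records mirroring the example above; the second record
# has no "credit_type" field.
records = [
    {"user_id": "dc0x3", "age": 34, "credit_type": "VISA"},
    {"user_id": "d1cqa", "age": 23},
]

# Building a DataFrame from a list of dicts uses the union of all keys
# as columns and marks absent values as missing (NaN).
table = pd.DataFrame(records)
print(table)
```

Every field that appears in at least one record becomes a column, so records with slightly different fields still fit into one table.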

Let me know if that answers your question!

Notes:

  1. The SDV still requires you to describe the data types and formats of each column as metadata.
  2. More complex representations may require more careful conversion. If you're able to provide us with more information about your use case, we may be able to guide you. I am aware of users who have successfully created synthetic data for this type of database!

amuthan-sakthivel commented 1 year ago

@npatki - Thanks for answering. We basically want to create synthetic test data using the prod database as a reference. Each collection has a different schema, and it seems we might have to spend a lot of effort on these 2 activities:

  1. converting JSON to table format (please let us know if there is some sort of utility already available)
  2. metadata generation for each of the collections we have

npatki commented 1 year ago

Hi @amuthan-sakthivel, my pleasure. From my experience helping other users, this is not too much work as long as you understand what the SDV library expects. There are also some functions available to you for convenience.

  1. Converting JSON to table: The pandas library (which is installed automatically with the SDV) has a number of convenience functions for data manipulation, including read_json.
  2. Metadata generation: The SDV offers automatic metadata detection for single- and multi-table formats.
  3. SDV Demos: I'd recommend going through the SDV Demos to better understand the SDV.
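
The first two steps could be sketched roughly as follows. This is an illustration under assumptions, not a definitive recipe: the JSON strings and the `detect_metadata` helper are hypothetical, `pd.json_normalize` is one possible way to flatten nested documents (the thread does not prescribe it), and the metadata call assumes the SDV 1.x `SingleTableMetadata` class, which may differ in other releases.

```python
import io
import json

import pandas as pd

# Hypothetical JSON exports of two collections (not from the original thread).
flat_json = (
    '[{"user_id": "dc0x3", "age": 34, "credit_type": "VISA"},'
    ' {"user_id": "d1cqa", "age": 23}]'
)
nested_json = (
    '[{"user_id": "dc0x3", "age": 34,'
    '  "address": {"city": "Boston", "zip": "02139"}},'
    ' {"user_id": "d1cqa", "age": 23}]'
)

# Flat documents: read_json turns an array of records into a table.
flat_table = pd.read_json(io.StringIO(flat_json), orient="records")

# Nested documents: json_normalize expands sub-documents into columns
# such as "address_city"; records without the sub-document get NaN.
nested_table = pd.json_normalize(json.loads(nested_json), sep="_")


def detect_metadata(df):
    """Hypothetical helper: auto-detect single-table metadata.

    Assumes SDV 1.x is installed. The import is done lazily so the
    pandas steps above run even without the SDV.
    """
    from sdv.metadata import SingleTableMetadata

    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data=df)
    return metadata


print(flat_table)
print(nested_table)
```

Detected metadata is a starting point. It is worth reviewing the inferred column types for each collection before fitting a synthesizer.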

We're happy to help here too. Do you have an example of different collections and how their data schemas are different?

npatki commented 1 year ago

Hi @amuthan-sakthivel, do you have anything further to discuss around this topic?

Since this issue has been inactive for a few weeks, I'm closing it off as answered. Please feel free to reply if there are any follow-ups. I can always reopen the issue.