sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Getting KeyError while generation of data (synthesizer.sample()) - sdv==1.12.1 #2026

Closed burhanuddin-123 closed 3 weeks ago

burhanuddin-123 commented 1 month ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

I want to generate synthetic data at scale for two related tables, Customers and Orders, where Customers is the parent and Orders is the child. After validating the MultiTableMetadata and applying constraints, I was able to fit the HMASynthesizer on the real data.

But while generating the sample data, I am getting the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File c:\Users\burha\Mentorskool\Synthetic Data Vault\new-venv\Lib\site-packages\pandas\core\indexes\base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas\\_libs\\hashtable_class_helper.pxi:2606, in pandas._libs.hashtable.Int64HashTable.get_item()

File pandas\\_libs\\hashtable_class_helper.pxi:2630, in pandas._libs.hashtable.Int64HashTable.get_item()

KeyError: 7

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[208], line 2
      1 # Step 3: Generate synthetic data
----> 2 synthetic_data = synthesizer.sample(scale=0.01)  # it gives error

File c:\Users\burha\Mentorskool\Synthetic Data Vault\new-venv\Lib\site-packages\sdv\multi_table\base.py:423, in BaseMultiTableSynthesizer.sample(self, scale)
...
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: 7

I tried sampling multiple times and got a different KeyError each time, such as KeyError: 4 or KeyError: 7, which makes it difficult to identify the root cause.

srinify commented 1 month ago

Hi @burhanuddin-123 👋

Do you mind sharing more context with us so we can try to reproduce the issue on our end?

One thing I want to rule out is a referential integrity problem. Referential integrity means that every value in a foreign key column references a valid, existing primary key value. We created a function in our utils library to help process your data before model fitting. Try running that step before fitting and sampling. I doubt this is the issue, since SDV usually checks for referential integrity, but I still want to rule it out first.
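
For anyone who wants to see what such a check looks like, here is a minimal sketch in plain Python. It is not the SDV utility mentioned above; the function name `find_orphan_rows` and the `customer_id` / `order_id` columns are hypothetical, chosen only to match the Customers and Orders tables described in this issue. It lists the child rows whose foreign key value does not appear among the parent's primary key values.

```python
def find_orphan_rows(parent_rows, child_rows, primary_key, foreign_key):
    """Return child rows whose foreign key value has no matching parent row."""
    # Collect every primary key value that actually exists in the parent table.
    valid_keys = {row[primary_key] for row in parent_rows}
    # A child row is an orphan when its foreign key is not among those values.
    return [row for row in child_rows if row[foreign_key] not in valid_keys]


# Hypothetical sample data; the column names are assumptions for illustration.
customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 3}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
    {"order_id": 12, "customer_id": 7},  # no customer with id 7 exists
]

orphans = find_orphan_rows(customers, orders, "customer_id", "customer_id")
print(orphans)
```

An empty result means every order points at an existing customer. Note that a missing foreign key value (for example `None`) is also reported as an orphan by this sketch. If the tables are pandas DataFrames, the equivalent filter would be `orders[~orders["customer_id"].isin(customers["customer_id"])]`, again assuming those column names.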

srinify commented 1 month ago

Hi there @burhanuddin-123 are you still running into this issue?

Another user ran into a very similar issue and it seems to be related to the scale parameter in their case. What value are you using for scale when sampling from HMA Synthesizer?

We opened this new issue to track the bug with the proposed solution as well: https://github.com/sdv-dev/SDV/issues/2045

srinify commented 3 weeks ago

Hi there @burhanuddin-123 I haven't heard from you in a while so I'm going to go ahead and close this issue out. Please see the suggested workaround if you're still running into this issue: https://github.com/sdv-dev/SDV/issues/2045#issue-2334275417