sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Metadata invalid for nations_v1, Toxicology_v1 datasets #1606

Closed JanJacekJaniszewski closed 11 months ago

JanJacekJaniszewski commented 11 months ago

Error Description

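The demo metadata for nations_v1 fails validation: its relationship between the 'stat' and 'country' tables does not connect a primary key with a non-primary key. The Toxicology_v1 demo data also fails validation against its metadata, because the foreign key column 'molecule_id' contains values that do not reference any primary key.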

Steps to reproduce: 1st bug

Input

from sdv.multi_table import HMASynthesizer
from sdv.datasets.demo import download_demo

real_data, real_metadata = download_demo(
    modality='multi_table',
    dataset_name='nations_v1'
)

synthesizer = HMASynthesizer(real_metadata)

Output

InvalidMetadataError: The metadata is not valid
Relationships:
Invalid relationship between table 'stat' and table 'country'. A relationship must connect a primary key with a non-primary key.
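
One way to see which relationship triggers this message is to look for relationships whose child foreign key is also the child table's primary key. The sketch below is a rough illustration, not SDV's own validation logic. It assumes the metadata has been converted to a plain dictionary with `tables` and `relationships` entries (for example via `real_metadata.to_dict()`, which is an assumption about the installed SDV version). The helper name and every table and column name in the toy dictionary are hypothetical.

```python
def find_primary_key_foreign_keys(metadata_dict):
    """Return relationships whose child foreign key is also the child's primary key.

    Assumes a dictionary with 'tables' and 'relationships' entries; this layout
    is an assumption about the metadata format, not something confirmed here.
    """
    tables = metadata_dict.get('tables', {})
    flagged = []
    for relationship in metadata_dict.get('relationships', []):
        child_table = relationship['child_table_name']
        child_primary_key = tables.get(child_table, {}).get('primary_key')
        if child_primary_key is not None and relationship['child_foreign_key'] == child_primary_key:
            flagged.append(relationship)
    return flagged


# Hypothetical toy metadata; the column names are made up for illustration.
toy_metadata = {
    'tables': {
        'country': {'primary_key': 'country_id'},
        'stat': {'primary_key': 'country_id'},
        'event': {'primary_key': 'event_id'},
    },
    'relationships': [
        {   # foreign key doubles as the child's primary key: flagged
            'parent_table_name': 'country',
            'parent_primary_key': 'country_id',
            'child_table_name': 'stat',
            'child_foreign_key': 'country_id',
        },
        {   # foreign key is a regular column in the child: not flagged
            'parent_table_name': 'country',
            'parent_primary_key': 'country_id',
            'child_table_name': 'event',
            'child_foreign_key': 'country_id',
        },
    ],
}

flagged = find_primary_key_foreign_keys(toy_metadata)
for relationship in flagged:
    print(relationship['parent_table_name'], '->', relationship['child_table_name'])
```

Running the same helper on the dictionary form of the nations_v1 metadata may show whether the 'stat' and 'country' relationship has this shape.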

Steps to reproduce: 2nd bug

Input

from sdv.multi_table import HMASynthesizer
from sdv.datasets.demo import download_demo

real_data, real_metadata = download_demo(
    modality='multi_table',
    dataset_name='Toxicology_v1'
)

synthesizer = HMASynthesizer(real_metadata)

Output

InvalidDataError: The provided data does not match the metadata:
Relationships:
Error: foreign key column 'molecule_id' contains unknown references: (TR003, TR005, TR013, TR016, TR018, + more). All the values in this column must reference a primary key.
Error: foreign key column 'molecule_id' contains unknown references: (TR003, TR005, TR013, TR016, TR018, + more). All the values in this column must reference a primary key.
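
The unknown references reported above can be listed in full by comparing the foreign key values with the parent table's primary key values. This is a minimal sketch using only the standard library. The helper name is hypothetical, and the IDs in the toy lists are made up to resemble the ones in the error message. Because it accepts any iterable, it should also work when passed two pandas columns.

```python
def unknown_references(child_values, parent_keys):
    """Return the sorted foreign key values that are missing from the parent keys."""
    known = set(parent_keys)
    missing = set()
    for value in child_values:
        # Skip missing values: None, or NaN (the only value not equal to itself).
        if value is None or value != value:
            continue
        if value not in known:
            missing.add(value)
    return sorted(missing)


# Hypothetical toy data, not taken from the Toxicology_v1 dataset.
parent_ids = ['TR000', 'TR001', 'TR002']
child_ids = ['TR000', 'TR003', 'TR001', 'TR005', 'TR003', None]

orphans = unknown_references(child_ids, parent_ids)
print(orphans)
```

For the downloaded demo data, the two arguments would be the 'molecule_id' column of the child table and the primary key column of the table it references. The table names are not given in the error output, so they need to be read from the metadata.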

npatki commented 11 months ago

Hi @JanJacekJaniszewski, thanks for reporting.

Indeed, we are aware that there are a few multi-table schemas that currently have invalid metadata. We're tracking this in issue #1297. I will mark this issue as a duplicate in favor of the existing one.

Note that this has not yet been prioritized, as most users will (a) try out the demo notebooks, and (b) use the SDV on their own datasets. Are the nations_v1 and Toxicology_v1 datasets useful for your project in some way? We prioritize these issues based on demand, so if you could describe a little more about what you're working on, that would be helpful. Thanks.