sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Allow keys to be used in constraints where relevant (e.g. foreign key in Unique constraint) #961

Open LiFaytheGoblin opened 1 year ago

LiFaytheGoblin commented 1 year ago

Environment Details

Error Description

I tried to create relational data with two tables, sections and elements, where each element references a section.

For elements I added a Unique constraint for the combination of the columns section and rank, so that the rank is unique per section.

However, running the code now produces the following warning on the line model.fit(data): UserWarning: Unique cannot be transformed because columns: ['section'] were not found. Using the reject sampling approach instead.

I do not receive any new data.

Steps to reproduce

I use the following code:

from sdv.metadata.dataset import Metadata
from sdv.relational import HMA1

md = Metadata("test-data/metadata-test-2.json")
data = md.load_tables()
model = HMA1(md)
model.fit(data)
new_data = model.sample()

An extract from elements-test-2.csv:

element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book

An extract from sections-test-2.csv:

section_id,rank,elements_amount
58964,1,2
58967,2,4

My metadata is as follows:

{
  "tables": {
    "sections": {
        "fields": {
          "section_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "elements_amount": {"type": "numerical", "subtype": "integer" }
        },
        "path": "sections-test-2.csv",
        "primary_key": "section_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank"]
          }
        ]
    },
    "elements": {
        "fields": {
          "element_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "type": { "type": "categorical" },
          "section": {
            "type": "id",
            "subtype": "integer",
            "ref": {
              "table": "sections",
              "field": "section_id"
            }
          }
        },
        "path": "elements-test-2.csv",
        "primary_key": "element_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank", "section"]
          }
        ]
    }
  }
}

The problematic part seems to be

"constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank", "section"]
          }
        ]

since the error is not thrown when I remove this part.

Explanation

Neha explained: "This is happening because you have a foreign key column involved in the Unique constraint. SDV treats primary/foreign keys in a separate layer so it is no longer 'found' when it gets to the constraint stage."

Workaround

I have found the following workaround:

I duplicated the column that was not found (section) under a second name (section_alt), so that one copy serves as the foreign key and the other is used in my Unique constraint. SDV still learns that the two columns are identical, so in the end I receive unique ranks per section.

Extract of my new elements table:

element_id,section,rank,type,section_alt
1,58964,1,label,58964
2,58964,2,forum,58964
3,58967,2,page,58967
4,58967,1,book,58967
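The extra column can be added to the CSV with a few lines of Python. This is only a minimal sketch using the standard library; the helper name add_duplicate_column is made up for illustration, and the CSV text is inlined from the extract above instead of being read from disk.

```python
import csv
import io


def add_duplicate_column(csv_text, source="section", copy="section_alt"):
    """Return CSV text with `source` copied into a new trailing column `copy`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [copy]

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # The copy is a plain column, so it can be named in a constraint
        # while the original stays the foreign key.
        row[copy] = row[source]
        writer.writerow(row)
    return out.getvalue()


# Extract of the original elements table, inlined for the example.
original = (
    "element_id,section,rank,type\n"
    "1,58964,1,label\n"
    "2,58964,2,forum\n"
    "3,58967,2,page\n"
    "4,58967,1,book\n"
)

widened = add_duplicate_column(original)
print(widened)
```

If the table is already loaded as a pandas DataFrame, the equivalent is the single assignment elements["section_alt"] = elements["section"].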

My new metadata:

{
  "tables": {
    "sections": {
        "fields": {
          "section_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "elements_amount": {"type": "numerical", "subtype": "integer" }
        },
        "path": "sections-test-2.csv",
        "primary_key": "section_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank"]
          }
        ]
    },
    "elements": {
        "fields": {
          "element_id": { "type": "id", "subtype": "integer" },
          "rank": { "type": "numerical", "subtype": "integer" },
          "type": { "type": "categorical" },
          "section": {
            "type": "id",
            "subtype": "integer",
            "ref": {
              "table": "sections",
              "field": "section_id"
            }
          },
          "section_alt": { "type": "numerical", "subtype": "integer" }
        },
        "path": "elements-test-2.csv",
        "primary_key": "element_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank", "section_alt"]
          }
        ]
    }
  }
}
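Since the workaround relies on SDV keeping section and section_alt identical, it is worth checking the sampled rows for both properties. The following is a rough sketch of such a check, not part of SDV; the function names are hypothetical and the rows are stand-in values, not real model output.

```python
from collections import Counter


def duplicate_combinations(rows, columns):
    """Return the value combinations of `columns` that occur more than once."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {combo: n for combo, n in counts.items() if n > 1}


def mismatched_rows(rows, left="section", right="section_alt"):
    """Return the rows where the duplicated column drifted from the foreign key."""
    return [row for row in rows if row[left] != row[right]]


# Stand-in for sampled rows of the elements table (illustrative values only).
sampled = [
    {"element_id": 1, "section": 58964, "rank": 1, "section_alt": 58964},
    {"element_id": 2, "section": 58964, "rank": 2, "section_alt": 58964},
    {"element_id": 3, "section": 58967, "rank": 1, "section_alt": 58967},
]

# Both results should be empty if the workaround held.
print(duplicate_combinations(sampled, ["rank", "section_alt"]))
print(mismatched_rows(sampled))
```

Assuming model.sample() returns a dict of pandas DataFrames keyed by table name, rows in this shape can be obtained with new_data["elements"].to_dict("records").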

Suggestion

npatki commented 1 year ago

Thanks for filing @LiFaytheGoblin, we will investigate and report more info here.

For SDV developers: I think it's fine if such a constraint falls back to our reject sampling approach (instead of transform). It's strange that reject sampling is failing though. Perhaps we are doing it too early, before the foreign key is added back in?
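For readers unfamiliar with the term, reject sampling for a Unique constraint roughly means drawing candidate rows and discarding any whose key combination has already been seen. The sketch below is a generic illustration under that assumption, not SDV's implementation; reject_sample_unique and draw_row are hypothetical names. Note that it can only work if every key column is present in the candidate rows, which is why the timing of adding the foreign key back matters.

```python
import random


def reject_sample_unique(draw_row, key_columns, num_rows, max_tries=10_000):
    """Draw rows until `num_rows` have distinct values in `key_columns`."""
    seen = set()
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == num_rows:
            break
        row = draw_row()
        # Raises KeyError if a key column (e.g. the foreign key) is missing.
        key = tuple(row[c] for c in key_columns)
        if key in seen:
            continue  # reject: this combination was already used
        seen.add(key)
        accepted.append(row)
    # May hold fewer than num_rows rows if max_tries runs out first.
    return accepted


rng = random.Random(0)


def draw_row():
    """Toy row generator standing in for a fitted model."""
    return {"section": rng.choice([58964, 58967]), "rank": rng.randint(1, 2)}


rows = reject_sample_unique(draw_row, ["section", "rank"], num_rows=4)
print(rows)
```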

npatki commented 1 year ago

Update: Seems like we explicitly do not support any keys (foreign or primary) in constraints at the moment.

I'll turn this into a feature request and update the title to reflect this.