Allow keys to be used in constraints where relevant (e.g. a foreign key in a Unique constraint)

LiFaytheGoblin commented 1 year ago

Environment Details

SDV version: 0.16.0
Python version: 3.8
Operating System: Ubuntu 20.04.4

Error Description

I tried to create relational data with two tables:

a table of sections with their id, rank, and number of elements, and
a table of elements with their id, the section they belong to, their rank within that section, and their type

For elements I added a Unique constraint for the combination of the columns section and rank, so that the rank is unique per section.

However, running the script now produces the following warning, reported on the line model.fit(data): UserWarning: Unique cannot be transformed because columns: ['section'] were not found. Using the reject sampling approach instead.

I do not receive any new data.

Steps to reproduce

I use the following code:

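from sdv.metadata.dataset import Metadata
from sdv.relational import HMA1

# Load the metadata and the tables it references
md = Metadata("test-data/metadata-test-2.json")
data = md.load_tables()

# Fit the relational model; the UserWarning is reported on the fit call
model = HMA1(md)
model.fit(data)

# Sample new synthetic tables
new_data = model.sample()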

An extract from elements-test-2.csv:

element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book
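
As a sanity check that the source data itself satisfies the intended constraint (rank unique within each section), the extract above can be scanned for repeated (section, rank) pairs. This is a minimal sketch using only the Python standard library; the helper name find_duplicate_pairs is made up for illustration and is not part of SDV.

```python
import csv
import io
from collections import Counter

# The extract from elements-test-2.csv shown above, inlined so the
# sketch is self-contained.
ELEMENTS_CSV = """element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book
"""


def find_duplicate_pairs(csv_text, columns=("section", "rank")):
    """Return the value combinations of `columns` that occur more than once.

    Hypothetical helper for illustration; values are compared as strings.
    """
    rows = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return [combo for combo, n in counts.items() if n > 1]


duplicates = find_duplicate_pairs(ELEMENTS_CSV)
print(duplicates)  # an empty list means rank is unique within each section
```

An empty result here suggests the warning does not come from the data violating the constraint.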

An extract from sections-test-2.csv:

section_id,rank,elements_amount
58964,1,2
58967,2,4

My metadata is as follows:

{
  "tables": {
    "sections": {
        "fields": {
          "section_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "elements_amount": {"type": "numerical", "subtype": "integer" }
        },
        "path": "sections-test-2.csv",
        "primary_key": "section_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank"]
          }
        ]
    },
    "elements": {
        "fields": {
          "element_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "type": { "type": "categorical" },
          "section": {
            "type": "id",
            "subtype": "integer",
            "ref": {
              "table": "sections",
              "field": "section_id"
            }
          }
        },
        "path": "elements-test-2.csv",
        "primary_key": "element_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank", "section"]
          }
        ]
    }
  }
}

The problematic part seems to be

"constraints": [
  {
    "constraint": "sdv.constraints.Unique",
    "column_names": ["rank", "section"]
  }
]

since the error is not thrown when I remove this part.

Explanation

Neha explained: "This is happening because you have a foreign key column involved in the Unique constraint. SDV treats primary/foreign keys in a separate layer so it is no longer “found” when it gets to the constraint stage."
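
To make that explanation concrete, here is a toy illustration of the idea, not SDV's actual code: if the key columns are set aside before the constraint sees the table, any constraint column that is also a key can no longer be found. The function name and the order of operations are assumptions made purely for illustration.

```python
def missing_constraint_columns(table_columns, key_columns, constraint_columns):
    """Return the constraint columns that are absent once keys are set aside.

    Toy model only; SDV's real pipeline is more involved.
    """
    remaining = [c for c in table_columns if c not in key_columns]
    return [c for c in constraint_columns if c not in remaining]


missing = missing_constraint_columns(
    table_columns=["element_id", "section", "rank", "type"],
    key_columns=["element_id", "section"],  # primary key and foreign key
    constraint_columns=["rank", "section"],  # columns of the Unique constraint
)
print(missing)  # ['section']
```

In this toy model the only missing column is section, which matches the column named in the warning.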

Workaround

I have found the following workaround:

I duplicated the column that was not found, so that one copy serves as the foreign key and the other is used in my Unique constraint. SDV still learns that the two columns are identical, so in the end I receive unique ranks per section.

Extract of my new elements table:

element_id,section,rank,type,section_alt
1,58964,1,label,58964
2,58964,2,forum,58964
3,58967,2,page,58967
4,58967,1,book,58967
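
The duplicated column can be produced in several ways. The following is a minimal sketch with the standard csv module, working on CSV text; the function name add_duplicate_column is hypothetical, and the same step could equally be done on the loaded tables before fitting.

```python
import csv
import io


def add_duplicate_column(csv_text, source="section", copy="section_alt"):
    """Return `csv_text` with `copy` appended as an exact duplicate of `source`.

    Hypothetical helper for illustration; not part of SDV.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames) + [copy]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[copy] = row[source]  # the copy is used by the Unique constraint
        writer.writerow(row)
    return out.getvalue()


original = """element_id,section,rank,type
1,58964,1,label
2,58964,2,forum
3,58967,2,page
4,58967,1,book
"""
augmented = add_duplicate_column(original)
print(augmented)
```

The new column then has to be declared in the metadata as a regular (non-key) field so that the constraint can reference it.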

My new metadata:

{
  "tables": {
    "sections": {
        "fields": {
          "section_id": { "type": "id", "subtype": "integer" },
          "rank": {"type": "numerical", "subtype": "integer" },
          "elements_amount": {"type": "numerical", "subtype": "integer" }
        },
        "path": "sections-test-2.csv",
        "primary_key": "section_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank"]
          }
        ]
    },
    "elements": {
        "fields": {
          "element_id": { "type": "id", "subtype": "integer" },
          "rank": { "type": "numerical", "subtype": "integer" },
          "type": { "type": "categorical" },
          "section": {
            "type": "id",
            "subtype": "integer",
            "ref": {
              "table": "sections",
              "field": "section_id"
            }
          },
          "section_alt": { "type": "numerical", "subtype": "integer" }
        },
        "path": "elements-test-2.csv",
        "primary_key": "element_id",
        "constraints": [
          {
            "constraint": "sdv.constraints.Unique",
            "column_names": ["rank", "section_alt"]
          }
        ]
    }
  }
}
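
To confirm that the workaround behaves as described, the sampled elements table can be checked for two things: that section and section_alt agree on every row, and that no (section_alt, rank) pair repeats. The sketch below assumes the sampled table has been converted to a list of row dictionaries (with pandas, for example, via DataFrame.to_dict("records")); the function name check_workaround and the toy rows are made up for illustration and are not real SDV output.

```python
def check_workaround(records):
    """Return (mismatched_rows, duplicate_pairs) for a list of row dicts.

    mismatched_rows: rows where section and section_alt differ.
    duplicate_pairs: (section_alt, rank) pairs seen more than once.
    """
    mismatched = [r for r in records if r["section"] != r["section_alt"]]
    seen, duplicates = set(), []
    for r in records:
        key = (r["section_alt"], r["rank"])
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return mismatched, duplicates


# Hand-written rows for illustration only; the third row is deliberately
# wrong so that both checks report something.
toy_rows = [
    {"element_id": 1, "section": 10, "section_alt": 10, "rank": 1},
    {"element_id": 2, "section": 10, "section_alt": 10, "rank": 2},
    {"element_id": 3, "section": 11, "section_alt": 10, "rank": 2},
]
mismatched, duplicates = check_workaround(toy_rows)
print(len(mismatched), duplicates)
```

Both results being empty on real sampled data would indicate that the duplicated column stayed in sync with the foreign key and that ranks are unique per section.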

Suggestion

A more descriptive error message
Possibly internal handling of this case by SDV, without users needing to find a workaround

npatki commented 1 year ago

Thanks for filing @LiFaytheGoblin, we will investigate and report more info here.

For SDV developers: I think it's fine if such a constraint falls back to our reject sampling approach (instead of transform). It's strange that reject sampling is failing though. Perhaps we are doing it too early, before the foreign key is added back in?

npatki commented 1 year ago

Update: Seems like we explicitly do not support any keys (foreign or primary) in constraints at the moment.

I'll turn this into a feature request and update the title to reflect this.
