sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

`FixedCombinations` implementation may lose correlations #414

Open npatki opened 3 years ago

npatki commented 3 years ago

Problem Description

Under the hood, the `FixedCombinations` constraint concatenates the columns into a single column of unique identifiers and drops the individual columns. This satisfies the constraint, but in doing so it may lose correlations that exist between the original columns.

Expected behavior

Consider a table of users who belong to different cities and states in the US. There is a fixed combinations constraint between the city and the state.

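| User ID | City          | State | Tax Rate |
|---------|---------------|-------|----------|
| 1       | San Francisco | CA    | 7.2%     |
| 2       | Los Angeles   | CA    | 7.5%     |
| 3       | Seattle       | WA    | 2.1%     |
| 4       | Seattle       | WA    | 2.5%     |
| 5       | Spokane       | WA    | 3.1%     |
| ...     | ...           | ...   | ...      |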

There is a correlation here: CA corresponds to higher tax rates, regardless of which city in CA. The model should be able to capture this.

With `FixedCombinations`, the model never sees CA as a common feature. Instead, it treats SanFrancisco+CA and LosAngeles+CA as separate, unrelated categories. (This may be good enough for certain cases, but IMO it's missing a key input: that both locations have something in common.)
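To make this concrete, here is a minimal sketch in plain Python using only the five rows shown in the table. It illustrates the concatenation idea and is not SDV's actual implementation: the `combine_columns` and `mean_by` helpers and the `+` separator are assumptions made up for this example.

```python
from collections import defaultdict

# The five rows shown in the table above (tax rates in percent).
rows = [
    {"User ID": 1, "City": "San Francisco", "State": "CA", "Tax Rate": 7.2},
    {"User ID": 2, "City": "Los Angeles", "State": "CA", "Tax Rate": 7.5},
    {"User ID": 3, "City": "Seattle", "State": "WA", "Tax Rate": 2.1},
    {"User ID": 4, "City": "Seattle", "State": "WA", "Tax Rate": 2.5},
    {"User ID": 5, "City": "Spokane", "State": "WA", "Tax Rate": 3.1},
]


def combine_columns(rows, columns, separator="+"):
    """Hypothetical helper: replace `columns` with one concatenated key column."""
    key_name = separator.join(columns)
    combined = []
    for row in rows:
        # Drop the individual columns, as described in the issue.
        new_row = {k: v for k, v in row.items() if k not in columns}
        new_row[key_name] = separator.join(row[c] for c in columns)
        combined.append(new_row)
    return combined


def mean_by(rows, group_column, value_column):
    """Hypothetical helper: average `value_column` within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_column]].append(row[value_column])
    return {key: sum(values) / len(values) for key, values in groups.items()}


combined = combine_columns(rows, ["City", "State"])

# Grouping by State exposes the CA vs. WA difference directly.
state_means = mean_by(rows, "State", "Tax Rate")

# After concatenation there is no State column left to group by; the only
# available key yields four categories with no shared "CA" feature.
combined_means = mean_by(combined, "City+State", "Tax Rate")
```

With the original columns, a model has two state-level groups to learn from. After concatenation it only sees keys such as `San Francisco+CA` and `Los Angeles+CA`, and nothing in the data says those two share a state.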

Possible Solutions

  1. Do not drop the city and state columns when modeling. The model may synthesize some unexpected output (e.g. CA+SanFrancisco, Boston, NY), but that can be fixed later through some logic.
  2. Create a new table that identifies each City, State pair, for example with columns Location ID, City, State. Then reference that identifier (Location ID, the primary key of the new table) from the Users table.
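Both options can be sketched in plain Python. This is an illustration under stated assumptions, not a proposed SDV implementation: the function names `build_location_table` and `find_invalid_rows` are hypothetical, and for option 1 the sketch only detects invalid pairs, since the issue leaves the actual repair logic unspecified.

```python
rows = [
    {"User ID": 1, "City": "San Francisco", "State": "CA", "Tax Rate": 7.2},
    {"User ID": 2, "City": "Los Angeles", "State": "CA", "Tax Rate": 7.5},
    {"User ID": 3, "City": "Seattle", "State": "WA", "Tax Rate": 2.1},
    {"User ID": 4, "City": "Seattle", "State": "WA", "Tax Rate": 2.5},
    {"User ID": 5, "City": "Spokane", "State": "WA", "Tax Rate": 3.1},
]


def find_invalid_rows(synthetic_rows, valid_pairs):
    """Option 1 (hypothetical helper): flag rows whose (City, State) pair
    never appeared in the real data, so later logic can fix them."""
    return [
        row for row in synthetic_rows
        if (row["City"], row["State"]) not in valid_pairs
    ]


def build_location_table(rows):
    """Option 2 (hypothetical helper): split the data into a Locations table
    keyed by Location ID and a Users table that references it."""
    location_ids = {}
    for row in rows:
        pair = (row["City"], row["State"])
        if pair not in location_ids:
            location_ids[pair] = len(location_ids) + 1
    locations = [
        {"Location ID": loc_id, "City": city, "State": state}
        for (city, state), loc_id in location_ids.items()
    ]
    users = [
        {
            "User ID": row["User ID"],
            "Location ID": location_ids[(row["City"], row["State"])],
            "Tax Rate": row["Tax Rate"],
        }
        for row in rows
    ]
    return locations, users


# Option 1: a made-up synthetic row with a pair not seen in the real data.
valid_pairs = {(row["City"], row["State"]) for row in rows}
synthetic = [
    {"User ID": 101, "City": "Seattle", "State": "WA", "Tax Rate": 2.3},
    {"User ID": 102, "City": "Seattle", "State": "CA", "Tax Rate": 7.0},
]
invalid = find_invalid_rows(synthetic, valid_pairs)

# Option 2: State stays available as its own column in the Locations table.
locations, users = build_location_table(rows)
```

In option 2, only pairs present in the Locations table can ever be referenced, so the constraint holds by construction, while State remains a separate column that a multi-table model could in principle relate to Tax Rate.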

npatki commented 1 year ago

Updated title and task since the constraint has been renamed to FixedCombinations. Issue still holds.