Problem Description
Under the hood, the FixedCombinations constraint concatenates the columns to produce unique identifiers (and drops the individual columns). This solves the constraint, but in doing so, it may lose correlations that exist between original columns.
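To make the mechanism concrete, here is a minimal sketch of the concatenate-and-drop step in plain Python. It is not SDV's actual implementation: the `combine_columns` helper, the `#` separator and the `city#state` column name are assumptions chosen for illustration, and the rows are the five listed in the example table under Expected behavior.

```python
# Illustrative only: a plain-Python imitation of "concatenate the columns
# and drop the originals". The helper name and the "#" separator are
# assumptions, not SDV's API.
users = [
    {"user_id": 1, "city": "San Francisco", "state": "CA", "tax_rate": 7.2},
    {"user_id": 2, "city": "Los Angeles", "state": "CA", "tax_rate": 7.5},
    {"user_id": 3, "city": "Seattle", "state": "WA", "tax_rate": 2.1},
    {"user_id": 4, "city": "Seattle", "state": "WA", "tax_rate": 2.5},
    {"user_id": 5, "city": "Spokane", "state": "WA", "tax_rate": 3.1},
]


def combine_columns(rows, columns, sep="#"):
    """Replace `columns` with a single concatenated identifier column."""
    combined_name = sep.join(columns)
    transformed = []
    for row in rows:
        # Keep every column except the ones being combined.
        new_row = {k: v for k, v in row.items() if k not in columns}
        new_row[combined_name] = sep.join(str(row[c]) for c in columns)
        transformed.append(new_row)
    return transformed


transformed = combine_columns(users, ["city", "state"])
categories = sorted({row["city#state"] for row in transformed})
print(categories)
```

After this step the model sees only opaque categories such as `San Francisco#CA`; the `state` column, and with it the fact that two categories share `CA`, is no longer part of the data being modeled.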
Expected behavior
Consider a table of users who belong to different cities and states in the US. There is a fixed combinations constraint between the city and state columns.
| User ID | City | State | Tax Rate |
|---|---|---|---|
| 1 | San Francisco | CA | 7.2% |
| 2 | Los Angeles | CA | 7.5% |
| 3 | Seattle | WA | 2.1% |
| 4 | Seattle | WA | 2.5% |
| 5 | Spokane | WA | 3.1% |
| ... | ... | ... | ... |
There is a correlation where CA corresponds to higher tax rates (regardless of which city in CA). The model should be able to capture this.
With FixedCombinations, the model never looks at CA as a common feature. Rather, it looks at SanFrancisco+CA and LosAngeles+CA as separate, unrelated categories. (This may be good enough for certain cases, but IMO it's missing a key input: that both locations have something in common.)
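The state-level signal can be checked directly on the five rows listed in the table. The sketch below is only an illustration of what gets hidden: the `mean_by` helper is hypothetical, and the elided rows (`...`) are not included, so the numbers describe the sample only.

```python
from collections import defaultdict
from statistics import mean

# The five rows listed in the table (tax rates in percent).
users = [
    {"city": "San Francisco", "state": "CA", "tax_rate": 7.2},
    {"city": "Los Angeles", "state": "CA", "tax_rate": 7.5},
    {"city": "Seattle", "state": "WA", "tax_rate": 2.1},
    {"city": "Seattle", "state": "WA", "tax_rate": 2.5},
    {"city": "Spokane", "state": "WA", "tax_rate": 3.1},
]


def mean_by(rows, key_fn, value="tax_rate"):
    """Average `value` within each group produced by `key_fn` (hypothetical helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[key_fn(row)].append(row[value])
    return {key: mean(values) for key, values in groups.items()}


# Grouping by state alone: two groups, clearly separated.
by_state = mean_by(users, lambda r: r["state"])

# Grouping by the concatenated identifier: four unrelated categories,
# with nothing tying the two CA entries together.
by_combo = mean_by(users, lambda r: r["city"] + "+" + r["state"])

print(by_state)
print(by_combo)
```

Grouping by state yields two groups with clearly different averages, while grouping by the combined key yields four categories that a model has to learn independently.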
Possible Solutions
Do not drop the city and state columns when modeling. The model may synthesize some unexpected output (e.g. CA+SanFrancisco, Boston, NY), but that can be fixed afterwards through some post-processing logic.
Create a new table that identifies each City, State pair, e.g. with the columns Location ID, City, State. Then use that identifier (Location ID) as a primary key that the Users table references.
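Both options can be sketched in plain Python. Everything here is an assumption made for illustration rather than existing SDV functionality: `filter_valid_combinations` stands in for the "fix it later" logic of the first option, `split_locations` for the separate location table of the second, and the synthetic rows are made up.

```python
def filter_valid_combinations(synthetic_rows, real_rows, columns):
    """Option 1: keep only synthetic rows whose combination occurs in the real data."""
    valid = {tuple(row[c] for c in columns) for row in real_rows}
    return [row for row in synthetic_rows
            if tuple(row[c] for c in columns) in valid]


def split_locations(rows, columns, key="location_id"):
    """Option 2: move each unique combination into its own table, referenced by key."""
    ids = {}
    locations = []
    children = []
    for row in rows:
        combo = tuple(row[c] for c in columns)
        if combo not in ids:
            # First time this combination is seen: assign the next id.
            ids[combo] = len(ids) + 1
            locations.append({key: ids[combo], **dict(zip(columns, combo))})
        child = {k: v for k, v in row.items() if k not in columns}
        child[key] = ids[combo]
        children.append(child)
    return locations, children


real = [
    {"user_id": 1, "city": "San Francisco", "state": "CA", "tax_rate": 7.2},
    {"user_id": 2, "city": "Los Angeles", "state": "CA", "tax_rate": 7.5},
    {"user_id": 3, "city": "Seattle", "state": "WA", "tax_rate": 2.1},
    {"user_id": 4, "city": "Seattle", "state": "WA", "tax_rate": 2.5},
    {"user_id": 5, "city": "Spokane", "state": "WA", "tax_rate": 3.1},
]

# Made-up synthetic output: the second row pairs a city with the wrong state.
synthetic = [
    {"user_id": 101, "city": "Seattle", "state": "WA", "tax_rate": 2.3},
    {"user_id": 102, "city": "Seattle", "state": "CA", "tax_rate": 7.1},
]

kept = filter_valid_combinations(synthetic, real, ["city", "state"])
locations, users_fk = split_locations(real, ["city", "state"])
```

The trade-off: the first option keeps `state` visible to the model at the cost of discarding (or repairing) invalid rows afterwards, while the second guarantees valid pairs by construction, since the state lives in the location table rather than in Users.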