sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv
Other
2.37k stars 316 forks source link

Create Unique Constraint #532

Closed amontanez24 closed 3 years ago

amontanez24 commented 3 years ago

Problem Description

Sometimes, even though a column isn't a key, the values in it need to be unique in order for the data to be valid. Consider the following 2 use cases:

  1. There is a sensitive field in my data (such as SSN) that cannot be repeated. Even though I have another primary key (user_id), I need to makes sure that SSN is also unique
  2. I have a table that lists all the shoe types I sell. I have models (eg. Nike, Reebok, etc.) and sizes (6, 6.5, 7, etc.). I to make sure that each size is unique within a given model. Eg. I can't have 2 rows that are size 7.5 Reebok

In order to support this, write a Unique constraint where the user can supply both a column (whose values must be unique) and an optional list of group_by columns that control the partition that determine uniqueness. In our cases

  1. column='SSN' with group_by=None -- which means it's unique throughout the table
  2. column='size' while group_by=['model'] -- which means that within a given model, size is unique
csala commented 3 years ago

This sounds good @amontanez24 but I would suggest one change in the proposal: Instead of having a column argument, I would have a columns argument so that it is possible to ensure that the combination of multiple columns stays unique within the table.

npatki commented 3 years ago

It should also be possible for Unique and UniqueCombinations to work with each other that way.

Eg. a city, country pair is a UniqueCombination but maybe in my dataset, I don't want each pair to appear more than once -- as in, there should only be 1 row for Paris, France

csala commented 3 years ago

If Unique is implemented using reject sampling, the two should be able to be combined without issues. The only thing that will matter is the order: If Unique comes after UniqueCombinations, and if UniqueCombinations is using the transform strategy, it will fail because Unique will not find the corresponding columns. If Unique is passed before UniqueCombinations, everything should work:

from sdv.constraints import Unique, UniqueCombinations

constraints = [
    Unique(columns=['country', 'city']),
    UniqueCombinations(columns=['country', 'city']),
]
my_model = MyTabularModel(constraints=constraints)
npatki commented 3 years ago

Could we consider having a extra_combinations kwarg so that nobody runs into this case?

from sdv.constraints import Unique

cons = Unique(
    columns=['country', 'city'], # a country, city pair can only appear once
    extra_combinations=False # don't make extra combinations outside of the original data
)

my_model = MyTabularModel(constraints=[cons])
csala commented 3 years ago

I would not do it, for multiple reasons:

  1. Each Constraint should have only one purpose, otherwise it will become too complex and confusing.
  2. Since UniqueCombinations would still exist, someone who is familiar with it (and maybe not so much with Unique) could still define both Unique and UniqueCombinations in the wrong order and hit the error anyway.
  3. The UniqueCombinations implementation is already complex on its own, so having Unique cover the same functionality would introduce additional complexity that may end up in maintenance overhead and potential errors.

So, altogether, I would rather say that we should cover this in a separate issue about making the constraints robust enough (or have enough validations) so that incompatible ordering is not possible, either because using the wrong order raises a helpful error message from which the user can learn how to sort the constraints properly, or because SDV is able to internally figure out the right order and apply it.

npatki commented 3 years ago

+1 to a new issue about making sure SDV can robustly handle multiple constraints. This seems like a future project. For the current scope -- a descriptive error message would be nice.

Food for thought:

Each Constraint should have only one purpose, otherwise it will become too complex and confusing.

This isn't 100% true right now. For eg, the GreaterThan constraint can be used for two purposes (scalar comparison or column comparison) and the Between constraint is essentially a shortcut for two GreaterThan constraints.

I actually like this setup. To me, it's simple to have 1 constraint per problem being encountered. Together, the constraint need not necessarily be mutually exclusive primitives.

csala commented 3 years ago

I see that @amontanez24 already opened issue #541 to sort out the problem of combining multiple constraints that affect the same columns, so here we can stick to the single Unique purpose of ensuring unique values.

Also, it seems like the group_by argument stops being necessary once multiple columns can be specified, so we can drop it from the specification.