Closed by amontanez24, 3 years ago
This sounds good @amontanez24 but I would suggest one change in the proposal: instead of having a `column` argument, I would have a `columns` argument so that it is possible to ensure that the combination of multiple columns stays unique within the table. It should also be possible for `Unique` and `UniqueCombinations` to work with each other that way.
Eg. a `city, country` pair is a `UniqueCombinations` constraint, but maybe in my dataset I don't want each pair to appear more than once -- as in, there should only be 1 row for `Paris, France`.
If `Unique` is implemented using reject sampling, the two should be able to be combined without issues. The only thing that will matter is the order: if `Unique` comes after `UniqueCombinations`, and if `UniqueCombinations` is using the `transform` strategy, it will fail because `Unique` will not find the corresponding columns. If `Unique` is passed before `UniqueCombinations`, everything should work:
```python
from sdv.constraints import Unique, UniqueCombinations

constraints = [
    Unique(columns=['country', 'city']),
    UniqueCombinations(columns=['country', 'city']),
]
my_model = MyTabularModel(constraints=constraints)
```
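To make the reject-sampling idea concrete, here is a minimal sketch of how a `Unique` constraint could filter sampled rows until enough distinct combinations are collected. The `sample_fn` callable, the batching loop, and `max_tries` are illustrative assumptions, not SDV internals:

```python
import pandas as pd

def reject_sample_unique(sample_fn, columns, num_rows, max_tries=100):
    """Illustrative reject sampling: keep drawing batches and drop any row
    whose `columns` combination already appeared, until `num_rows` unique
    rows are collected. `sample_fn(n)` is assumed to return a DataFrame."""
    collected = sample_fn(num_rows)
    for _ in range(max_tries):
        # Keep only the first occurrence of each (country, city, ...) combination.
        collected = collected.drop_duplicates(subset=columns)
        if len(collected) >= num_rows:
            return collected.head(num_rows).reset_index(drop=True)
        # Not enough unique rows yet: draw another batch and try again.
        collected = pd.concat([collected, sample_fn(num_rows)],
                              ignore_index=True)
    raise ValueError('Could not generate enough unique rows.')
```

Because rejection only ever removes rows, this strategy composes cleanly with other constraints that run before it, which is why ordering is the only real concern.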
Could we consider having an `extra_combinations` kwarg so that nobody runs into this case?
```python
from sdv.constraints import Unique

cons = Unique(
    columns=['country', 'city'],  # a country, city pair can only appear once
    extra_combinations=False      # don't make extra combinations outside of the original data
)
my_model = MyTabularModel(constraints=[cons])
```
I would not do it, for multiple reasons:

- `UniqueCombinations` would still exist, so someone who is familiar with it (and maybe not so much with `Unique`) could still define both `Unique` and `UniqueCombinations` in the wrong order and hit the error anyway.
- The `UniqueCombinations` implementation is already complex on its own, so having `Unique` cover the same functionality would introduce additional complexity that may end up in maintenance overhead and potential errors.

So, altogether, I would rather say that we should cover this in a separate issue about making the constraints robust enough (or have enough validations) so that incompatible ordering is not possible, either because using the wrong order raises a helpful error message from which the user can learn how to sort the constraints properly, or because SDV is able to internally figure out the right order and apply it.
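The "helpful error message" option could be as simple as a pre-fit validation pass over the constraint list. This is a hypothetical sketch, not SDV's actual API; the function name and the class-name check are assumptions:

```python
def validate_constraint_order(constraints):
    """Hypothetical check: raise a helpful error if a Unique constraint
    appears after a UniqueCombinations constraint, since the transform
    strategy replaces the original columns that Unique would look for."""
    seen_combinations = False
    for constraint in constraints:
        name = type(constraint).__name__
        if name == 'UniqueCombinations':
            seen_combinations = True
        elif name == 'Unique' and seen_combinations:
            raise ValueError(
                'Unique must be listed before UniqueCombinations: the '
                'UniqueCombinations transform replaces the original columns, '
                'so Unique would not find them. Please reorder the constraints.'
            )
```

A check like this turns a confusing mid-sampling failure into an actionable message at model-definition time.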
+1 to a new issue about making sure SDV can robustly handle multiple constraints. This seems like a future project. For the current scope -- a descriptive error message would be nice.
Food for thought:

> Each Constraint should have only one purpose, otherwise it will become too complex and confusing.

This isn't 100% true right now. For eg, the `GreaterThan` constraint can be used for two purposes (scalar comparison or column comparison) and the `Between` constraint is essentially a shortcut for two `GreaterThan` constraints.
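The `Between`-as-two-`GreaterThan`s observation can be illustrated with plain pandas checks. This is a sketch of the validity logic only, not SDV's implementation, and it assumes strict inequalities:

```python
import pandas as pd

df = pd.DataFrame({'low': [0, 0], 'value': [3, 7], 'high': [5, 10]})

# Between(column='value', low='low', high='high') is conceptually the same
# validity check as GreaterThan(low='low', high='value') combined with
# GreaterThan(low='value', high='high').
between_valid = (df['low'] < df['value']) & (df['value'] < df['high'])
greater_than_pair = (df['value'] > df['low']) & (df['high'] > df['value'])

assert between_valid.equals(greater_than_pair)
```

So the "one constraint, one purpose" principle is already a guideline with convenient exceptions rather than a hard rule.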
I actually like this setup. To me, it's simple to have 1 constraint per problem being encountered; taken together, the constraints need not necessarily be mutually exclusive primitives.
I see that @amontanez24 already opened issue #541 to sort out the problem of combining multiple constraints that affect the same columns, so here we can stick to the single `Unique` purpose of ensuring unique values.
Also, it seems like the `group_by` argument stops being necessary once multiple `columns` can be specified, so we can drop it from the specification.
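The equivalence behind dropping `group_by` can be shown with a small pandas sketch (illustrative validity logic, not SDV code): "unique within each group" is the same check as "the group-plus-value combination is unique table-wide".

```python
import pandas as pd

data = pd.DataFrame({
    'model': ['Nike', 'Nike', 'Reebok', 'Nike'],
    'size':  [7.0,    7.5,    7.0,      7.0],   # last row repeats Nike 7.0
})

# column='size' with group_by=['model']: size must be unique within each model.
is_valid_group_by = ~data.groupby('model')['size'].transform(
    lambda s: s.duplicated())

# columns=['model', 'size']: the combination must be unique table-wide.
is_valid_columns = ~data.duplicated(subset=['model', 'size'])

# Both formulations flag exactly the same rows (here, only the last one).
assert list(is_valid_group_by) == list(is_valid_columns)
```

This is why a multi-column `columns` argument subsumes the `group_by` option.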
Problem Description

Sometimes, even though a column isn't a key, the values in it need to be unique in order for the data to be valid. Consider the following 2 use cases:

1. Even though my table has a primary key (eg. `user_id`), I need to make sure that `SSN` is also unique.
2. I have a table of `models` (eg. Nike, Reebok, etc.) and `sizes` (6, 6.5, 7, etc.). I need to make sure that each size is unique within a given model. Eg. I can't have 2 rows that are size 7.5 Reebok.

In order to support this, write a `Unique` constraint where the user can supply both a `column` (whose values must be unique) and an optional list of `group_by` columns that control the partition that determines uniqueness. In our cases:

1. `column='SSN'` with `group_by=None` -- which means it's unique throughout the table
2. `column='size'` while `group_by=['model']` -- which means that within a given model, size is unique
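The two validity checks the spec describes can be sketched with plain pandas (illustrative only; `check_unique` is a hypothetical helper, not part of SDV):

```python
import pandas as pd

def check_unique(data, column, group_by=None):
    """Return True when `column` is unique, either table-wide
    (group_by=None) or within each group_by partition."""
    subset = [column] if group_by is None else list(group_by) + [column]
    return not data.duplicated(subset=subset).any()

users = pd.DataFrame({'user_id': [1, 2, 3], 'SSN': ['a', 'b', 'b']})
shoes = pd.DataFrame({'model': ['Nike', 'Reebok'], 'size': [7.5, 7.5]})

print(check_unique(users, 'SSN'))                       # False: SSN 'b' repeats
print(check_unique(shoes, 'size', group_by=['model']))  # True: 7.5 repeats only across models
```

The second case shows why `group_by` matters in the spec: size 7.5 appears twice, but never twice within the same model, so the data is still valid.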