inbo / whip

✅ Human and machine-readable syntax to express specifications for data
MIT License
7 stars 0 forks source link

Feature request: Checks for multi-column data consistency #23

Open bobular opened 6 years ago

bobular commented 6 years ago

As suggested here https://twitter.com/LifeWatchINBO/status/1042363580107182080

You might have two columns "Country" and "Country Code" (not very well normalised, I know!) - and you might want to check that only one country code is present for each unique country? Or perhaps the number of unique Country+Country-code combinations should be the same as the number of unique countries and also the number of unique country-codes.

My original Twitter example was like this: "Site Name", "Latitude", "Longitude" You might want to validate that for each site, only one lat/long combination exists.

We have other examples (we use ISA-Tab format a lot) where all rows containing a specific value in column C (for example) must have exactly the same values in all rightward columns (D,E,F, etc). It's the same concept as the lat/long example though.

Great project, following with interest!

peterdesmet commented 6 years ago

Thanks for the suggestion! Checking for uniqueness (across all rows or based on the values of another column) is currently difficult to implement in pywhip as the tool uses a row based approach for validation (allowing for data to be read line by line). That means it is possible to compare between columns within a single row (if A contains X, B should contain Y), but not on uniqueness, which is a property of a column across (a subset of) rows.

One approach we are considering is to make whip compliant with table-schema, which has some predefined constraints, including unique. If we can not only make whip compliant, but also support a constraint like unique, we can envision something like:

countryCode:
  if:
    - country:
        unique: True # If country is a unique subset ...
      unique: True   # ... then countryCode should be unique for the same subset

But I don't think it's going to be for the near future. In any case, I'm leaving this issue open as a suggestion for now.

stijnvanhoey commented 5 years ago

At the ICEI conference 2018 a similar suggestion has been proposed.