qri-io / qri


Request(schema): Add custom validator for nullable fields #1299

Open feep opened 4 years ago

feep commented 4 years ago

@hbruch on Discord:

...to me it seems that the JSON schema in structure.json needs some additional convention when it is used to describe a CSV schema. As null values are usually expressed as empty strings or some special string like e.g. "None", I resorted to using anyOf rules (see e.g. https://github.com/qri-io/jsonschema/issues/62), which don't lend themselves to deducing nullable columns.
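The anyOf workaround can be sketched in plain Python (a simplified stand-in, not qri's Go jsonschema implementation): a cell passes if it matches any subschema, so "None" has to be whitelisted as a string enum, and nothing in the resulting schema says "this column is nullable".

```python
# Simplified sketch of the anyOf workaround for nullable CSV columns.
# Not qri's actual validator; subschema handling is reduced to the
# two cases the workaround needs.
def matches(subschema, value):
    """Check one subschema against a single CSV cell (simplified)."""
    if subschema["type"] == "integer":
        return isinstance(value, int)
    if subschema["type"] == "string":
        enum = subschema.get("enum")
        return isinstance(value, str) and (enum is None or value in enum)
    return False

nullable_int = {"anyOf": [{"type": "integer"},
                          {"type": "string", "enum": ["None"]}]}

def valid(schema, value):
    # anyOf: the value is valid if at least one subschema accepts it.
    return any(matches(s, value) for s in schema["anyOf"])

print(valid(nullable_int, 7))       # True
print(valid(nullable_int, "None"))  # True
print(valid(nullable_int, "foo"))   # False
```

Note that a tool reading this schema sees only "integer or the literal string None", which is why nullability can't be deduced from it.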

Discussion follows on Discord. I’ll try to include some relevant bits.

feep commented 4 years ago

@Arqu on discord:

...we should soon be in a position to design some custom schema keywords and behaviors. With that we'd have a set of Qri specific validators to help out. Among others we could do "AllowNullable": true which would at least get you the desired behavior of safely ignoring null/none/nil type of issues on specific fields/columns/conditions.
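Purely as a hypothetical sketch of what such a Qri-specific keyword might look like in structure.json (the keyword name and placement come from the comment above; nothing like this is implemented):

```
{
 "title": "field_3",
 "type": "integer",
 "AllowNullable": true
}
```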

feep commented 4 years ago

@Arqu on Discord:

There's a ton of type-nerdy stuff that simply forces the hand on some of the issues.

I don’t think this is the case.

Sure, you have no control over jsonschema and what it can do.

But you have total control over structure.json. Use it to make the default UX better.

If someone wants more control over nullable fields, they can edit the inferred structure.json.

When Qri gains nullable support, it should add nullable to the inferred structure for .csvs in structure.json.

Qri already makes some assumptions for a better UX. This adds one more. One that I would say is even less presumptive than inferring "integer" or "string".

As far as I know, .csv has zero concept of NULL. Why should Qri, by default, be stricter?

❯ qri version
0.9.8

❯ qri init
Name of new dataset [scratchx]: 
Format of dataset, csv or json [csv]: 
initialized working directory for new dataset feep/scratchx

❯ subl body.csv

❯ # added some empty fields

❯ cat body.csv
one,two,3
four,five,6
,seven,

❯ qri save
for linked dataset [feep/scratchx]

dataset saved: feep/scratchx@/ipfs/QmfUR89V99nkbC1H5u66Tbzi6HaNRSUPWNkjdWGRz7rQki

❯ # in the future, this generated structure can have nullable by default. =]

❯ cat structure.json
{
 "format": "csv",
 "formatConfig": {
  "lazyQuotes": true
 },
 "qri": "st:0",
 "schema": {
  "items": {
   "items": [
    {
     "title": "field_1",
     "type": "string"
    },
    {
     "title": "field_2",
     "type": "string"
    },
    {
     "title": "field_3",
     "type": "integer"
    }
   ],
   "type": "array"
  },
  "type": "array"
 }
}

❯ qri validate
for linked dataset [feep/scratchx]

0: /2/2: "" type should be integer
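The failure above can be reproduced with a pure-Python stand-in (not qri's actual validator): every CSV cell arrives as a string, and an empty cell can't satisfy the inferred "integer" type for column 2.

```python
# Sketch of why `qri validate` flags row 2, column 2 of body.csv:
# an empty CSV cell fails the inferred "integer" type.
import csv
import io

body = "one,two,3\nfour,five,6\n,seven,\n"
types = ["string", "string", "integer"]  # types inferred in structure.json

errors = []
for row_i, row in enumerate(csv.reader(io.StringIO(body))):
    for col_i, (cell, t) in enumerate(zip(row, types)):
        # A cell like "" or "None" is not a digit string, so it fails.
        if t == "integer" and not cell.lstrip("-").isdigit():
            errors.append(f'/{row_i}/{col_i}: "{cell}" type should be integer')

print(errors)  # ['/2/2: "" type should be integer']
```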
feep commented 4 years ago

A bit snipped from a story

It shows what the UX is like for someone who doesn’t care about the difference between NULL and "".

This is what you have to learn as someone who doesn’t care about what a NULL is, but only wants to avoid noisy validation errors.

Dense, complicated learning. And decisions about unimportant things, mostly to avoid "None" type should be... errors.

And it is incomplete. The anyOf bits haven’t been written yet. More dense learning.

If "" were nullable in the inferred structure, this could be avoided.

Or, the time spent in structure.json could be spent on something much more useful like validating that latitude or longitude (or something else critical) are within a valid range.
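A range check like that is a minimum/maximum constraint in JSON Schema terms; here is a stand-in sketch for a latitude column (the column name and bounds are illustrative, not from any qri dataset):

```python
# Sketch: the effort spent silencing "None" noise could instead go to a
# genuinely useful check, e.g. validating that latitude is in range.
lat_schema = {"type": "number", "minimum": -90, "maximum": 90}

def in_range(schema, value):
    # Valid if the value is numeric and within [minimum, maximum].
    return (isinstance(value, (int, float))
            and schema["minimum"] <= value <= schema["maximum"])

print(in_range(lat_schema, 37.77))   # True
print(in_range(lat_schema, 137.77))  # False
```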

Story follows:

Structure: structure.json

structure.json is a JSON Schema document used to describe and validate body.csv.

Cut out some copy about field names and other. Not relevant to this issue.

5. 111 errors?

That sounds like a lot of errors. If we do qri validate, every error is a "None" type should be ... something besides None.

❯ qri validate  
for linked dataset [feep/ca_state_parks]
0: /32/48: "None" type should be integer  
1: /32/49: "None" type should be integer  
...
111: /xx/xx something something TK rerun

The error line format is Error number: /record(line)/field.

We don’t have any control over this dataset. We don’t even really know what None means in the data. What should we do about it?

Three possibilities:

1: Let Qri flag the errors and ignore them.

We know the "None" type errors exist. If we ignore them without reading, how will we see if there is another kind of error?

2: Convert the None fields to something else, empty or 0.

We don’t know what None means, so how can we know what to convert it to?

3: Tell Qri that None is allowed.

This seems the best option to me. Also, it lets me demonstrate a feature of structure.json. To do this, we want our structure.json to allow None for all the fields. In JSON Schema, that looks something like this (for an integer):

# JSON
{
 "title": "field_3",
 "anyOf": [
  {"type": "integer"},
  {"type": "string", "enum": ["None"]}
 ]
}

You can see a larger example structure.json snippet and the Starlark code that was used to generate it in 4, above.

How’d we do?

TK show clean validation
❯ qri validate  

That’s much better. If Qri finds a different error in the future, we will see it instead of losing it in the noise.

Arqu commented 4 years ago

As far as I know, .csv has zero concept of NULL. Why should Qri, by default, be stricter?

This is essentially the root of all the subsequent behavior quirks. To elaborate more on the type-nerdy stuff, the biggest issue is the many transitions data and schemas make between CSV and JSON through Go. What happens is that 'null' type values get defaulted to 'empty' kinds of values.
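The "null collapses to an empty value" problem can be sketched in Python (a stand-in for the typed Go pipeline described above; the function is illustrative, not qri code):

```python
# Sketch of null-less typed coercion: with no null concept in the
# pipeline, a missing cell defaults to the type's zero value, silently
# erasing the distinction between "missing" and "actually 0 / empty".
def coerce(cell, typ):
    """Stand-in for a typed csv -> json transition with no null."""
    if typ == "integer":
        return int(cell) if cell else 0  # a null cell collapses to 0
    return cell                          # "" and NULL become identical

print(coerce("", "integer"))  # 0  -- was this a real 0 or a missing value?
print(coerce("", "string"))   # '' -- same ambiguity for strings
```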

The AllowNullable keyword, or however it ends up being called, would be a pretty nice band-aid, but the proper solution would require a much deeper discussion of what NULL even means in the context of our datasets and how to unify all the behavior around it.

Also, thanks a lot for the writeup above; I think this will be super useful for others until we have either of the two things implemented.