Append-Only Dataset Validation

b5 commented 6 years ago

What is it

appendOnly as an inter-snapshot validation check that rejects any attempt to update a dataset with changes that modify data that existed in the prior snapshot. If appendOnly has been true for all snapshots of a dataset, the dataset is truly append only.

Why we need it

@ajbouh has requested append-only capacity for datasets, citing append-only as an important rule for Machine Learning use cases. We've encountered many other use cases where append-only is a necessity, so I think it makes sense to do it now.

Why it's wrong and bad

Chucking appendOnly in as a struct field is a shortsighted footgun. We should properly plan this feature out, looking for other edit restrictions & structure this as a configuration instance in a broader category of inter-snapshot validation.

If anyone has time or energy to invest in this better solution, I'm interested. But we have neither time nor energy, and append-only is such a frequently-cited use that I'm a fan of having it as an explicit flag that doesn't require me to read through, like, all the documentation. Deadline to object to us using this footgun is 2018-04-13. Comment on this issue to prevent footgunning.

How we gonna do it

Any dataset writes must pass through dsfs.CreateDataset. CreateDataset rejects when validate.Dataset returns an error. So we need a few things:

a way to signal to validate.Dataset that it should validate that a change is append-only
a validator that actually checks for append-only-ness
a code path that calls that validator in validate.Dataset
a well-written error to send back when things are bad
tesssssssts

Given that it's validation oriented, the natural place to put the appendOnly flag is in dataset.Structure.Schema. The field itself is a jsonschema, and jsonschema doesn't support this kind of validation out of the box. There is some tangentially-related discussion regarding changing jsonschema for off-label uses here: https://github.com/json-schema-org/json-schema-spec/issues/309

Thankfully, our jsonschema package comes with support for custom validators. So let's implement this as a custom jsonschema validator, registered under the appendOnly keyword that accepts a bool value. The validator should use a factory function pattern to generate validator instances that check against an existing dataset. Put another way, on each call to dataset.Validate, we need to generate a new appendOnly validator from the dataset we were given, because the check has to happen in context of a snapshot. This is the nasty part of this job, but will be worth it. In the future we can use this technique to do all sorts of inter-snapshot checks, and be able to control performance characteristics on the spot.

So this is also going to require a PR on jsonschema. I'll file that issue later. Basic gist is we need to change this to use some sort of decoder-generator pattern instead of a hardcoded DefaultValidator thingie.

Side Effects

after this PR, any time we validate a dataset, it must call validate.Dataset, because that's where we're loading this custom stuff, and I don't want this logic repeated all over the place
this is going to cause some performance slowdowns, because dataset.Validate is now going to need to be able to load the previous dataset's data to do validation. I'm thinking we'll compensate for this by transforming dataset.PrevPath to be a *dataset.Dataset that can carry an in-memory reference. We should work to make sure these slowdowns only trigger when appendOnly is in play.

As per our Q2 goals, I'm not allowed to ship this issue. So someone else is going to have to step up 😄

b5 commented 6 years ago

bonus points if we land benchmarks for dataset.Validate in this PR

ajbouh commented 6 years ago

Thanks for the thorough consideration, @b5!

I should point out that performance-wise, my alternative to a qri-specific approach would be pretty fast. It's just: demand that the first N bytes of the current dataset have the same checksum that the previous version had, where N is the length of the previous dataset in bytes. This places a few restrictions on data layout and things like that, but is overall fairly easy to implement, optimize, and understand.

b5 commented 6 years ago

Ah yes, I should be clear that the performance hit I'm referring to here comes from data overfetching our architecture currently implies. I'm hoping to tackle after we land the change in future versions, just wanted to make clear in the issue that I'm fine with a short-term slowdown to land this feature

qri-io / dataset