Open · b5 opened 6 years ago
bonus points if we land benchmarks for `dataset.Validate` in this PR
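For reference, such a benchmark would use Go's standard `testing` harness. The sketch below is self-contained: `validateDataset` is a hypothetical stand-in for the real `validate.Dataset` entry point, and it uses `testing.Benchmark` so it can run outside a `_test.go` file.

```go
package main

import (
	"bytes"
	"fmt"
	"testing"
)

// validateDataset is a stand-in for validate.Dataset: here it just scans
// the body for a forbidden NUL byte, enough to exercise the harness.
func validateDataset(body []byte) error {
	for _, c := range body {
		if c == 0 {
			return fmt.Errorf("invalid NUL byte in body")
		}
	}
	return nil
}

func main() {
	// ~1 MB of well-formed CSV-ish bytes to validate per iteration.
	body := bytes.Repeat([]byte("1,2\n"), 1<<18)

	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			if err := validateDataset(body); err != nil {
				b.Fatal(err)
			}
		}
	})
	fmt.Println(res.N > 0) // true: the benchmark ran at least one iteration
}
```

In the real PR this would live in a `_test.go` file as a `BenchmarkValidate(b *testing.B)` function driven by `go test -bench`.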
Thanks for the thorough consideration, @b5!
I should point out that performance-wise, my alternative to a qri-specific approach would be pretty fast. It's just: demand that the first N bytes of the current dataset have the same checksum that the previous version had, where N is the length of the previous dataset in bytes. This places a few restrictions on data layout and things like that, but is overall fairly easy to implement, optimize, and understand.
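The checksum-prefix rule described above can be sketched in a few lines of Go. This is an illustration of the idea, not qri code; the function names and the choice of SHA-256 are assumptions.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

// prefixChecksum returns the SHA-256 digest of the first n bytes of data.
// Hypothetical helper; name and hash algorithm are illustrative.
func prefixChecksum(data []byte, n int) ([]byte, error) {
	if n > len(data) {
		return nil, fmt.Errorf("data shorter than previous version (%d < %d bytes)", len(data), n)
	}
	sum := sha256.Sum256(data[:n])
	return sum[:], nil
}

// isAppendOnly reports whether next preserves prev as an exact byte prefix,
// by comparing checksums over the first len(prev) bytes.
func isAppendOnly(prev, next []byte) bool {
	want := sha256.Sum256(prev)
	got, err := prefixChecksum(next, len(prev))
	if err != nil {
		return false
	}
	return bytes.Equal(want[:], got)
}

func main() {
	prev := []byte("a,b\n1,2\n")
	appended := []byte("a,b\n1,2\n3,4\n") // rows added at the end
	mutated := []byte("a,b\n9,2\n3,4\n")  // existing row changed

	fmt.Println(isAppendOnly(prev, appended)) // true
	fmt.Println(isAppendOnly(prev, mutated))  // false
}
```

Note the layout restriction the comment mentions: this only works when appends are strictly byte-level, e.g. rows added at the end of a CSV; any re-encoding or reordering of earlier bytes fails the check.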
Ah yes, I should be clear that the performance hit I'm referring to here comes from the data overfetching our architecture currently implies. I'm hoping to tackle that in future versions, after we land the change; I just wanted to make clear in the issue that I'm fine with a short-term slowdown to land this feature.
What is it
Add `appendOnly` as an inter-snapshot validation check that rejects any attempt to update a dataset with changes that modify data that existed in the prior snapshot. If `appendOnly` has been true for all snapshots of a dataset, the dataset is truly append-only.

Why we need it
@ajbouh has requested append-only capability for datasets, citing append-only as an important rule for Machine Learning use cases. We've encountered many other use cases where append-only is a necessity, so I think it makes sense to do it now.
Why it's wrong and bad
Chucking `appendOnly` in as a struct field is a shortsighted footgun. We should properly plan this feature out, looking for other edit restrictions & structuring this as a configuration instance in a broader category of inter-snapshot validation.

If anyone has time or energy to invest in this better solution, I'm interested. But we have neither time nor energy, and append-only is such a frequently cited use case that I'm a fan of having it as an explicit flag that doesn't require me to read through, like, all the documentation. Deadline to object to us using this footgun is 2018-04-13. Comment on this issue to prevent footgunning.
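For illustration, the "broader category of inter-snapshot validation" alluded to above could be modeled as a pluggable rule set, where append-only is just one rule among many. Every name below is hypothetical, not part of the qri codebase:

```go
package main

import "fmt"

// Snapshot is a stand-in for a dataset version.
type Snapshot struct {
	Body []byte
}

// InterSnapshotCheck is one rule in a configurable set of checks that
// compare a new snapshot against the prior one.
type InterSnapshotCheck interface {
	Check(prev, next *Snapshot) error
}

// AppendOnly is one such rule: prior data must survive unmodified as a
// prefix of the new body.
type AppendOnly struct{}

func (AppendOnly) Check(prev, next *Snapshot) error {
	if len(next.Body) < len(prev.Body) || string(next.Body[:len(prev.Body)]) != string(prev.Body) {
		return fmt.Errorf("appendOnly: prior snapshot data was modified")
	}
	return nil
}

func main() {
	// A dataset's config would select which rules apply to it.
	checks := []InterSnapshotCheck{AppendOnly{}}

	prev := &Snapshot{Body: []byte("1,2\n")}
	next := &Snapshot{Body: []byte("1,2\n3,4\n")}
	for _, c := range checks {
		if err := c.Check(prev, next); err != nil {
			fmt.Println("rejected:", err)
			return
		}
	}
	fmt.Println("accepted") // prints "accepted": rows were only appended
}
```

Other edit restrictions (schema-frozen, row-count-monotonic, etc.) would slot into the same `checks` list, which is roughly what "a configuration instance in a broader category" buys over a lone boolean field.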
How we gonna do it
Any dataset writes must pass through `dsfs.CreateDataset`. `CreateDataset` rejects when `validate.Dataset` returns an error. So we need a few things:

Given that it's validation-oriented, the natural place to put the `appendOnly` flag is in `dataset.Structure.Schema`. The field itself is a jsonschema, and jsonschema doesn't support this kind of validation out of the box. There is some tangentially related discussion about changing jsonschema for off-label uses here: https://github.com/json-schema-org/json-schema-spec/issues/309

Thankfully, our jsonschema package comes with support for custom validators. So let's implement this as a custom jsonschema validator, registered under the `appendOnly` keyword, that accepts a bool value. The validator should use a factory-function pattern to generate validator instances that check against an existing dataset. Put another way, on each call to `dataset.Validate` we need to generate a new `appendOnly` validator from the dataset we were given, because the check has to happen in the context of a snapshot. This is the nasty part of this job, but it will be worth it. In the future we can use this technique to do all sorts of inter-snapshot checks, and be able to control performance characteristics on the spot.

So this is also going to require a PR on jsonschema. I'll file that issue later. The basic gist is that we need to change this to use some sort of decoder-generator pattern instead of a hardcoded `DefaultValidator` thingie.

Side Effects
This will likely require changing `dataset.PrevPath` to be a `*dataset.Dataset` that can carry an in-memory reference. We should work to make sure these slowdowns only trigger when `appendOnly` is in play.

As per our Q2 goals, I'm not allowed to ship this issue, so someone else is going to have to step up 😄
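The validator-factory approach described under "How we gonna do it" might look roughly like the toy version below. This is a sketch with a made-up keyword registry; none of these names (`Validator`, `ValidatorMaker`, `RegisterKeyword`, `validateSnapshot`) come from the real jsonschema package, and the real integration would hang off the schema decoder instead of a global map.

```go
package main

import "fmt"

// Validator checks a candidate body against one schema keyword.
type Validator interface {
	Validate(data []byte) error
}

// ValidatorMaker builds a fresh Validator per Validate call, letting a
// keyword capture snapshot context (here, the previous dataset body).
// This is the factory-function pattern the issue describes.
type ValidatorMaker func(prevBody []byte) Validator

var registry = map[string]ValidatorMaker{}

// RegisterKeyword installs a maker under a schema keyword.
func RegisterKeyword(keyword string, mk ValidatorMaker) { registry[keyword] = mk }

// appendOnlyValidator rejects bodies that don't keep prev as an exact prefix.
type appendOnlyValidator struct{ prev []byte }

func (v appendOnlyValidator) Validate(data []byte) error {
	if len(data) < len(v.prev) || string(data[:len(v.prev)]) != string(v.prev) {
		return fmt.Errorf("appendOnly: prior snapshot data was modified")
	}
	return nil
}

// validateSnapshot builds a per-snapshot validator for every registered
// keyword and applies it to the candidate body.
func validateSnapshot(prevBody, nextBody []byte) error {
	for kw, mk := range registry {
		if err := mk(prevBody).Validate(nextBody); err != nil {
			return fmt.Errorf("%s failed: %w", kw, err)
		}
	}
	return nil
}

func main() {
	RegisterKeyword("appendOnly", func(prev []byte) Validator {
		return appendOnlyValidator{prev: prev}
	})
	fmt.Println(validateSnapshot([]byte("1,2\n"), []byte("1,2\n3,4\n"))) // <nil>
	fmt.Println(validateSnapshot([]byte("1,2\n"), []byte("9,2\n")))     // non-nil error
}
```

The key point is that validators are constructed per call rather than decoded once and reused, which is why a hardcoded default-validator table doesn't fit and a decoder-generator change upstream would be needed.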