talis / tripod-php

Object Graph Mapper for managing RDF data in Mongo
MIT License

Allow views and tables to self-update to the latest config #101

Open scaleupcto opened 8 years ago

scaleupcto commented 8 years ago

At the moment, if a table or view specification changes, it is a manual process to re-run generation and upgrade data to the latest spec.

Instead, allow tripod to automatically upgrade data when it encounters data not generated with the latest spec.

It is acceptable that this is an eventually consistent feature, seeing as views and tables are generally considered eventually consistent.

scaleupcto commented 8 years ago

Candidate approach:

Each view and table row should have a new meta field, _specHash. This is a hash of the specification that was used to generate the data.

There will be two strategies for updating tables and views:

1) As data is read, the _specHash is compared to the current hash of the config. If it does not match, the data is scheduled for regeneration.

2) A background process queries the database periodically to detect stale data. This is required because of filters/counts applied to table row sets that might never actually be read otherwise.
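A minimal sketch of the two strategies, written in Python purely for illustration. The row shape, the hash strings, and the helper names (`is_stale`, `read_row`, `sweep`) are assumptions, not Tripod's API; a plain list stands in for whatever queue is eventually used.

```python
def is_stale(row, current_hash):
    """A row is stale when its recorded _specHash differs from the current one."""
    return row.get("_specHash") != current_hash


def read_row(row, current_hash, regen_queue):
    """Strategy 1: serve the row as-is, but schedule it if it is stale."""
    if is_stale(row, current_hash) and row["_id"] not in regen_queue:
        regen_queue.append(row["_id"])
    return row


def sweep(rows, current_hash, regen_queue):
    """Strategy 2: periodic scan that catches rows nobody ever reads."""
    for row in rows:
        if is_stale(row, current_hash) and row["_id"] not in regen_queue:
            regen_queue.append(row["_id"])


# Placeholder hash values; a real hash would be derived from the spec itself.
current = "hash-of-current-spec"
rows = [
    {"_id": "r1", "_specHash": current},
    {"_id": "r2", "_specHash": "hash-of-old-spec"},
    {"_id": "r3", "_specHash": "hash-of-old-spec"},
]

queue = []
read_row(rows[1], current, queue)  # r2 is read while stale, so it is scheduled
sweep(rows, current, queue)        # the sweep finds r3, which was never read
```

Neither path regenerates anything inline: both only record work to be done, so a read is never slowed down by a regeneration.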

Finally, any regenerations required should go to a specific queue and be done in the background so we can effectively throttle the rate of self-update.
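One way the dedicated queue could throttle self-update is to de-duplicate scheduled ids and process a bounded batch per worker tick. The sketch below is in-memory Python under those assumptions; `RegenQueue`, `schedule`, and `drain` are hypothetical names, and a real deployment would presumably sit on an existing job queue.

```python
from collections import deque


class RegenQueue:
    """Hypothetical dedicated queue for regeneration jobs."""

    def __init__(self):
        self._pending = deque()
        self._waiting = set()

    def schedule(self, row_id):
        # Skip ids that are already waiting, so a frequently read row
        # is regenerated once rather than once per read.
        if row_id not in self._waiting:
            self._waiting.add(row_id)
            self._pending.append(row_id)

    def drain(self, max_jobs, regenerate):
        """Process at most max_jobs entries; intended to run once per tick."""
        done = 0
        while self._pending and done < max_jobs:
            row_id = self._pending.popleft()
            self._waiting.discard(row_id)
            regenerate(row_id)
            done += 1
        return done

    def __len__(self):
        return len(self._pending)


queue = RegenQueue()
for row_id in ["r1", "r2", "r1", "r3"]:
    queue.schedule(row_id)

regenerated = []
processed = queue.drain(max_jobs=2, regenerate=regenerated.append)
```

The `max_jobs` cap is the throttle: lowering it slows the self-update, and anything not processed simply waits for the next tick.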

scaleupcto commented 8 years ago

One thing we need to consider is how to migrate from a large data set with no _specHashs at all. Rather than blindly regenerating everything, perhaps we can have a script that adds the current config's spec hash before enabling the self-updating functionality.
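A rough sketch of such a backfill script, using in-memory dictionaries in place of the real collection. It assumes the existing data was generated with the current spec; `backfill_spec_hash` is a hypothetical name. Against MongoDB this would presumably be a single update filtered on `_specHash` not existing.

```python
def backfill_spec_hash(rows, current_hash):
    """Stamp rows that have no _specHash; leave already-stamped rows alone."""
    stamped = 0
    for row in rows:
        if "_specHash" not in row:
            row["_specHash"] = current_hash
            stamped += 1
    return stamped


rows = [
    {"_id": "r1"},
    {"_id": "r2", "_specHash": "hash-of-old-spec"},
    {"_id": "r3"},
]
stamped = backfill_spec_hash(rows, "hash-of-current-spec")
```

Rows that already carry a hash are left untouched, so the script is safe to re-run.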

rsinger commented 8 years ago

Theoretically, in the absence of the hash, Tripod could generate a hash and it should be the same.

junglebarry commented 8 years ago

Do we need to use hashing? An alternative might be to persist versions of the table spec, and treat it as immutable, then each view/tablerow could contain a reference to the specific version of the tablespec used to generate it. It might be more traceable if we followed that approach, and would permit easier rollbacks.

A just-in-time migration would be possible, and would, I think, avoid the problem of "no _specHashs at all". Assuming no migration, when the code is first deployed:

No recalculation needs to happen at this point, as the hash/version can be considered equal (i.e. null === null).

When a tablespec is updated, we will add the version/hash. Now, as rows/views are fetched, we can tell that they are out of date: the tablespec's non-null hash/version will not equal the row/view's null hash/version.
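The just-in-time behaviour described above comes down to one comparison in which a missing value is a legitimate value. A small Python illustration, with `None` standing in for null and `needs_regeneration` as a hypothetical helper:

```python
def needs_regeneration(row_version, spec_version):
    """Compare identifiers directly; a missing (None) value is a valid value."""
    return row_version != spec_version


# First deploy: neither the spec nor the rows carry a version yet.
at_first_deploy = needs_regeneration(None, None)

# After the tablespec is updated it gains a version; old rows still have none.
after_spec_update = needs_regeneration(None, "v1")

# Once a row is regenerated it carries the spec's version.
after_regeneration = needs_regeneration("v1", "v1")
```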

junglebarry commented 8 years ago

I've conflated two issues. We could use hashes and immutable versions - it's just the version identifier. I just thought the (small) expense of hashing seemed unnecessary if a simple version numbering scheme would work. Is there a strong case for hashing as opposed to atomic counters? I suppose atomic counters come with consistency problems, whereas hashes do not...

scaleupcto commented 8 years ago

Historically, the reason we haven't stored specs in the database is that you would need to read all the specs from the database before you could actually do anything. And local caching isn't good enough, especially if we are hashing versions, because it causes chaos during the stale-cache window.

Consider a piece of frequently read data - one node has a stale spec cache and regenerates specs to an old version whilst another has fresh spec and puts it back again - the nodes all play tennis until they are operating on the same spec.

So the alternative would be either always read the specs on each request (expensive) or look at some kind of coordinated distributed config service such as Zookeeper (extra complexity).

My gut tells me that although ultimately something designed for this problem like Zookeeper is the right way to solve this, it's not something we should introduce lightly and with a major feature change like this at the same time.

You also complicate the release process - how will you co-ordinate the updates to the spec in the DB with the rollout of new code (that relies on it) to N nodes?

So, in summary, my reasons for keeping the specs on disk with the app code they relate to, at the moment, are:

(a) It's fast to calculate: something like an md5 of the spec on disk is quick and reliable enough to then compare with what is in the returned data.

(b) It requires no more database hits than today, so read/write performance shouldn't be impacted.

(c) There are no more moving parts than today. The complication is in automating the sync process so that the risk to running code, and of side effects, is low.

(d) You don't complicate a release. Because the spec lives with the distro, releasing the distro is the only step needed to upgrade or downgrade. Otherwise you have to sync a release with a database update of the spec, and seeing as releases happen in flight, you can never rely on the spec in the DB and the code being in sync.

scaleupcto commented 8 years ago

On hashing and atomic counters - I prefer hashing. Three reasons -

1) the fact that the spec is different matters, not that it is a newer/older version. This seems closer to hashing. To atomically increment everyone needs to be in agreement what the old version was. Typical distributed system hard problem. Unless you need it, avoid. See point 1.

2) Format changes don't increment the hash. For example converting the spec to a php array and hashing that removes a change like whitespace addition or changes from single to double quotes from the equation and means a hash of two semantically equivalent specs with different formatting is the same hash.

3) Don't have to rely on someone deciding that a new version is a new version (i.e. a dev). The system decides (via the hash) what a new version is. This automates away the human and is good.
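Point 2 can be illustrated by hashing the parsed spec rather than the raw file text. The sketch below uses Python and JSON as stand-ins for the PHP array conversion mentioned above. The spec contents are invented, and sorting keys is an extra assumption that also makes key order irrelevant.

```python
import hashlib
import json


def spec_hash(spec_text):
    """Parse the spec, re-serialise it canonically, then hash the result."""
    spec = json.loads(spec_text)
    # Fixed separators and sorted keys remove whitespace and key-order noise.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


compact = '{"_id":"t_users","fields":["name","email"]}'

# Same spec with different whitespace and key order.
spaced = """
{
    "fields": ["name", "email"],
    "_id": "t_users"
}
"""

# A real change: one extra field.
changed = '{"_id":"t_users","fields":["name","email","phone"]}'

same_after_reformat = spec_hash(compact) == spec_hash(spaced)
differs_after_change = spec_hash(compact) != spec_hash(changed)
```

Here `same_after_reformat` and `differs_after_change` are both true: reformatting the file leaves the hash alone, while a change to the spec's content produces a new one.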

junglebarry commented 8 years ago

I agree with most of the above.

I have a question, though: Is there only one instance of the code running at once?

Consider, for example, that you have two machines running the code, and due to a failed release - or just a slow rolling release - they reach a state where the spec files have different hashes. Couldn't you get to a state where each machine thinks it holds the "right" hash, and they enter a mutually-recursive loop to "correct" hashes generated by the other machine?

In this case, order provides a natural resolution to the problem, where difference alone does not. That does not magically resolve the other problems of ordering you rightly raise; it just trades some off against others. I'm assuming that the risk of having such a scenario is considered low or impossible, compared to what might arise through versioning (I agree: that seems to have more opportunity to fail).
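The scenario can be simulated with two nodes that alternately read the same row. This is a toy Python model, not Tripod code; `_specVersion` and both helper functions are hypothetical, and the ordered variant simply assumes the nodes already agree on version numbers, which is the hard part raised earlier in the thread.

```python
def serve_read(row, node_hash):
    """Hash comparison: regenerate whenever the row differs from this node's spec."""
    if row["_specHash"] != node_hash:
        row["_specHash"] = node_hash
        return True
    return False


def serve_read_ordered(row, node_version):
    """Ordered versions: a node only ever upgrades a row, never downgrades it."""
    if row["_specVersion"] < node_version:
        row["_specVersion"] = node_version
        return True
    return False


# Two nodes mid-release, alternately serving reads of the same row.
hashed_row = {"_specHash": "old-hash"}
hash_regens = sum(
    serve_read(hashed_row, node_hash)
    for _ in range(5)
    for node_hash in ("old-hash", "new-hash")
)

versioned_row = {"_specVersion": 1}
ordered_regens = sum(
    serve_read_ordered(versioned_row, node_version)
    for _ in range(5)
    for node_version in (1, 2)
)
```

Over ten alternating reads the hash comparison regenerates the row nine times, while the ordered comparison regenerates it once and then settles.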

However, if it's a possibility we should consider it, at least so we know what problem scenarios might arise on distribution.

scaleupcto commented 8 years ago

Absolutely true. I had not considered that scenario.

scaleupcto commented 8 years ago

In progress => https://github.com/talis/tripod-php/pull/102