RFC - DataObject ID Decoupling

dhensby commented 6 years ago

Proposal to remove dependency on auto-incrementing Database driven IDs

Motivations

Remove race conditions on inserts / ID dependencies Inserting and pulling out auto-incrementing IDs can cause race conditions when using distributed architecture, especially with distributed databases
Pre-determined IDs before writing At the moment, to get an ID you have to write the item to the DB first, this is bad practice as all Models should have an ID at all times; this can also allow us to simplify "unsaved relations" because we'll be able to set a relation on an ID that's not been written yet, allowing us to depend on eventual consistency.
Security: Don't implicitly expose existing IDs Incrementing IDs are guessable and easy to determine if an ID will exist. As IDs are often used in URLs, they should not be easily guessable
Better database compatibility We decouple from DB integrations and allow people to define their own ID scheme; our Model shouldn't depend upon the database to work

Impact

Have to either upgrade existing IDs or work with both numeric and UUIDs
Have to fix any is_numeric logic from ID validation / make ID validation a bit more abstract
Stop ordering items by ID to get the inserted order (use Created instead)
Modify isInDB() / exists() methods to not just assume a model exists if it's got an ID (either we need to ping the DB or we need to set a flag on loading from DB / writing) - Perhaps replace isInDB() with isPersisted()?

Approach

Create an "ID Generator" interface This would have the responsibility of generating and validation IDs so that this is abstracted away and we no longer need to do is_numeric($this->ID) to validate IDs.
Create ID Generators By default we'd have a UUID generator but we could also have a "legacy" generator. This could be a hybrid that accepts numeric IDs and UUIDs ~~or just a numeric auto-incrementing ID generator~~.

robbieaverill commented 6 years ago

I like the idea for a few reasons:

a UUID will always be unique, which is good for highly concurrent writes
we can stop having to write models in order to get the ID for it (then do something and write it again later)

One concern would be that SilverStripe URLs will become much longer and less human friendly

dhensby commented 6 years ago

One concern would be that SilverStripe URLs will become much longer and less human friendly

Well, an integer is almost as meaningless to a human as a UUID, but URLs in the CMS will become longer, but that's an aesthetics issue and UUIDs are not there for aesthetics, they are there for uniqueness.

I have also realised that we can't actually write an incrementing ID generator in the application layer because we would still need to write a row to the database otherwise we'd always generate the same ID on multiple instantiations. The solution would either be to write the object on construct (a massive overhead) or not support auto-incrementing IDs.

This will have to be SS5 change

robbieaverill commented 6 years ago

Yeah - I guess in the frontend you wouldn't be encouraging the use of IDs in URLs anyway, e.g. blog and CMS pages etc which uses a URL segment

tractorcow commented 6 years ago

Your ID generator could simply be a table that stores an auto-incrementing ID, hypothetically, if the user really wanted to force integers.

ScopeyNZ commented 6 years ago

In PostgreSQL you don't even need the table, just the sequence 🙂 .

lekoala commented 6 years ago

If I may add 3 little things:

For urls (or usage in html), I really like having base62 encoded ids. It gives something like 6a630O1jrtMjCrQDyG3D3O which is not too bad in terms of aesthetics
I also really like the Stripe approach, where all objects have a 3 letters prefix, like usr_someid, pag_someid... It can help with uniqueness and simplify checks as well.
There is also some performance consideration to take into account (ordered ids, binary storage...)

ps: if need be, I've already a working uuid extension for silverstripe 4 here https://github.com/lekoala/silverstripe-uuid

sminnee commented 6 years ago

I can see that UUIDs are useful in some circumstances, and totally support us providing support for them.

However, I dispute whether auto-incrementing IDs are legacy. I would still see auto-incrementing IDs as an effective tool in most cases that don't include a multi-master database.

Pre-determined IDs before writing At the moment, to get an ID you have to write the item to the DB first, this is bad practice as all Models should have an ID at all times; this can also allow us to simplify "unsaved relations" because we'll be able to set a relation on an ID that's not been written yet, allowing us to depend on eventual consistency.

We have to be extremely careful here that we don't introduce race conditions with this. For example, generating an auto-incrementing ID on the PHP side would be extremely bad practise. Generating a UUID on the PHP side is okay because the chance of collision is very small. Other random identifiers might be acceptably generated on the PHP side, although you would probably need to have a special "ID collision" error case.

So the whole first-write-gets-ID approach will stay with us regardless, unless we drop numeric IDs altogether. You could arguably have the sequence generated separately from the initial record, although in simpler cases – i.e. where you are simply writing a record with no related data – this will slow things down, as you'll now have a separate sequence write and record write. Perhaps these can be optimised?

Better database compatibility We decouple from DB integrations and allow people to define their own ID scheme; our Model shouldn't depend upon the database to work

I think this is a very small part of such an abstraction and if this is really a goal I would suggest we design that first – a shift from Active Record to more of a Data Mapper. I'm moderately supportive of this: I think it's a great idea I just don't know if it's the most important Big Refactor to be doing. I would support anything that dramatically speeds up GraphQL, tree fetches, or permission checks; maybe it overlaps with that. ;-)

sminnee commented 6 years ago

Modify isInDB() / exists() methods to not just assume a model exists if it's got an ID.

This is pretty straightforward; it can be a boolean field that is:

False on new construction
True on construction from DB content
Set to true on write.

The hardest thing is that currently we don't clearly distinguish a construction-from-db-content, although that's only hard because it requires a small API change.

dhensby commented 6 years ago

@sminnee Whilst I appreciate what you're saying about numeric IDs (they are effective for non-distributed DBs) the point is that our Model is still (needlessly) coupled to the DB layer in this instance. It also means we continue to have the other problems of guessable IDs (which could prove to be a security vulnerability for applications because providing sequential numeric IDs means it's very easy to determine which IDs do (or have) existed in the past if you know a current valid ID).

I also would not like to support both IDs that can be generated in either the DB side or the application side because this means we'll run a different DB schema depending on the setting of the application (we'd have to switch the column type and the AUTO_INCREMENT setting on the ID column), it would also make it impractical to scale a site (ie: move it from single server to distributed server infrastructure) as it'd require a complete remapping of IDs on every single table.

sminnee commented 6 years ago

OK, so in effect we'd drop simple numeric ID generation altogether, and probably shift ID storage to binary(16) fields. The main thing that I'd want to know is whether our developer community is on board with such a shift. It feels a bit like namespacing in that it would be an "eat your vegies" change for a lot of people — seen by some as complicating the developer experience, but acknowledged that it's necessary. However, I'm not sure whether shifting to UUIDs is as necessary as shifting to namespaces, and I'm not sure how developers would respond to ORM that says "no incrementing IDs, only UUIDs" – it might come across as very heavyweight.

If the goal is decoupling the model from the database, then I would be inclined to do that in a more complete manner with a datamapper style abstraction, instead of the current active record. This would allow for both UUID-based and AUTOINCREMENT-based repositories, if we wanted, as this responsibility would lie within the mapper. It would also make DataObject more of a convenience for predefining persistence metadata rather than a base class of everything that can be persisted – it would be easier to persist Plain Old PHP Objects if needs be.

If the goal is to allow for non-incrementing IDs for better multi-master support and security, then I would be more inclined to go with an implementation that supported both auto-increment (MySQL) / sequence (PGSQL) based IDs and PHP-generated UUIDs, and probably by default generating the UUID on first write as we do for auto-increment IDs, to leave that semantics more closely matching the current implementation. A method like initPrimaryKey() could be provided that would be a lot quicker on a UUID implementation but would also work with auto-increment keys. I'm not sure about the speed of UUID generation but it's also not clear if there are downsides to generating them for in-memory "scratchpad" objects.

dhensby commented 6 years ago

The main thing that I'd want to know is whether our developer community is on board with such a shift.

Yes, I'd like to know too; though it may be one of those things that only really bothers / affects those that have hit its limitations.

I'm not sure how much developers really notice what ID an object has; how often are IDs hard coded into logic and (if they are) is that a good practice and justification for making them "readable".

As for moving to a complete abstraction with a data-mapper, yes, that would be good, but what I'm looking to do is, over the course of a decent period of time, lay the groundwork that is required for that rather than trying to do it all in one go. There seems to be a little bit of incremental working going into the framework that is using #6632 as it's motivation.

I recently outlined a few "blockers" that would stop us even getting onto that (one of which is a datamapper/proper abstraction layer between our objects and the persistence layer).

tl;dr I'm out to make incremental improvements to a higher end goal over a single massive change that will ship all as one. Part of that is because of time constraints but part of it is also because one huge PR like that is very difficult to keep up to date with the velocity that the framework has at the moment. As long as each individual change is an improvement then that should be fine.

dhensby commented 6 years ago

PS: I'm trying to get sign off for the individual areas I'll need to address to make #6632 a reality, a datamapper is a step change ahead of where I'd be starting when laying the groundwork for it

sminnee commented 6 years ago

That being the case I would probably do this, which should be 4-appropriate change.

Abstract ID management into some kind of service that can be set per type, as well as having a project default
Provide an explicit DataObject::initPrimaryKey() method, whose API expectations allows for the possibility of a write(), that is always called before the first write but may be called earlier by project code if desired.
There will be a bunch of other methods to figure out for the service. It will need to augmentDatabase() in some way, as well as validate IDs, assign IDs to a DataObject, etc
Note that assigning an ID to a DataObject will probably need to be void assignIdTo($dataObject) rather than getIdFor($dataObject) to allow for an auto-incrementing option.
Build auto-incrementing and UUID-based implementations of this. Set auto-incrementing as the default. Provide docs for switching to UUID for some models in yaml, for custom models in PHP, or for a whole project.
Refactor SS4 internal code to use the new service, so that switching to UUID doesn't break anything.

This will basically create the situation where we can support UUIDs without dropping auto-increments, in SS4, which will be a good way of determining how feasible it is to go UUID-only in the future.

NightJar commented 6 years ago

A conversation around this has just developed in https://github.com/dnadesign/silverstripe-elemental/pull/429 - in the context of writing (assumed) IDs into unit tests, in the manner that we know how the system works and have control over the data per test case. This works well, although does litter the unit test with magic numbers a little.

The case was brought up that assuming IDs is bad in that if the ID were to ever change (e.g. this RFC), that would mean there is needless refactoring within the test. My response was as follows:

However I'd like to also point out that while I can agree that one should code to an abstraction rather than an implementation, in order to switch the IDs out to e.g. UUID there would need to be much more work than simply altering these tests - all ID's are currently typed as Integers, and GraphQL is strongly typed. It is not simply tests that would break.

E.g. the GraphQL around this currently expects an Integer as an ID. Ok so if we change that by having the core update the integer type globally for all GraphQL operations... but then what to do about the case of the invalid ID 0 currently being used to denote "at the start" in this particular mutation? (as opposed to the default undefined being "at the end") Things are never quite so clear cut, and while the burden of maintenance can be lessened, it'll never be gone.

The crux of it is that changing to UUID (strings) would be far more burdensome than simply causing some fast and fancy-free coded tests to fail. IDs are intended to be an abstraction, and for the most part are throughout the application suite. However there are a number of cases where they're not... strongly typed GraphQL is one that immediately springs to mind... and I'm not sure I'd label the code within as "technical debt" in that it's not poor code as a assumptive test is - although it would immediately become such upon a conversion to UUID.

Not to say it can't be done (I support it), but there is a much larger consideration to be had than simply the coupling of the model and storage layers (and the storage layer particularly to MySQL over others).

sminnee commented 6 years ago

Some specific ideas about implementation (for discussion / refutation):

Ids are provided by a service implementing an SilverStripe\ORM\IdProvider interface
DataObject looks up SilverStripe\ORM\IdProvider.<class> from the service injector, failing over to SilverStripe\ORM\IdProvider. would be the base data class.
Core comes with an AutoIncrementIdProvider, a UuidIdProvider and possibly a SequenceIdProvider for database like PGSQL that handle it that way
4.x uses AutoIncrementIdProvider by default. Instruction for switching all/some tables to UuidIdProvider are available.
Possibility that UUID=1 is included as an extra build configuration in travis?

I'm not sure what methods the IdProvider should expose; I would suggest that this is figured out by the people implementing AutoIncrementIdProvider and UuidIdProvider, ensuring that it covers the needs of both these providers.

sminnee commented 6 years ago

@silverstripe/core-team esp @dhensby what do you think of my suggestions in the last couple of comments?

flamerohr commented 6 years ago

I like the idea of an IdProvider, this could help abstract ID handling cleanly.

GraphQL has a specific ID type for IDs, and is treated as a String - a non-readable String, rather than INT. So I think switching UUID should be trivial for this part. I can see the switch to UUID will need a bit of refactor in react component prop-type expectations though, I can find a few places where it expects a "number" but it's really an "id"

sminnee commented 6 years ago

Good reminder about the front-end impact of any such change, Chris :-) 👍

robbieaverill commented 6 years ago

Sounds pretty good to me! I think it’d be worth having a mini vote on the class and namespace naming though, it has the potential so have a lot of Providers in it

sminnee commented 6 years ago

Yeah potentially there's an IdProvider sub-namespace? One thing I wondered is whether the interface should be in the parent namespace, though, mainly for brevity, but also because I wince every time I write use SilverStripe\Core\Injector\Injector. ;-)

dhensby commented 6 years ago

Thanks for all the thoughts and feedback, especially with regards to frontend components that could be making assumptions about ID types.

It's not clear to me how our application could keep track of an auto-incrementing ID counter across process threads unless we had some kind of dedicated persistence storage for the ID increment-er, which is really gross and undermines the objective of being able to run SS projects on distributed hardware.

I'm wondering if this is an area where we should just be opinionated and provide UUIDs without an ability to easily override it (just like we are currently opinionated) but being opinionated about UUIDs just means we are able to decouple our ID generation from the persistence layer

sminnee commented 6 years ago

I think there's some confusion here: the AutoIncrementIdProvider should behave as close to the current behaviour as possible, should be the default, and these two facts mean that this can be introduced in 4.x.

Specifically, I'd imagine something like:

AutoIncrementIdProvider::augmentTable() would add an autoincrementing ID to the table
AutoIncrementIdProvider::augmentSQL() would select "ID"
AutoIncrementIdProvider::generateId($dataObject) would perform an initial write on the dataobject, this possibility would be stated in the interface spec.

Conversely:

UuidIdProvider::augmentTable() would add a binary(16) to the table
AutoIncrementIdProvider::augmentSQL() would select something like HEX("ID") AS "ID"
AutoIncrementIdProvider::generateId($dataObject) would generate a UUID without a DB write

It's probable that the final code pushes more responsibility to relevant DBField implementations, so we'd have a DBUuid class that UuidIdProvider makes the most use of.

To respond to your comments:

I'm wondering if this is an area where we should just be opinionated and provide UUIDs without an ability to easily override it.

I disagree with this for a number of reasons:

It's not clear that moving to UUIDs are in all or even most users' interests. It may be true but "betting the farm" on this is clearly a weaker approach than allowing it as an option and seeing the results of that experiment.
We'd have to embargo this feature until 5, which delays it by a couple of years & creates a lot more risk about getting the system right the first time.
"Being opinionated" as a justification less weight when we're in the midst of changing our opinions. ;)

Unless we had some kind of dedicated persistence storage for the ID increment-er

We've got a tried & true solution for this: autoincrementing columns. No need to change this.

which is really gross and undermines the objective of being able to run SS projects on distributed hardware.

SilverStripe has run for more than a decade without UUIDs, so any argument that continuing to support the current behaviour is going to be impractical is on shaky ground.

My expectation would be that those teams running a distributed database would be strong recommended to use the UuidIdProvider, as would anyone who doesn't want to carry the pre-emptive write of the AutoIncrementIdProvider. UuidIdProvider would therefore have advantages, as well as the disadvantages of longer, less meainingful IDs.

By the time we're releasing SS5, we can get some real data about how popular the UUID provider has been and what kind of issues people have run into making use of it. Then we can decide whether to:

Make UUID default
Deprecated AutoIncrement
Delete AutoIncrement entirely (which should probably be preceded by deprecating it in a future 4.x release)

madmatt commented 5 years ago

I strongly support the ability to have both if we have to do anything, with UUIDs being opt-in (at least for now).

UUIDs definitely have a place, but I don't think the majority of SilverStripe sites would be helped much by them, and they can make things more complicated too (for example it basically becomes impossible to look at the database directly and scan for new records or trace one DB write across multiple tables - IDs are memorable, UUIDs aren't).

@sminnee: In your proposal above I'm not sure how you see $has_one relationships being handled - would you expect separate columns e.g. ParentID and ParentUUID? Having some objects using UUIDs and others using Ints would be challenging, and would be a massive pain to ever switch (e.g. using extensions to change the Member table to using UUIDs - you'd have to re-write every single link to Member to use UUIDs everywhere.

chillu commented 5 years ago

I'm with Sam here: I don't think it's a very important refactor, because the group of devs getting excited about this would be very small - teams which run large write heavy multi-master setups who are currently experiencing performance or locking issues. If it's a required step on a DBAL refactor, and on the path to reinventing less wheels in SilverStripe over time, that's more interesting ;)

I'm a bit concerned on making this behaviour type specific, and allowing different approaches in the same database. While I agree that the data mapper pattern is a great way to avoid one-size-fits-all databases, in terms of identifiers the types aren't isolated. Types link to other types via foreign keys, and I'd find it weird to have a numeric primary key and a UUID foreign key in the same row?

What's the migration path for this? At minimum, it requires updating every row in the database, and relying on a somewhat shaky ORM level referential integrity on identifier rewrites (foreign keys etc). Then there's shortcodes, and a million other cases where identifiers are coded into different stores - for examples things like https://github.com/phptek/silverstripe-jsontext which will need special treatment. It's all solveable, but we should be realistic about work vs. payoff here.

lekoala commented 5 years ago

Maybe there is no need to replace IDs by UUIDs, but simply have both? this way, migration is easy, and you can still have pre determined ID's and all other benefits. Also performance might be a good point : using plain int is always going to be faster than UUIDs.

patricknelson commented 5 years ago

Potentially stating the obvious after reading everything above, but:

Modify isInDB() / exists() methods to not just assume a model exists if it's got an ID (either we need to ping the DB or we need to set a flag on loading from DB / writing) - Perhaps replace isInDB() with isPersisted()?

This may also be a good place to mention that ->exists() should point to ->isInDB() (or ->isPersisted() if refactored, now documented in issue #9352) on the root DataObject for consistency as well, since ->exists() really does rely on the same underlying logic anyway, regardless of what that implementation ultimately entails. Preferably this would happen in 4.x but I notice this is one of the few places both methods are mentioned together in an open/active issue 😄

That said, a quick point on that API nomenclature: It's also worth noting that we should probably disambiguate the meaning of ->isInDB() since I think one of the issues tied up in the multi-column index in #6819 (fixed in #8831) was the fact that the ID was always written even with a particular value (e.g. 0 if it was new for that record) which may be fine on the base record but ->isInDB() on a descendant instance $child->isInDB() in memory mid-write operation will return true even though ChildObject's table hasn't actually been written to (and thus not actually entirely in the DB), which makes it confusing.

Maybe this could mean refactoring ->isInDB() (or ->isPersisted()) to determining that a unique ID has been generated, e.g. ->hasUniqueID() or simply ->hasID()? 🤔 That would also help folks like me reading the fixes like the one in #8831 understand why that fix actually worked with a slightly more semantically accurate API. I think it would also mesh well with @sminnee's point about moving toward Data Mapper / GraphQL friendly data persistence, since in abstract, we're more concerned that we now have a unique ID to work with rather than exactly how it gets persisted (i.e. the more specific but less accurate naming of "Is it in DB yet? Ok then I must have an ID now..." expectation, which may not necessarily hold depending on your underlying DBAL or how you're generating ID's).

patricknelson commented 4 years ago

FYI: Added separate issue (#9352) to address my concern from the previous comment above regarding the redundancy of ->exists() and ->isInDB().

silverstripe / silverstripe-framework