keystonejs / keystone-classic

Node.js CMS and web app framework
http://v4.keystonejs.com
MIT License

Updates / createItems Improvements #990

Open webteckie opened 9 years ago

webteckie commented 9 years ago

I'm going to open this issue to jot down all requirements with the hope of implementing an updates / createItems flow that meets various use cases in a way that maintains data consistency and integrity. Feel free to add on if you have other updates/createItems requirements.

Updates/createItems should resolve relationship dependencies across updates files

updates-1.js:

Thing-1: [
    { name: "cheese pizza", __ref: "item_cheese_pizza" }    // item-1
]

updates-2.js:

Thing-2: [
    { "item-1_ref": "item_cheese_pizza" }
]

NOTE: one possible approach is to pass the refs lookup into createItems, so that the updates module itself keeps track of the refs lookup across updates files.
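
A minimal sketch of that approach, assuming createItems accepted a shared refs map via its options (a proposed addition, not the current behaviour, where each update file starts with a fresh lookup):

var refs = {};
keystone.createItems(require('./updates-1.js'), { refs: refs }, function (err) {
  if (err) throw err;
  // refs captured while applying updates-1 remain visible here
  keystone.createItems(require('./updates-2.js'), { refs: refs }, function (err) {
    if (err) throw err;
  });
});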

Updates/createItems should resolve relationship dependencies from DB

updates-1.js:

Thing-1: [
    { name: "cheese pizza" }    // item-1; no __ref this time
]

updates-2.js:

Thing-2: [
    { "item-1_ref": "cheese-pizza" }    // resolved against the DB by slug
]

NOTE: here a developer would need to ensure that Thing-1 uses slugs, and that Thing-1::item-1 will indeed resolve to a slug value of "cheese-pizza".

Updates/createItems should validate / detect errors/integrity issues before committing any items to the DB.

NOTE: one possible approach is to do a first pass over createItems with a "save = false" argument that prevents items from getting saved. To address requirement 2 above, a slugs lookup array would also need to be kept across updates(?).
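
A sketch of how that dry-run could be driven ({ save: false } is the proposed option, not an existing createItems flag):

function runUpdate(data, callback) {
  // first pass: validate everything, write nothing
  keystone.createItems(data, { save: false }, function (err) {
    if (err) return callback(err); // integrity issue found; nothing saved
    // second pass: data is clean, commit for real
    keystone.createItems(data, callback);
  });
}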

Updates/createItems should do one atomic save per item. Currently, because it does 2 passes over the items, it does one save and then, when it resolves relationships, it does another save of the same document. However, there is no guarantee that the first save did not mutate the document via a pre/post save hook. If that happens, the 2nd save may run into issues, since pre/post save hooks may assume one atomic save operation when going about their logic.

NOTE: I have issues with this!

estilles commented 9 years ago

Comments and Suggestions

These are very interesting suggestions. Here are my two cents on each of them ...

Updates/createItems should resolve relationship dependencies across updates files

I completely agree with this proposed enhancement. However, instead of the suggested approach of passing all the discovered __refs to createItems(), I propose creating a singleton module/service that would cache the __refs and make them available upon request. The service would have methods like .add(), .get(), and .exists() to save, retrieve, and verify the existence of a __ref.
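
A minimal sketch of such a service, relying on Node's module cache for singleton behaviour (the file name and method set are illustrative):

// ref-cache.js
var refs = {};

module.exports = {
  add: function (ref, id) {
    refs[ref] = id;
  },
  get: function (ref) {
    return refs[ref];
  },
  exists: function (ref) {
    return refs.hasOwnProperty(ref);
  }
};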

Updates/createItems should resolve relationship dependencies from DB

I like this feature, but if implemented it should be optional, to make it backwards compatible. For example, we could keep __ref for literal references and add a new __refFrom option specifying the field from which the reference would be retrieved. It would be very important to document that the __refFrom field MUST be unique. Exceptions should be raised if the field doesn't exist or if it is not unique.

We could also slugify the __refFrom as suggested by @webteckie.
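
For illustration, one hypothetical shape the option could take (the nested-object syntax here is invented for this example; only __ref and the __refFrom idea come from the proposal above):

updates-2.js:

Thing-2: [
    {
        // proposed: resolve the relationship by querying Thing-1
        // for the item whose (unique) slug equals "cheese-pizza"
        "item-1_ref": { __refFrom: "slug", value: "cheese-pizza" }
    }
]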

Updates/createItems should validate / detect errors/integrity issues before committing any items to the DB.

This can be easily implemented just as described by @webteckie in his writeup.

Updates/createItems should do one atomic save per item.

On the one hand, I understand the reasoning behind the two-phased approach implemented by @JedWatson, which gathers all __ref data during the create-items phase, then links any relationships found during the link-items phase. This approach is simple, elegant, and functional.

On the other hand, I understand @webteckie's frustration when trying to post-process saved documents and having to deal with the unresolved relationships from the first phase. (We should have done a better job of documenting idiosyncrasies such as this.)

I have actually encountered this in the past. In my case, since I was aware of the dual-pass approach used by createItems(), my .post('save') middleware simply included logic to defer any action until after the relationships had been resolved.
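
Something along these lines (a sketch; Post, author, and doPostProcessing are made-up names):

Post.schema.post('save', function () {
  if (!this.author) {
    // first createItems pass: relationship not resolved yet
    return;
  }
  // second (link) pass: relationship populated, safe to act on it
  doPostProcessing(this);
});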

That said, we could easily change createItems() to use either a multi-pass or a recursive algorithm that would save the items as their __ref dependencies are resolved. In other words, the first pass would save any items that have no dependencies; subsequent passes would save items whose relationship dependencies were resolved on prior passes, until either all items are saved or the only items left have relationships that were never resolved.
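
In rough code, the multi-pass idea might look like this (a sketch only: unresolvedRefs() and resolveRefs() stand in for whatever dependency bookkeeping the real implementation would use, and refCache is a cache like the one sketched above):

var async = require('async');

function saveInPasses(items, refCache, done) {
  var pending = items.slice();
  (function pass() {
    // items whose every __ref dependency is already resolved
    var ready = pending.filter(function (item) {
      return item.unresolvedRefs().every(refCache.exists);
    });
    if (ready.length === 0) {
      // either everything is saved, or the leftovers have
      // unresolvable (possibly circular) references
      return done(pending.length ? new Error('unresolved __refs') : null);
    }
    async.each(ready, function (item, next) {
      item.resolveRefs(refCache); // swap ref strings for real ids
      item.save(function (err) {
        if (!err && item.__ref) refCache.add(item.__ref, item.id);
        next(err);
      });
    }, function (err) {
      if (err) return done(err);
      pending = pending.filter(function (item) {
        return ready.indexOf(item) === -1;
      });
      pass();
    });
  })();
}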

While the approach I described above is feasible, it does not account for possible circular relationship references. @JedWatson's two-phased approach does.

Given that the problems created by the current implementation can easily be solved with logic that is aware of it, I'm not sure it's worth the trouble.

Summary

| Issue | Possible Solution | Should Implement |
| --- | --- | --- |
| Resolve __ref across updates | Implement a __ref caching service | Yes |
| Resolve __ref from database | Implement __refFrom option | Yes |
| Detect errors prior to saving items | Implement { save: false } option | Yes |
| Single atomic save | Refactor createItems() into a multi-pass or recursive algorithm | Not yet convinced |

webteckie commented 9 years ago

@JohnnyEstilles:

Regarding the first one: isn't that overkill, since the updates process is very short-lived? But I don't have an issue with it :-)

Regarding the second one, yes, it should be optional.

Regarding the last one: yeah, there's no question the current solution is elegant, but it does have pitfalls! I still have not been able to come up with a good pre/post-save hook solution for my case. And when a lot of people may run into obscure issues like that, the solution should be revisited to try to avoid them.

estilles commented 9 years ago

@webteckie:

> Regarding the first one: isn't that overkill, since the updates process is very short-lived? But I don't have an issue with it :-)

I meant an in-memory cache. I just want to avoid passing the __refs back and forth between updates.js and createItems.js. IMHO, that's not overkill. Plus, I'm a fan of modularity. It makes unit testing so much easier. :-)

> Regarding the last one: yeah, there's no question the current solution is elegant, but it does have pitfalls! I still have not been able to come up with a good pre/post-save hook solution for my case. And when a lot of people may run into obscure issues like that, the solution should be revisited to try to avoid them.

Just because I'm not totally convinced doesn't mean it's not a good idea. I'm sure we can come up with a solution that is backwards compatible and doesn't introduce any other problems (e.g., the inability to handle circular relationships). If we do, I'm in.

estilles commented 9 years ago

@webteckie:

In your case, do the pre-save/post-save hooks fire twice for items that need relationships populated? (i.e. once with the relationship undefined and a second time with the relationship populated)

webteckie commented 9 years ago

@JohnnyEstilles yeah, that's precisely the issue; they fire twice!

webteckie commented 9 years ago

@JohnnyEstilles and, yeah, ok, I see your point about the __ref cache. Sounds good to me :-)

estilles commented 9 years ago

@webteckie, just out of curiosity, what would be the effect of ignoring the pre-save/post-save event when the relationship field is undefined?

I'm not trying to convince you. I'm just trying to figure out if that's a viable solution until we come up with a more permanent one.

webteckie commented 9 years ago

@JohnnyEstilles the thing is that, based on the overall status of the model instance, I need to do some stuff with those relationship references! Since the first save is already messing around with the state of the model, the 2nd save does not work as expected!

SophieV commented 9 years ago

What's the status of this discussion?

I am using separate update files to seed the database. Specifying a single (existing) reference was easy; specifying a list of existing references was not.

Based on the createItems.js file, I figured out that I could use the following syntax for a single reference (associatedDoc comes from a mongoose query that retrieves a document inserted by another update file):

keystone.createItems({
  Thing: [{
    myRelationshipField: { id: associatedDoc._id },
    // ...
  }]
});

however, it got tricky when dealing with an array of related documents. I ended up having to pass an options object to createItems(data, options, callback). Its nested structure looks like this (thingRef here is a custom value that just needs to match in data and in options):

keystoneRefOptions['refs'][Thing.key][thingRef] = { id: doc._id };
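
Spelled out, that workaround looks roughly like this (a reconstruction from the description above, using the same names):

var keystoneRefOptions = { refs: {} };
keystoneRefOptions.refs[Thing.key] = {};
keystoneRefOptions.refs[Thing.key][thingRef] = { id: doc._id };

keystone.createItems(data, keystoneRefOptions, callback);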

It would have been great to be able to specify the array directly, in the following way:

keystone.createItems({
  Thing: [{
    myRelationshipField: [{ id: associatedDoc1._id }, { id: associatedDoc2._id }],
    // ...
  }]
});

Is it being considered? Am I missing a more straightforward way? Thanks!

webteckie commented 9 years ago

@JohnnyEstilles @JedWatson should we try to do some of these improvements? In my case, I ended up keeping my own versions of updates/createItems, which I call myself. Keystone is flexible in this, so it isn't that high a priority to code something that still may not meet everyone's needs. But we can try to improve what's there, and perhaps allow options to control the desired behavior. I know Jed also wanted to separate this into its own module. If you are still up for this, then perhaps one of you with collaborator rights should do the legwork of creating the separate module from the existing files, and then we can improve from there.

estilles commented 8 years ago

@webteckie ... I've been looking into this. I'll update the issue soon.

creynders commented 8 years ago

Maybe I'm missing something here, but why exactly is a 2nd save necessary? Just create all the items on the first pass, store them in a map by id, resolve the dependencies, and then iterate over all the items in the map and save them one by one.

webteckie commented 8 years ago

@JohnnyEstilles good to hear...and welcome back :-) @creynders IMO, the difficulty here is processing the relationships (I think they rely on the mongo _id, which I think you only get by saving the list items in the first place...thus the two save passes). The double save does cause complications, in that lists may do things to the items in their pre/post save hooks. For example, if a pre/post save hook increments some sort of counter, then that counter will be incremented twice instead of just once, as would normally happen.

creynders commented 8 years ago

@webteckie that's a misunderstanding. If you create a document with new, it immediately has an id. I do it in my update scripts all the time. I.e. there's no need for a 2nd save.
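
This is standard Mongoose behaviour; _id is generated when the document is constructed, not when it is saved:

var user = new User({ name: 'John Cleese' });
console.log(user._id); // already a valid ObjectId, before any save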

creynders commented 8 years ago

Here, for instance: https://github.com/d-pac/d-pac.cms/blob/db39bbecd71a2c8b7ceb8836320d16a7c28f3c78/batch/copyAssessment.js#L28 I create an Assessment, but need its id to create other things that have a relationship to it. Everything gets pushed into a save queue and saved one by one.

estilles commented 8 years ago

@creynders ... You're absolutely correct. However, the original issue that @webteckie brought up was the fact that, at the moment, we can't reference created items across multiple update files. Each update file is processed individually, and references to any items created are lost across updates.

webteckie commented 8 years ago

@creynders ok, I stand corrected on my assumption that it was mongo that allocated the IDs. So I'm not really sure of the real reason @JedWatson decided on 2 passes :-) If that gets fixed, it would also fix the issues I was having with double saves in the pre/post save hooks. @JohnnyEstilles are you thinking of the in-memory cache approach to resolve cross-file update references? If so, what if an updates file relies on data already posted to the DB some time back (not in the current update set)? Would it make sense to, instead, create an updates collection that contains persistent metadata (a __ref to _id mapping) about updates, which can then be used by future updates? You may have an elegant solution already...just throwing things out there :-)
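
A minimal sketch of what that persistent mapping could look like (hypothetical; no such collection exists in Keystone today):

var mongoose = require('mongoose');

// one document per __ref ever created by an update run
var UpdateRefSchema = new mongoose.Schema({
  ref: { type: String, required: true, unique: true },
  list: { type: String, required: true },  // e.g. 'Thing-1'
  itemId: { type: mongoose.Schema.Types.ObjectId, required: true }
});

module.exports = mongoose.model('UpdateRef', UpdateRefSchema);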

creynders commented 8 years ago

@JohnnyEstilles I see. The best solution would be to have a consistent way to convert a ref value to an id, I think. All the other approaches sound very brittle and error-prone, IMO.

estilles commented 8 years ago

> The best solution would be to have a consistent way to convert a ref value to an id, I think. All the other approaches sound very brittle and error-prone, IMO.

All the solutions we're proposing (including the existing implementation) basically create a map/dictionary to convert the ref into an id (once the id is available). The intention is not to change that (I'm not exactly sure how that would make any of the proposed solutions brittle/error-prone). What we're proposing is some way to batch the processing of the updates so that refs from one update are available to succeeding updates, preferably in a single process (@JedWatson outlined the issues with the updates running in different processes).

Now, @webteckie threw another wrench into the mix, suggesting we should also be able to back-reference refs from items added by updates on previous runs. I personally don't have any use for that feature at the moment, but I can definitely see its usefulness.

My intention was to develop an independent (Keystone-agnostic) module that could be imported and used by Keystone (or anything else, for that matter).

creynders commented 8 years ago

@JohnnyEstilles yeah, I was talking about the proposed approaches to back-referencing seeming error-prone to me; they all come down to trying to reproduce what's already there: an (enforced) unique, indexed identifier. Which is why I think we should try the opposite approach: use what's already there.

E.g. I would aim for something like this:

//0.0.1-create-user.js
createItems({
  User: [{
    id: createId('John Cleese'),
    name: 'John Cleese'
  }]
});

//0.0.2-create-post.js
createItems({
  Post: [{
    id: createId('Best post ever'),
    title: 'Best post ever',
    author: createId('John Cleese')
  }]
});

No extra fields, no hassle with keeping refs consistent, etc. createId would be a function that takes a string and turns it into an id compatible with ObjectId. And if that's not possible, we could make it async, retrieving/creating ids from a lookup table as you proposed. However, I definitely think we should get rid of all the extra fields: __ref, __refFrom. There really is no need for them, because even when reading a JSON file with all the data (i.e. without direct function calls) it's really trivial to let the developer handle it himself, e.g.:

//0.0.0-update-something
var myData = require('./mydata.json');
myData.forEach(function (item) {
  item.id = createId(item.foo);
  item.someRelationship = createId(item.qux);
});
// use myData

It would make things a LOT simpler IMO.
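
For what it's worth, a deterministic createId along those lines seems feasible (a sketch only, assuming it is acceptable to derive the id by hashing the string; this helper does not exist anywhere yet):

var crypto = require('crypto');
var mongoose = require('mongoose');

function createId(name) {
  // hash the human-readable name and keep the first 24 hex chars,
  // which is exactly the 12 bytes an ObjectId requires
  var hex = crypto.createHash('md5').update(name).digest('hex');
  return new mongoose.Types.ObjectId(hex.slice(0, 24));
}

// createId('John Cleese') always returns the same id, so separate
// update files can reference each other without sharing any state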

estilles commented 8 years ago

@creynders ... yeah ... I guess we were talking about completely different issues.

Regarding the back-references issue you refer to, using createId('John Cleese') vs. __ref: 'john_cleese' is just as error-prone. Automating it as you suggest towards the end of your comment would definitely be less error-prone, but would make it considerably less readable. I suppose one could make an argument for fewer errors vs. better readability.

My intention was to leave back referencing as is, so update files would be backwards compatible.

webteckie commented 8 years ago

@JohnnyEstilles the "monkey wrench" I threw into the mix can, I think, be resolved by allowing updates to reference already-posted items (e.g., by slug or whatever...perhaps this is a config item of the updates process), which I think is already listed in the summary table you created above (2nd row). Btw, perhaps do these things independently and in incremental order, from the trivial ones (like creating the plugin module and eliminating the double saves) to the more complex ones (which perhaps require more thought and discussion).

estilles commented 8 years ago

@webteckie ... No worries about the "monkey wrench". I personally think it's a good feature to have. I need this for another project as well, which is why I want to make it Keystone-agnostic. Hopefully I'll have some time to work on it this weekend. I have a bunch of guys on holiday, so we're really busy at the office.