meteor / meteor-feature-requests

A tracker for Meteor issues that are requests for new functionality, not bugs.

Support merging subfields in DDP (or separate multiple subscriptions instead of merging) #67

Open jalik opened 7 years ago

jalik commented 7 years ago

Migrated from: meteor/meteor#3764

It's not possible to subscribe to a publication multiple times to get different subfields.

For example, let's say we have a collection where each document has a data attribute that can contain anything (temperature, speed, etc.). Each document can therefore be different, and keeping everything inside the data attribute avoids the risk of overriding "meta attributes".

So the simple publication below:

Meteor.publish("data", function (objectId, fields) {
  return Data.find({objectId}, {fields, sort: {createdAt: -1}});
});

Would return this kind of data:

[
  {
    "objectId": "P0I6JsjAfXepz5",
    "createdAt": 1497036846026,
    "data": {
      "temp": 78.9,
      "tempUnit": "C"
    }
  },
  {
    "objectId": "uKl9pR5m73za",
    "createdAt": 1497036848287,
    "data": {
      "humidity": 62.3,
      "humidityUnit": "%"
    }
  }
]

Now imagine that we have a dashboard with several monitors, each showing specific data for different objects. We need to subscribe multiple times to the same publication with different params:

Meteor.subscribe("data", "uKl9pR5m73za", {objectId: 1, createdAt: 1, "data.humidity": 1, "data.humidityUnit": 1});
Meteor.subscribe("data", "P0I6JsjAfXepz5", {objectId: 1, createdAt: 1, "data.temp": 1, "data.tempUnit": 1});

This is a simple example where the objects are very small, but in reality they contain more than two values in the data attribute, so we only want to fetch the ones we need, hence filling the fields param of the data publication to optimize the query.

The problem is that this only works for the first subscription: the following subscriptions do not return the other subfields of the document, forcing us to fetch the full data content even when we don't need it...
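
To illustrate the behavior, here is an assumed, deliberately simplified model of how the server-side mergebox combines documents from multiple subscriptions (illustrative only, not actual ddp-server code; mergeTopLevelFirstWins is a hypothetical helper name):

```javascript
// Illustrative model only: the mergebox merges documents per *top-level*
// field, and the field value from the first subscription wins, so
// subfields contributed by later subscriptions are dropped.
function mergeTopLevelFirstWins(existing, incoming) {
  // keys already present in `existing` keep their value
  return { ...incoming, ...existing };
}

const fromFirstSub = { data: { temp: 78.9, tempUnit: 'C' } };
const fromSecondSub = { data: { humidity: 62.3, humidityUnit: '%' } };

const merged = mergeTopLevelFirstWins(fromFirstSub, fromSecondSub);
// → { data: { temp: 78.9, tempUnit: 'C' } }; humidity never reaches the client
```

Under this model, deep merging would require combining the two data objects key by key rather than treating data as an opaque value.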

ramezrafla commented 7 years ago

I second this, but for another reason. We use reactive publications, and every time a publication pushes data again, it pushes the whole top-level field. We could use libraries like https://github.com/flitbit/diff server-side for deep-diffing and https://github.com/KyleAMathews/deepmerge client-side to merge documents, so we only send differentials at the 2nd and deeper sublevels (not just the top level).

This touches ddp-server, ddp-client, and diff-sequence, so it's a large change. I am available to contribute, but I need some guidance, as I have never dived deep into this code.

EDIT: This is actually a big deal for enterprise-grade production apps, as we deal with a lot of data. There would be substantial bandwidth savings if we could do this efficiently, and it would be a strong selling point for Meteor too.

ramezrafla commented 7 years ago

Also wanted to add that fixing diff-sequence would speed up Blaze tremendously. Actually, would Blaze also need to be updated for any changes to diff-sequence? @mitar?

mitar commented 7 years ago

Not sure this relates to Blaze. I think selecting a good projection using fields is good enough there?

mitar commented 7 years ago

On this topic, I will copy my comment from the other thread:

This is a trade-off. Deep-diffing documents costs much more than top-level diffing. So it might reduce how much data is transferred over the wire, but you get a bigger load on the server and higher latency. Also, subdocuments tend to be small, so there is often not much extra data going over the wire anyway, especially with the compression available in Meteor 1.2.

So I think the trade-off is good for the general case, and adding more complexity would make things worse. As a data consumer you can always use the fields parameter of the find call to limit which subdocuments you are interested in, which limits both what you get as data and what you are reactively dependent on. What is sent over the wire is probably not so important and is good as it is.

mitar commented 7 years ago

One other issue with deep merging is that DDP would have to be extended. We would need extra metadata to know how to merge subfields. For example, if you have:

{
  field: [{
    foo: 'bar'
  }, {
    foo: 'baz'
  }]
}

And you have two publish functions, one with fields equal to field: {$elemMatch: {foo: 'bar'}} and the other with field: {$elemMatch: {foo: 'baz'}}. How can the client merge them? It is not possible, because there is no information about which element of the server-side array a subdocument on the client corresponds to. The client sees only:

{
  field: [{
    foo: 'bar'
  }]
}
{
  field: [{
    foo: 'baz'
  }]
}
ramezrafla commented 7 years ago

Thanks @mitar. We have seen big docs being sent over the wire for minor changes. For example, the user document has a lot of fields; when a single subfield changes, the whole thing is resent, over and over.

In terms of merging, I would either send down { 'field.$0.foo' : 'bar' }

Or the result of the diff library: { kind: 'E', path: ['top', 'sub'], value: 'updated object' } (where 'E' = Edit, 'A' = Add, etc.); see https://github.com/flitbit/diff
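
For illustration, a minimal dot-path diff in the spirit of the first option might look like this (a hypothetical sketch, not the flitbit/diff API; dotDiff is an invented name, and removed fields and arrays are intentionally out of scope):

```javascript
// Hypothetical sketch: walk two plain objects and emit Mongo-style
// dot-path assignments for changed leaf values. Removed fields and
// array positional operators are not handled here.
function dotDiff(oldDoc, newDoc, prefix = '') {
  const changes = {};
  for (const key of Object.keys(newDoc)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const oldVal = oldDoc ? oldDoc[key] : undefined;
    const newVal = newDoc[key];
    if (newVal && typeof newVal === 'object' && !Array.isArray(newVal)) {
      // recurse into plain subobjects, accumulating dot-path changes
      Object.assign(changes, dotDiff(oldVal, newVal, path));
    } else if (oldVal !== newVal) {
      changes[path] = newVal;
    }
  }
  return changes;
}

dotDiff(
  { profile: { name: 'Ann', age: 30 } },
  { profile: { name: 'Ann', age: 31 } }
);
// → { 'profile.age': 31 }
```

Only the changed leaf is kept, so a one-subfield edit to a large document yields a one-key payload.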

mitar commented 7 years ago

Yes, a larger grammar would be needed. The issue is that already in the publish function you lose that information: you only have added, changed, removed. This extra information would also have to come over the wire, and then be merged on the client.

Maybe we could extend things so that the values of the changed call, for example, are not raw fields. But such an extended language carrying more information would then have to be supported by both DDP and client-side merging. I think this would require a lot of changes across the whole stack, and with MDG's focus on Apollo that is pretty hard to imagine.

But I do think this could be done relatively easily in a non-core package. I have not had this problem myself, so I have not made a package for it, but if I were asked to do it, I would do it in the following way.

In the publish function, publish all fields flattened into Mongo field selectors. Something like:

// Inside a Meteor.publish function; capture the publication context,
// since `this` inside the shorthand methods is not the publication:
const self = this;
const handle = Collection.find({...}).observeChanges({
  added(id, fields) {
    self.added('collection', id, flatten(fields));
  },
  changed(id, fields) {
    self.changed('collection', id, flatten(fields));
  },
  removed(id) {
    self.removed('collection', id);
  }
});

Where flatten converts an object like:

{
  _id: 'foo',
  field: {
    a: 'bar',
    b: [{
      c: 'baz'
    }]
  }
}

Into:

{
  _id: 'foo',
  'field.a': 'bar',
  'field.b.0.c': 'baz'
}
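
A minimal implementation of such a flatten could look like this (a sketch under the assumptions above, not an existing package API; edge cases such as empty objects are ignored):

```javascript
// Sketch: turn a nested document into top-level Mongo dot-path keys,
// using array indices as path segments. Empty objects and arrays are
// dropped by this simple version.
function flatten(obj, prefix = '', out = {}) {
  for (const key of Object.keys(obj)) {
    const path = prefix ? `${prefix}.${key}` : key;
    const val = obj[key];
    if (val !== null && typeof val === 'object') {
      flatten(val, path, out); // recurses into plain objects and arrays alike
    } else {
      out[path] = val;
    }
  }
  return out;
}

const flat = flatten({ _id: 'foo', field: { a: 'bar', b: [{ c: 'baz' }] } });
// → { _id: 'foo', 'field.a': 'bar', 'field.b.0.c': 'baz' }
```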

Then on the client, you can do something like:

const DeflattenCollection = new Mongo.Collection(null);
Collection.find({}).observe({
  added(doc) {
    DeflattenCollection.insert({_id: doc._id});
    DeflattenCollection.update({_id: doc._id}, {$set: _.omit(doc, '_id')});
  },
  changed(doc) {
    DeflattenCollection.update({_id: doc._id}, {$set: _.omit(doc, '_id')});
  },
  removed(doc) {
    DeflattenCollection.remove({_id: doc._id});
  }
});

And now the DeflattenCollection collection has the data in the same structure as in the MongoDB database.

Not tested. :-) Probably a bit more work has to be done on the client. For example, I am not sure what you would want as a result if the array elements available on the client are only those at server indexes 0, 3, and 4. Do you want the resulting client-side array to have three elements, or five elements with the values at indexes 1 and 2 set to null?

The downside of this approach is that the data is stored twice on the client side. But probably even a core Meteor implementation would have to do something like that to know how to reconstruct arrays.

This would send minimal field changes over the wire (at the expense of longer, repetitive field names, but compression should handle that). The only case where many field updates would be sent is when a new element is inserted into an array somewhere other than at the end: then the fields for all values after the insertion point are sent, instead of just a command to insert at that point.

ramezrafla commented 7 years ago

Thanks @mitar,

Here is what I developed and it works in our app. https://github.com/ramezrafla/meteor-oplog-deepdiff

We'll keep testing for edge cases. Please take a look and tell me what you think (others too!)

You are mentioned in the acknowledgments. It turns out that if you send down mongo-style selectors, you don't need to do anything client-side.

@theodorDiaconu, my guess is that we may need to replicate this in redis-oplog (if we want to), as this code works directly on the oplog observers. Not sure; if you have 2 mins, please take a look.

ramezrafla commented 7 years ago

First issue: https://github.com/ramezrafla/meteor-oplog-deepdiff/issues/1. Any help would be appreciated, as it touches the internals of minimongo.

mitar commented 7 years ago

BTW, also see: https://github.com/meteor/meteor/issues/4483

ramezrafla commented 7 years ago

I was able to fix the issue; the client side needed an update to an internal _process_changed function in livedata to flatten variables. But it needs a hack to ddp-client/livedata-connection.js to expose an internal object (see the last comment in https://github.com/ramezrafla/meteor-oplog-deepdiff/issues/1 and https://github.com/ramezrafla/meteor-oplog-deepdiff/blob/master/client.js).

Is there any other way to export the Connection variable? @hwillson and @benjamn, your help would be appreciated, as this is a cool feature to add to Meteor.

hwillson commented 7 years ago

@ramezrafla Sorry, I haven't had a chance to digest this full issue thread yet (or your package), but is Meteor.connection what you're looking for? See the full details in https://github.com/meteor/meteor/blob/devel/packages/ddp-client/client_convenience.js.

ramezrafla commented 7 years ago

Thanks @hwillson,

I don't think that would work. The variable we are looking for (Connection) is scoped internally to the file ddp-client/livedata-connection.js (i.e. a var statement at the beginning).

hwillson commented 7 years ago

Right, but ultimately what you're trying to do is alter the behaviour of the current connection itself. Connection in ddp-client/livedata_connection.js points to a constructor function. Meteor.connection is the connection that is created by calling this constructor function. Since you're interested in overriding the default behaviour of _process_changed, you could do this directly on the current Meteor.connection. So something like:

if (Meteor.connection) {
  Meteor.connection._process_changed = function (msg, updates) {
    ...
  }
}

If you log this inside the above overridden _process_changed function, you'll see that it points to the proper Connection object.

All of this being said, _process_changed isn't part of the public connection API, so the usual "use this approach at your own risk" caveats apply.

mitar commented 7 years ago

@ramezrafla: I see you are doing diffing on objects. I thought this was not necessary, because if you just flatten all fields, the existing Meteor logic will do the diffing for you (it compares the values of each field). The only downside is handling removed subfields; you would probably have to read the cached values from the mergebox.

If you go the route you are currently taking, then instead of trying to send those diffs over DDP, maybe consider using a streamer to send them in parallel, intercepting them as messages on the client, and populating a local collection. It might be easier, but you might also need to re-implement latency compensation and other things yourself.

Maybe onMessage hook is what you need? https://github.com/meteor/meteor/pull/7985

BTW, I do not think this has anything to do with the oplog. This should be completely independent of the oplog. You could do this at the publish level: just patch publish.changed. You can use this package to make it easier. Then you can access the currently published document in this._session.getCollectionView(collectionName).documents[stringId], diff the new changes against it, and do something with the result.

ramezrafla commented 7 years ago

@mitar, thanks for your message. Will read it carefully.

Here are my initial thoughts: there is actual diffing already happening, and it takes place in the mongo library, in the oplog observer. All I did was swap in a dot-based diff that keeps only what is needed, instead of using DiffSequence. It does the diffing against the cached mergebox value.

I'll take a look at streamer, but that means adding more libraries to Meteor, and I am trying to stay minimal.

ramezrafla commented 7 years ago

@hwillson good advice. We ended up overriding DiffSequence.applyChanges, which needed to support dot notation (the change is backwards-compatible); this reduced the need to override many core files.
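
The dot-notation change application can be pictured with a simplified sketch (hypothetical; applyDotChanges is an invented name and this is not the actual DiffSequence.applyChanges code):

```javascript
// Hypothetical simplification: set nested values along a dot path,
// deleting the key when the new value is undefined.
function applyDotChanges(doc, fields) {
  for (const [path, value] of Object.entries(fields)) {
    const parts = path.split('.');
    let node = doc;
    for (let i = 0; i < parts.length - 1; i++) {
      if (typeof node[parts[i]] !== 'object' || node[parts[i]] === null) {
        node[parts[i]] = {}; // create intermediate objects as needed
      }
      node = node[parts[i]];
    }
    const last = parts[parts.length - 1];
    if (value === undefined) {
      delete node[last];
    } else {
      node[last] = value;
    }
  }
  return doc;
}

const doc = { data: { temp: 78.9, tempUnit: 'C' } };
applyDotChanges(doc, { 'data.temp': 80.1 });
// → { data: { temp: 80.1, tempUnit: 'C' } }
```

A version like this stays backwards-compatible with plain top-level field names, since a path without dots degenerates to a single-key assignment.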

@hwillson and @mitar, We are left with one place where we need to change a core function. If you look at https://github.com/ramezrafla/meteor-oplog-deepdiff/blob/master/hacked-core/mongo/collection.js#L172 you'll see how we handle removed array elements in the update function. That function gets replicated in all collections.

I tried to do it manually in https://github.com/ramezrafla/meteor-oplog-deepdiff/blob/master/client.js (look at commented section) but failed miserably (using sample collection 'content' in our app to test it).

This should be the last step, as it works in staging. Then we can release the package. I would very much appreciate help closing this out.

Thanks

ramezrafla commented 7 years ago

We are close to the end. We have one last bug. Not sure if we need to open a ticket on main meteor repo. https://github.com/ramezrafla/meteor-oplog-deepdiff/issues/3

This bug relates to clearing data: the message is not sent from the server (even though the diffing tells the server to clear this field). It seems related to how ddp-server batches requests. Any help, please?

javascriptlove commented 4 years ago

Any update on this?

haveaguess commented 3 years ago

This killed a week for our team, any update?