percolatestudio / publish-counts

Meteor package to help you publish the count of a cursor, in real time
https://atmospherejs.com/tmeasday/publish-counts
MIT License

big collections redux #14

Open javaknight opened 10 years ago

javaknight commented 10 years ago

I need this to watch a big collection, but I don't want to actually publish the big collection to the client. I just need the count of the collection to be published, and if that count changes, then I need the change to be reactive and sent to the client.

Currently when I fire this up on a large collection, it just crashes my server after showing me the initial count.

dburles commented 10 years ago

Hey @javaknight, do you have any more info on the crash? And how many records are in the collection?

tmeasday commented 10 years ago

If the collection is truly large, it might be better to do something a little different:

Meteor.publish('count', function() {
  var self = this, first = true;

  // Poll the count and push it down the 'counts' pseudo-collection.
  var count = function() {
    var thisCount = Collection.find().count();
    if (first) {
      self.added('counts', 'X', {count: thisCount});
    } else {
      self.changed('counts', 'X', {count: thisCount});
    }
    first = false;
  };

  var interval = Meteor.setInterval(count, 1000); // re-count every 1s
  count();
  self.ready();

  self.onStop(function() {
    Meteor.clearInterval(interval); // clearInterval, not clearTimeout
  });
});

Of course ideally you'd share the interval between multiple users subbing to the same publication. Sounds like a whole package of its own :)
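
For reference, consuming this publication on the client could look something like the sketch below. The local collection name 'counts' and the 'X' id mirror the server code above; 'ClientCounts' and the subscription wiring are illustrative, not part of this package.

// client.js -- a minimal sketch. 'counts' must match the name used in
// self.added()/self.changed() on the server.
var ClientCounts = new Mongo.Collection('counts');

Meteor.subscribe('count');

Tracker.autorun(function() {
  var doc = ClientCounts.findOne('X'); // the 'X' id from the publication
  if (doc) console.log('current count:', doc.count);
});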

tmeasday commented 10 years ago

(This package pulls down and caches the _id of every record. If there are a lot of them, this is a terrible idea, but it's what allows the count to be truly realtime.)

chhib commented 10 years ago

@tmeasday: I think you mean Meteor.setInterval instead of Meteor.setTimeout.

tmeasday commented 10 years ago

Ahh, thanks @chhib - I updated the code so people aren't confused.

colllin commented 9 years ago

@tmeasday Why is it necessary to cache the _id from every record? Compared to just incrementing on added and decrementing on removed?

tmeasday commented 9 years ago

@colllin if you are talking about livedata's added and removed -- well, it'll need to cache the _ids to work properly anyway. The underlying reason is basically timing issues on the oplog: if the server sees an oplog insert message, it needs to check that it hasn't already counted that document (hence the cached _ids), otherwise there are edge cases in which double counting could happen.

That's my understanding of it anyway. Possibly someone could figure out a way to use the low-level oplog driver and deal with these issues, not sure.
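
As a rough illustration of that guard (not the package's actual code), caching the counted _ids makes the counting idempotent against a replayed oplog entry:

// Sketch only: a set of already-counted _ids guards against double counting.
var countedIds = {};
var count = 0;

function onOplogInsert(id) {
  if (countedIds[id]) return; // already counted: oplog replay / race
  countedIds[id] = true;
  count += 1;
}

function onOplogRemove(id) {
  if (!countedIds[id]) return; // never counted it in the first place
  delete countedIds[id];
  count -= 1;
}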

colllin commented 9 years ago

@tmeasday Yes, that's what I was talking about. I didn't realize observe()ing added and removed documents was imperfect (could send duplicate events)... interesting. Thank you for the explanation.

tmeasday commented 9 years ago

To be clear it's the oplog that is imperfect (I think there are a bunch of issues around the exact timing of doing your initial query vs where you start observing the oplog from).

.added() and .removed() in livedata are "perfect", but have the aforementioned performance caveat (you don't want to do them on a huge cursor).

jchristman commented 9 years ago

@javaknight, I had this same problem because my collection has 100,000+ rows. I am implementing a "scrollbox" that loads a sliding window over a collection to emulate the browser loading the entire collection. I implemented the solution @tmeasday posted above at https://github.com/jchristman/meteor-collection-scroller/blob/master/lib/collections.js if you wanna check it out (also at http://scroller.meteor.com).

faceyspacey commented 9 years ago

@tmeasday regarding your setInterval example: instead, couldn't you just make the observer based on a cursor that finds a limit of one row, sorted newest to oldest, and then increment the count only when needed rather than on an interval? And of course call Collection.find().count() just once at the beginning, and set the removed observer as usual. You'd just need to accept the collection as an argument instead of a cursor -- perhaps a collection plus selector plus dateColumn.

Counts.publish = function(self, name, collection, selector, dateColumn, options) {
  var initializing = true; // referenced below, so it needs declaring
  var sort = {};
  sort[dateColumn] = -1;

  // Take the full count once up front; the limit-1 cursor is only observed.
  var count = collection.find(selector).count();
  var cursor = collection.find(selector, {sort: sort, limit: 1});

  var observers = {
    added: function(id, fields) {
      count += 1;
      if (!initializing) self.changed('counts', name, {count: count});
    },
    removed: function(id, fields) {
      count -= 1;
      self.changed('counts', name, {count: count});
    }
  };

  // etc
};
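
For completeness, the elided wiring would presumably look something like the fragment below (a sketch only; note the objection just after this about how removed behaves on a limit-1 cursor):

// Hypothetical continuation, standing in for the "// etc" above.
var handle = cursor.observeChanges(observers);
initializing = false;

self.added('counts', name, {count: count});
self.ready();

self.onStop(function() {
  handle.stop();
});
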
tmeasday commented 9 years ago

@faceyspacey - Seems like a good idea for collections where you do have a date field to work with.

I'm not sure that the removed will work however? What if I remove a document that isn't the latest?

faceyspacey commented 9 years ago

Is there really no way for Meteor's observers to skip calling all the added handlers on the first run -- something internal to how observeChanges works? It seems everyone is doing the !initializing thing. ...I guess another cursor without a limit could be created just for the removed observer. Collection.remove could be overwritten to somehow notify this code -- obviously that won't address direct changes to the mongo collection outside Meteor code. The first solution seems fine to me. Whatchu think?

tmeasday commented 9 years ago

1. Well, you'll have the problem of a huge cursor again, which means an unacceptably large data set cached on the server.
2. Nope, doing anything off the oplog isn't going to work if you horizontally scale.

faceyspacey commented 9 years ago

Then I guess overwriting collection.remove is the only answer, coupled with a REST API endpoint to ping if you remove rows outside of Meteor. You'd just have to ping that API every time you directly remove rows. For me -- and I'm willing to bet the vast majority of Meteor developers -- we wouldn't even need that. Maybe just a simple reset() method to call from time to time.

faceyspacey commented 9 years ago

So I guess collection.remove would store, in another collection, the name of the collection (only if a count was published for it). No more than the collection name would need to be stored. Then in Counts.publish we just observe this collection for newly added documents (selecting only documents with the appropriate collection name) and decrement the count when one is found. We would also remove the row from this auxiliary collection after decrementing, so it too never gets large (never more than one row lol).
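
A loose sketch of that scheme, with all names hypothetical and the observer fragment assumed to run inside the publication where count, name, and self are in scope:

// Hypothetical sketch of the auxiliary-collection idea; names are made up.
var Removals = new Mongo.Collection('_countRemovals');

var originalRemove = MyCollection.remove.bind(MyCollection);
MyCollection.remove = function(selector, callback) {
  var n = MyCollection.find(selector).count(); // how many are about to go
  var result = originalRemove(selector, callback);
  if (n > 0) Removals.insert({collection: 'mycollection', n: n});
  return result;
};

// Inside the publication: decrement on each removal marker, then delete the
// marker so the auxiliary collection stays tiny.
Removals.find({collection: 'mycollection'}).observeChanges({
  added: function(id, fields) {
    count -= fields.n;
    self.changed('counts', name, {count: count});
    Removals.remove(id);
  }
});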

tmeasday commented 9 years ago

@faceyspacey if you are going to think about wacky solutions like this, I'd suggest just denormalizing the count somewhere.

faceyspacey commented 9 years ago

Well then, just resetting the count on removes would be the solution: using a counts collection with the count from one publication denormalized into one row there.

emmanuelbuah commented 9 years ago

@tmeasday the setInterval approach can also be improved by keeping track of the previous count and only sending data to the client when the current count (thisCount) differs from the previous one.
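
Adapting the interval body from the example further up, that might look like this sketch:

// Sketch: only push a message when the count actually changes.
var lastCount = null;
var count = function() {
  var thisCount = Collection.find().count();
  if (thisCount === lastCount) return; // unchanged: send nothing
  if (lastCount === null) {
    self.added('counts', 'X', {count: thisCount});
  } else {
    self.changed('counts', 'X', {count: thisCount});
  }
  lastCount = thisCount;
};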

emmanuelbuah commented 9 years ago

Knowing the current limitations of the oplog in combination with the existing observer API, I think the best solution at scale is to compute and store counts (in a mongodb collection or on the relevant doc) on insert and remove. E.g. on adding or removing comments from a post, update a comments counter (possibly on the post itself, i.e. post.commentsCount). This might look silly, but it works and scales very well.
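
For instance, with a comments counter denormalized onto the post (collection and field names are assumptions):

// Sketch: maintain post.commentsCount at write time instead of observing a
// large cursor. Collection and field names are assumptions.
Meteor.methods({
  addComment: function(postId, text) {
    Comments.insert({postId: postId, text: text, createdAt: new Date()});
    Posts.update(postId, {$inc: {commentsCount: 1}});
  },
  removeComment: function(commentId) {
    var comment = Comments.findOne(commentId);
    if (!comment) return;
    Comments.remove(commentId);
    Posts.update(comment.postId, {$inc: {commentsCount: -1}});
  }
});

Publishing the post then carries the count along for free, with no extra cursor to observe.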

Slava commented 9 years ago

I think we could make this better if we do the following steps:

sean-stanley commented 8 years ago

Just one point I'm a bit unclear on: if I have a large collection but only want to count a small subset of it (like unread notifications for a particular online user, not all notifications for all users), then I am only caching the documents in the cursor, not the entire collection, right? So this package would work very well for counting small numbers of things.

However, I suppose if I had 500 online users, each with only 10 unread notifications, I'd still be caching 5,000 documents, right?
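
For what it's worth, a per-user publication along these lines (collection and field names assumed) only caches the documents matched by the filtered cursor -- though across 500 such subscriptions that is still roughly the 5,000 cached documents estimated above:

// Sketch: count only the current user's unread notifications.
// 'Notifications' and its fields are assumptions, not part of the package.
Meteor.publish('unread-notifications-count', function() {
  if (!this.userId) return this.ready();
  Counts.publish(this, 'unread-notifications', Notifications.find({
    userId: this.userId,
    read: false
  }));
});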