rwynn / monstache

a go daemon that syncs MongoDB to Elasticsearch in realtime. you know, for search.
https://rwynn.github.io/monstache-site/
MIT License

Is it safe to use direct-reads to partially index a Mongo collection with 100s of millions of documents? #92

Closed raffaeleguidi closed 6 years ago

raffaeleguidi commented 6 years ago

Hi, I was reading your documentation about direct reads and I was wondering whether direct-reads are the correct approach if I should index only the last few million documents of a large (100s of millions) collection. I am afraid that loading all the documents (just to pick a few of them) would hurt the production database I am attaching to. Wouldn't it be good to have a way to express a filter using Mongo's own query language (i.e. time: { $gte: '<1 month ago' })?

rwynn commented 6 years ago

You can use a combination of the resume and resume-from-timestamp options to get only the changes since a certain time. Read the doc for the timestamp: it is a 64 bit number with the high 32 bits as seconds since the epoch. You would not use direct-reads in this case.

When you use resume by itself, this timestamp will be read from MongoDB in the monstache db. So you could also update it there instead.
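For illustration, a sketch of how such a value could be computed in JavaScript (the one-month window is just an example, not an official monstache snippet):

// Encode seconds-since-epoch into the high 32 bits of the 64 bit number
// that resume-from-timestamp expects. Multiplying by 2^32 shifts the
// seconds into the high half; the low 32 bits (the ordinal) stay 0.
var seconds = Math.floor(Date.now() / 1000) - (30 * 24 * 60 * 60); // ~1 month ago
var resumeFromTimestamp = seconds * Math.pow(2, 32);
console.log(resumeFromTimestamp); // use this value for resume-from-timestamp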

raffaeleguidi commented 6 years ago

But isn't that timestamp intended only to filter the oplog? The bulk of the collection is not in the oplog anymore

rwynn commented 6 years ago

Yes, that only works on the oplog. Currently monstache does not support customized queries; this feature would need to be added.

It would be a good addition. I would need to first add the feature to the gtm library and then surface the configuration through monstache. Most of what monstache does in MongoDB goes through rwynn/gtm.

rwynn commented 6 years ago

You could possibly create views in MongoDB for your collections that would be based off a date query. Then you can use the view instead of the collection in your direct reads.

https://docs.mongodb.com/manual/core/views

However, I’ve tried this and it requires a one-line change to gtm to remove the SetCursorTimeout call, since this is not allowed on views. So you would have to check out gtm, modify it, go install it, and then build monstache against that.
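Roughly, the steps would look like this under a classic GOPATH workflow (a sketch only; adjust for how monstache actually vendors its dependencies):

# fetch the sources
go get -d github.com/rwynn/gtm github.com/rwynn/monstache
# edit $GOPATH/src/github.com/rwynn/gtm/gtm.go and remove the SetCursorTimeout call,
# then rebuild monstache against the modified gtm
go install github.com/rwynn/monstache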

rwynn commented 6 years ago

Here is an example of what I mean:

Given a collection test in db test with a field d, which is a date field. Before running monstache, we run the following in a script against MongoDB to drop/create the view with the built-in time range (in this case, d >= 15 days ago).

rs1:PRIMARY> use test;
switched to db test
rs1:PRIMARY> db.recent.drop()
true
rs1:PRIMARY> db.createView('recent', 'test', { $match: {d: {$gte:  new Date((new Date().getTime() - (15 * 24 * 60 * 60 * 1000)))}}});
{
    "ok" : 1,
    "operationTime" : Timestamp(1533761492, 1),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1533761492, 1),
        "signature" : {
            "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
            "keyId" : NumberLong(0)
        }
    }
}
rs1:PRIMARY> db.recent.find()
{ "_id" : ObjectId("5b6b56dbf49ba7cfe6f117ac"), "d" : ISODate("2018-08-08T20:47:23.834Z") }

Then you can use the following in monstache:

direct-read-namespaces = [ "test.recent" ]

However, as I mentioned before, you would need to alter the gtm source code and build monstache against this for it to work. Also, by default these would be indexed in Elasticsearch to the test.recent index. But you can override that with a [[mapping]] definition.
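For example, a sketch of such a mapping (the target index name here is illustrative):

[[mapping]]
namespace = "test.recent"
index = "test.test"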

rwynn commented 6 years ago

This is the line you would need to remove: https://github.com/rwynn/gtm/blob/master/gtm.go#L927

rwynn commented 6 years ago

If you need an option that is supported with the latest release version of monstache, you can use a [[filter]] script instead:

direct-read-namespaces = [ "db.collection" ]

[[filter]]
namespace = "db.collection"
script = """
// set threshold to 15 days ago in ms
var delta = 15 * 24 * 60 * 60 * 1000,
    thresh = (new Date()).getTime() - delta;
module.exports = function(doc) {
    var ts = doc.createdAt;  // createdAt is a MongoDB Date() field that you've stored
    return (ts.Unix()*1000) >= thresh;
}
"""

If you store the field in MongoDB as a Date, it is surfaced to scripts in JavaScript as a Go time.Time object. That is why the script calls this function (https://golang.org/pkg/time/#Time.Unix).

This filter will happen in monstache and will be run against all records in the collection. The result will be that most records are thrown away (I suppose this is what you didn't want on the production database). This will obviously be much slower than had your filter been applied by MongoDB. But it's the best I can do with how monstache works currently.

raffaeleguidi commented 6 years ago

Well, as a quick solution I will try the "filter" workaround, but attaching to a clone of the mongo cluster, and see how it goes; I will wait for these changes to show up in a production release, just to play it safe :+1:

Thanks a lot for your help :)

raffaeleguidi commented 6 years ago

Ok, I added the filter script to the configuration and the name of the collection in the direct reads array. The cpu obviously spins at 200% and I still cannot see anything indexed (well, it is skipping several million documents, which I expect is normal); the only thing I see is a lot of errors like these in the logs:

10/8/2018 20:20:23 ERROR 2018/08/10 18:20:23 TypeError: Cannot access member 'Unix' of undefined

any idea?

rwynn commented 6 years ago

Is it possible some of the documents do not contain the date field that is referenced? You can return false for those.

Note: my example uses a fictional createdAt field, but you would use some other ISODate() field that you've stored in MongoDB.

direct-read-namespaces = [ "db.collection" ]
[[filter]]
namespace = "db.collection"
script = """
// set threshold to 15 days ago in ms
var delta = 15 * 24 * 60 * 60 * 1000,
    thresh = (new Date()).getTime() - delta;
module.exports = function(doc) {
    var ts = doc.createdAt;  // createdAt is a MongoDB Date() field that you've stored
    if (ts) {
       if (ts.Unix) {
          return (ts.Unix()*1000) >= thresh;
       } else {
          console.log("The createdAt field in document " + doc._id + " is not a time.Time field");
       }
    } else {
       console.log("No createdAt field found in document " + doc._id);
    }
    return false;
}
"""
raffaeleguidi commented 6 years ago

Uhm, I would exclude that, and I do not have a simple way to check, because the log message does not have a document _id attached. But in any case the indexing is going fine, and it has already added 3M+ documents to the index without adding too much pressure to the db (monstache is the only user, though). I will leave the issue open just to let you know how it goes, but I expect a happy ending ;)

rwynn commented 6 years ago

If you have success you can switch over to Go middleware instead of JavaScript. That should perform better.

https://rwynn.github.io/monstache-site/advanced/
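For reference, a sketch of what such a plugin might look like, using the monstachemap package described at that link (the createdAt field and 15-day threshold mirror the JavaScript filter above; treat the details as an approximation, not the definitive API):

package main

import (
    "time"

    "github.com/rwynn/monstache/monstachemap"
)

// Map drops documents whose createdAt is missing or older than 15 days,
// mirroring the JavaScript filter from earlier in this thread.
func Map(input *monstachemap.MapperPluginInput) (*monstachemap.MapperPluginOutput, error) {
    output := &monstachemap.MapperPluginOutput{Document: input.Document}
    thresh := time.Now().AddDate(0, 0, -15)
    ts, ok := input.Document["createdAt"].(time.Time)
    if !ok || ts.Before(thresh) {
        output.Drop = true // do not index this document
    }
    return output, nil
}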

raffaeleguidi commented 6 years ago

Go middleware [...] should perform better

Aaagh, nodejs dude here - that hurts a bit ;) Any clear evidence for that? I understand that otto should be performant, and I believe these kinds of operations, being heavily I/O bound, should not be affected too much by the language/runtime they are implemented in. I would expect, probably, less cpu usage, but indexing speed should not vary too much.

In any case the little guy did his job flawlessly (I will add that timestamp check next time) and ingested 17bn documents (every document has more or less 1000 properties, so I had to raise the ES field count limit) in about 12 hours. Thank you for the help, the advice, and this amazing contribution to the open source world.

raffaeleguidi commented 6 years ago

BTW you were right, the timestamp was missing (a bug on my side). I simply wrapped the return statement in a try/catch, logged the exception, and found it.
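Presumably something along these lines (a sketch of the debugging approach described, not the actual script used):

// Same filter as above, with the return wrapped in try/catch so that
// documents with a missing or malformed createdAt get logged.
var delta = 15 * 24 * 60 * 60 * 1000,
    thresh = (new Date()).getTime() - delta;
module.exports = function(doc) {
    try {
        return (doc.createdAt.Unix() * 1000) >= thresh;
    } catch (e) {
        console.log("doc " + doc._id + ": " + e);
        return false;
    }
}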

rwynn commented 6 years ago

Well, the Otto JavaScript interpreter itself is not thread safe, I’ve found, and I need to acquire a lock around running these filter functions. Kinda like a global interpreter lock. The Go plugin doesn’t require this lock, so that alone would allow it to perform better.

raffaeleguidi commented 6 years ago

Uhm, I see... I don't want to dig deep into the need for a lock, but I wonder whether stdin/stdout could be used as a middleware. Wouldn't it be interesting to be able to pipe every record into another process (i.e. a nodejs one :grin:)?

raffaeleguidi commented 5 years ago

Hey there, @rwynn! Any chance that the first approach you suggested made it into the latest version?

rwynn commented 5 years ago

@raffaeleguidi you can now use MongoDB aggregation queries which will do the filtering on the server. Is that maybe what you mean by first approach?

https://rwynn.github.io/monstache-site/advanced/#aggregation-pipelines

https://docs.mongodb.com/manual/aggregation/

[[pipeline]]
script = """
module.exports = function(ns, changeStream) {
  if (changeStream) {
    return [
      { $match: {"fullDocument.foo": 1} }
    ];
  } else {
    return [
      { $match: {"foo": 1} }
    ];
  }
}
"""
raffaeleguidi commented 5 years ago

@raffaeleguidi you can now use MongoDB aggregation queries which will do the filtering on the server. Is that maybe what you mean by first approach?

Uhm no, I was thinking about this one: https://github.com/rwynn/monstache/issues/92#issuecomment-411550182 - using MongoDB views in direct reads. You said, on the 8th of August, that I would have had to comment out a line in the gtm source code to enable it, and I was wondering if it was now a supported feature.

But regarding the pipeline approach - I guess I could use it with just a $match and no data aggregations ($group, etc.), to simply filter data on both direct reads and the oplog. Has this feature (which seems more comprehensive) simply taken over the one you mentioned in your 8th of August comment?

rwynn commented 5 years ago

You can now use a MongoDB view as a direct read namespace in the latest version.

Also, you can use the pipeline approach with any aggregation query that MongoDB supports, including just a $match to filter the results.

You do need to adjust the pipeline slightly for change streams, since the doc will be in a field named fullDocument.
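Applied to the date filter discussed earlier in this thread, the pipeline might look like this sketch (reusing the d field from the view example above; not verified):

[[pipeline]]
script = """
// Sketch: the 15-days-ago filter from earlier, expressed server-side.
// Change streams wrap the document in fullDocument, hence the two cases.
module.exports = function(ns, changeStream) {
  var thresh = new Date(new Date().getTime() - (15 * 24 * 60 * 60 * 1000));
  if (changeStream) {
    return [ { $match: { "fullDocument.d": { $gte: thresh } } } ];
  } else {
    return [ { $match: { "d": { $gte: thresh } } } ];
  }
}
"""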

rwynn commented 5 years ago

Hopefully this link is helpful.

https://rwynn.github.io/monstache-site/advanced/#mongodb-view-replication