toystars / node-elasticsearch-sync

ElasticSearch and MongoDB sync module for node
MIT License

node-elasticsearch-sync

What does it do

The elasticsearch-sync package keeps your MongoDB collections and ElasticSearch cluster in sync. It does so by tailing the MongoDB oplog and replicating every CRUD operation into the ElasticSearch cluster with minimal overhead. Note that a replica set is required for the package to tail MongoDB.
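
To make the oplog-tailing idea concrete, here is a minimal sketch of how a single oplog entry could be mapped to an ElasticSearch action. It is not the package's actual implementation: `oplogEntryToAction` and the shape of the object it returns are hypothetical. The `op`, `ns`, `o` and `o2` fields are the standard fields of a MongoDB oplog entry.

```javascript
// Hypothetical helper, not part of node-elasticsearch-sync.
// Maps one MongoDB oplog entry to a description of the ElasticSearch
// operation that would mirror it, or null if no watcher matches.
function oplogEntryToAction(entry, watchers) {
  // "ns" has the form "<database>.<collection>"; collection names may contain dots.
  const collection = entry.ns.slice(entry.ns.indexOf('.') + 1);
  const watcher = watchers.find((w) => w.collectionName === collection);
  if (!watcher) return null;

  const target = { index: watcher.index, type: watcher.type };
  switch (entry.op) {
    case 'i': // insert: "o" is the full document
      return { action: 'index', id: String(entry.o._id), body: entry.o, ...target };
    case 'u': // update: "o2" identifies the document, "o" describes the change
      return { action: 'update', id: String(entry.o2._id), ...target };
    case 'd': // delete: "o" identifies the removed document
      return { action: 'delete', id: String(entry.o._id), ...target };
    default: // other entry types (no-ops, commands) are ignored
      return null;
  }
}

// Example values only.
const watchers = [{ collectionName: 'users', index: 'person', type: 'users' }];
const entry = { op: 'i', ns: 'db-name.users', o: { _id: 'abc123', firstName: 'Ada' } };
console.log(oplogEntryToAction(entry, watchers));
```

Entries for collections without a watcher yield null, which is one simple way to ignore everything the watchers do not cover.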

How to use

npm install node-elasticsearch-sync --save

Sample usage (version >= 1.0.0)

After adding the package to your node app, configure it through environment variables only. An error is thrown if any required environment variable is not defined.

// initialize package as below
var ESMongoSync = require('node-elasticsearch-sync');

let transformFunction = (watcher, document, callBack) => {
  document.name = document.firstName + ' ' + document.lastName;
  callBack(document);
}

let sampleWatcher = {
  collectionName: 'users',
  index: 'person',
  type: 'users',
  transformFunction: transformFunction, // can be null
  fetchExistingDocuments: true,
  priority: 0
};

// the "collectionName" and "type" fields in watchers MUST be the same. This might change in later versions.

let watcherArray = [];
watcherArray.push(sampleWatcher);

// The following environment variables must be defined in the shell before the app starts.
// An error is thrown if any of them is not defined.
//   export MONGO_OPLOG_URL="mongodb://127.0.0.1:27017/local"   # MongoDB oplog URL, i.e. the local DB of the replica set
//   export MONGO_DATA_URL="mongodb://127.0.0.1:27017/db-name"  # MongoDB URL where data will be pulled from
//   export ELASTIC_SEARCH_URL="localhost:9200"                 # ElasticSearch cluster URL
//   export BATCH_COUNT=100                                     # number of documents to index in a single batch

ESMongoSync.init(watcherArray, null);

/*
 * The init function takes two (2) arguments in all, as follows
 * 1. Array of watcher objects specifying which MongoDB collections to pull from and keep in sync with the ES cluster
 * 2. ElasticSearch object (can be null) - an already defined ElasticSearch object (returned from an elasticsearch cluster connection) can be passed into the init() function to ensure that the same
 *    object used for the cluster connection is used in the package. This only reduces the number of connections to elasticsearch by one and might offer no practical performance enhancement.
 *    If null is passed, node-elasticsearch-sync creates and maintains its own internal elasticsearch object and uses that for data pull and real-time sync.
 *    It is recommended to pass "null" if the above explanation is not clear enough.
 */

All other configurations are as they were in previous versions.
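
Because initialization fails when a required variable is missing, it can help to check the environment up front and report every missing name at once. The sketch below is an illustration only, not part of the package: `missingEnvVars` is a hypothetical helper, and the variable list is copied from the sample above.

```javascript
// Required variables, as listed in the sample usage above.
const REQUIRED_ENV_VARS = [
  'MONGO_OPLOG_URL',
  'MONGO_DATA_URL',
  'ELASTIC_SEARCH_URL',
  'BATCH_COUNT',
];

// Hypothetical helper: returns the names of required variables that are
// undefined or empty in the given environment object.
function missingEnvVars(env) {
  return REQUIRED_ENV_VARS.filter((name) => !env[name]);
}

const missing = missingEnvVars(process.env);
if (missing.length > 0) {
  console.error('Missing environment variables: ' + missing.join(', '));
} else {
  console.log('All required environment variables are set.');
}
```

Running a check like this before ESMongoSync.init() gives one readable message instead of a thrown error per missing variable.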

More usage info

Watchers

Below is more info about sample watcher:

let sampleWatcher = {
  collectionName: 'users',
  index: 'person',
  type: 'users',
  transformFunction: transformFunction,
  fetchExistingDocuments: true,
  priority: 0
};
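
A watcher can be sanity-checked before it is passed to init(). The sketch below is hypothetical (`validateWatcher` is not part of the package). It encodes what this README states, namely that collectionName and type must be the same and that transformFunction may be null, plus one assumption: that collectionName, index and type are non-empty strings. It also shows a transform function being exercised on its own with a stand-in callback.

```javascript
// Hypothetical helper: returns a list of problems found in a watcher object.
function validateWatcher(watcher) {
  const problems = [];
  // Assumption: these three fields are expected to be non-empty strings.
  for (const field of ['collectionName', 'index', 'type']) {
    if (typeof watcher[field] !== 'string' || watcher[field] === '') {
      problems.push(field + ' must be a non-empty string');
    }
  }
  if (watcher.collectionName !== watcher.type) {
    problems.push('collectionName and type must be the same');
  }
  const fn = watcher.transformFunction;
  if (fn != null && typeof fn !== 'function') {
    problems.push('transformFunction must be a function or null');
  }
  return problems;
}

// Same transform as in the sample usage.
const transformFunction = (watcher, document, callBack) => {
  document.name = document.firstName + ' ' + document.lastName;
  callBack(document);
};

const sampleWatcher = {
  collectionName: 'users',
  index: 'person',
  type: 'users',
  transformFunction: transformFunction,
  fetchExistingDocuments: true,
  priority: 0,
};

console.log(validateWatcher(sampleWatcher)); // prints []

// Exercise the transform with a stand-in callback.
transformFunction(sampleWatcher, { firstName: 'Ada', lastName: 'Lovelace' }, (doc) => {
  console.log(doc.name); // prints "Ada Lovelace"
});
```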

Logging

This package uses the debug library for logging. You can enable debugging by setting the DEBUG environment variable:

# enable all debugging (including other packages)
export DEBUG=*
# run the program to be debugged as usual
npm start

# enable basic debugging for this package
DEBUG=node-elasticsearch-sync:info npm start

# enable basic debugging + errors for this package (recommended setting)
DEBUG=node-elasticsearch-sync:info,node-elasticsearch-sync:error npm start

# enable all debugging for this package
DEBUG=node-elasticsearch-sync:* npm start

# enable all debugging for this package except the oplog stuff
DEBUG=node-elasticsearch-sync:*,-node-elasticsearch-sync:oplog npm start

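The examples above use the following debuggers: node-elasticsearch-sync:info, node-elasticsearch-sync:error and node-elasticsearch-sync:oplog.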

Extra APIs

Reindexing

If you have a cron job that runs at specific intervals and needs to reindex data from your MongoDB database into the ElasticSearch cluster, the reIndex function takes care of fetching the new documents and reindexing them in ElasticSearch. It also comes in handy when the ElasticSearch mappings change and data has to be reindexed. Note that calling reIndex overwrites data previously stored in the ElasticSearch cluster, and that it does not take into account the size of the documents to reindex or the specs of the ElasticSearch cluster.

ESMongoSync.reIndex();

MongoDB and Oplog connection destruction

If for any reason there is a need to disconnect from MongoDB and MongoDB Oplog, then the destroy or disconnect functions handle that.

// completely destroy both connections
ESMongoSync.destroy();

// only disconnect from MongoOplog.
ESMongoSync.disconnect();

To resume syncing after the MongoDB and Oplog connections have been destroyed or stopped,

ESMongoSync.resume();

will reconnect to Mongo and Oplog and re-enable real time Mongo to ElasticSearch sync.
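
One way to use these calls together is to pause real-time sync around a maintenance task and always resume afterwards. The wrapper below is a sketch under assumptions: `withSyncPaused` is a hypothetical helper, it treats disconnect() and resume() as plain synchronous calls (this README does not say what they return), and it takes the sync object as a parameter so it can be tried with a stand-in.

```javascript
// Hypothetical helper: pauses oplog tailing while `task` runs, then resumes,
// even if the task throws or rejects.
async function withSyncPaused(sync, task) {
  sync.disconnect(); // stop tailing the oplog
  try {
    return await task();
  } finally {
    sync.resume(); // re-enable real-time sync
  }
}

// Stand-in object that records calls; in an app, pass ESMongoSync instead.
const calls = [];
const fakeSync = {
  disconnect: () => calls.push('disconnect'),
  resume: () => calls.push('resume'),
};

withSyncPaused(fakeSync, async () => {
  calls.push('task');
}).then(() => {
  console.log(calls.join(' -> ')); // prints "disconnect -> task -> resume"
});
```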

Dynamic watchers

You can dynamically add watchers by calling:

ESMongoSync.addWatchers(watchersArray);
// this takes an array of watchers as argument, even if only 1 watcher is to be included.
// note that existing documents won't be pulled for the new watcher, but oplog activities will be tailed in real time.

Sample usage (version <= 1.0.0)

After adding package to node app

var ESMongoSync = require('node-elasticsearch-sync');

// initialize package as below
let finalCallBack = function () {
  // whatever code to run after package init
  // .
  // .
}

let transformFunction = function (watcher, document, callBack) {
  document.name = document.firstName + ' ' + document.lastName;
  callBack(document);
}

let sampleWatcher = {
  collectionName: 'users',
  index: 'person',
  type: 'users',
  transformFunction: transformFunction,
  fetchExistingDocuments: true,
  priority: 0
};

let watcherArray = [];
watcherArray.push(sampleWatcher);

let batchCount = 500;

ESMongoSync.init('MONGO_URL', 'ELASTIC_SEARCH_URL', finalCallBack, watcherArray, batchCount);

Using environment variables

While it is possible to supply the MongoDB and ElasticSearch cluster URLs as parameters to the init() method, it is best to define them as environment variables. The MongoDB URL should be defined as process.env.SEARCH_MONGO_URL and the ElasticSearch cluster URL as process.env.SEARCH_ELASTIC_URL. With the URLs supplied as environment variables, the init method can be called like so:

ESMongoSync.init(null, null, finalCallBack, watcherArray, batchCount);
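
The fallback described above can be pictured as follows. This is only a sketch of the idea, not the package's code: `resolveUrls` is a hypothetical helper, and the precedence (explicit argument first, then environment variable) is an assumption.

```javascript
// Hypothetical helper: picks each URL from the explicit argument if given,
// otherwise from the corresponding environment variable.
function resolveUrls(mongoUrl, elasticUrl, env) {
  const resolved = {
    mongoUrl: mongoUrl || env.SEARCH_MONGO_URL,
    elasticUrl: elasticUrl || env.SEARCH_ELASTIC_URL,
  };
  if (!resolved.mongoUrl || !resolved.elasticUrl) {
    throw new Error(
      'MongoDB and ElasticSearch URLs must be supplied as arguments or environment variables'
    );
  }
  return resolved;
}

// Example values only.
const env = {
  SEARCH_MONGO_URL: 'mongodb://127.0.0.1:27017/db-name',
  SEARCH_ELASTIC_URL: 'localhost:9200',
};
console.log(resolveUrls(null, null, env));
```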

More usage info (version <= 1.0.0)

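The elasticsearch-sync package tries to handle most of the heavy lifting, but certain checks have to be put in place, as can be seen from the init function above. ESMongoSync.init() takes five parameters, in the order shown in that call: the MongoDB URL, the ElasticSearch cluster URL, a callback to run after initialization, the array of watchers, and the batch count.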

Below is more info about sample watcher:

let sampleWatcher = {
  collectionName: 'users', 
  index: 'person',
  type: 'users',
  transformFunction: transformFunction,
  fetchExistingDocuments: true,
  priority: 0
};

Dynamic watchers (version <= 1.0.0)

You can dynamically add watchers by calling:

ESMongoSync.addWatchers(watchersArray);

// this takes an array of watchers as argument, even if only 1 watcher is to be included.
// note that existing documents won't be pulled for the new watcher, but oplog activities will be tailed in real time.

Sample init (version <= 1.0.0)

Still confused? Get inspired by this Sample Setup

Contributing

Contributions are welcome and will be fully credited.

We accept contributions via pull requests on GitHub.

Pull Requests

Issues

Check GitHub issues for current issues.

Credits

License

The MIT License (MIT). Please see LICENSE for more information.