Support Http Pouch on Node

jfgirard commented 10 years ago

Hi,

For a while now, I have been looking for a full-text index that works both in the browser (offline) and on CouchDB. My current setup uses a PostgreSQL cache with a full-text index on the server and a primitive search based on a simple map function in PouchDB.

Your solution using Lunr is much better. I would use it on the server too, but I don't want to duplicate my data from CouchDB to PouchDB (it can be very large, which is why I want to drop the PG index).

I did some tests and managed to create a CouchDB map/reduce view using CommonJS (in my fork repo https://github.com/jfgirard/pouchdb-quick-search). I added the required libs (lunr + stemmerSupport, lunr-LANG if needed) along with a tweaked version of your map function. All 35 tests pass (with the missing stale option added, see https://github.com/pouchdb/mapreduce/pull/197) using TEST_DB=http://localhost:5984/quick-search.

Is this something you would want to add to your code?
If yes, I can make a PR... But I had to make some changes to hook into the code that adds / removes the design documents in CouchDB. With your help, it could be done better. Also, it only works with Pouch in Node.js, and it reads the libs from the couchdb_libs folder.
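
For readers unfamiliar with the technique: CouchDB lets a map function load CommonJS modules, provided they are stored under views.lib in the design document. The following is only a minimal sketch of that mechanism, not the code from the fork. The design doc name, the view name, the "text" field and the naive tokenizer module are all illustrative assumptions (the fork stores Lunr and its language support there instead).

```javascript
// Hypothetical tokenizer module, kept as a source string under views.lib so
// that CouchDB's CommonJS loader can resolve it as 'views/lib/tokenizer'.
var tokenizerSource = [
  'exports.tokenize = function (text) {',
  '  return String(text).toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);',
  '};'
].join('\n');

// Map function as it would run inside CouchDB, where the view server
// provides require() and emit(). It is never called from Node here.
// It emits one row per token of the (assumed) "text" field.
var map = function (doc) {
  var tokenize = require('views/lib/tokenizer').tokenize;
  if (typeof doc.text === 'string') {
    tokenize(doc.text).forEach(function (token) {
      emit(token, null);
    });
  }
};

// Build the design document. Functions are stored as source strings.
function buildDesignDoc() {
  return {
    _id: '_design/search',
    views: {
      lib: { tokenizer: tokenizerSource },
      search: { map: map.toString() }
    }
  };
}

var ddoc = buildDesignDoc();
console.log(JSON.stringify(ddoc, null, 2));
```

Saving the resulting object to the database (for example with a PUT to /dbname/_design/search) would make CouchDB build the index with its own couchjs processes on the first query.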

Jeff

nolanlawson commented 10 years ago

This is some pretty cool work you've done here, but I have to admit that this really wasn't my intention for this plugin. I feel like using Lunr in a server environment is a dead-end, because there are so many other good options when you're not restricted to the browser (CouchDB-Lucene, Cloudant Search, Solr, ElasticSearch, Postgres as you say, etc.). Deduplication of code is important, but I think your fix buys a very small deduplication (a dozen lines of code maybe?) in exchange for the very icky situation of using Lunr on the server. "Full-text search" is not a fungible commodity, and Lucene does it way better than Lunr does, having been around longer and having had the input of domain experts in natural language processing, not to mention domain experts in the various languages. So my goal with this plugin was not to replace Lucene (way too hard), but rather to create a "good enough" FTS that could run in any browser. (Web SQL FTS would have been better, but alas.)

As for the use case of only having this run in Node, I'm kinda perplexed by that. If you're already running in a server environment, and thus don't have an "offline" state, then why not just use CouchDB-Lucene or any of the other server-side FTS libraries mentioned above? It's not like you need to switch back and forth between Couch and Pouch when you go offline.

As for the design documents: yeah, that was a hack I put in because I dislike design documents. I think the fact that you need to manually delete and recreate design documents might indicate how bad of a solution this plugin is for anything server-side.

As an alternative, you may want to look into reviving the stillborn original PouchDB search plugin. The goal of this plugin was to directly mimic Cloudant's search (and CouchDB-Lucene, since they share the same structure of the design documents). Currently the only thing that makes it nonviable is that it reads in every document for every query, since persisted map/reduce did not exist at the time it was written.

jfgirard commented 10 years ago

Thanks for the feedback. I'll check the PouchDB search plugin.

I tried CouchDB-Lucene before settling on PG FTS. They both work pretty well but have shortcomings. For example, C-L keeps one connection open for each database it queries, and the Java runtime needed to run it (the default is Jetty, I think) eats a lot of resources. It takes a lot of disk space too.

There are 2 main reasons to do it:

1. Simplify the server-side logic by removing the need for a PG server or other FTS service. No more data to sync, processes to keep running, ... Just CouchDB and a Node server.
2. Have a seamless experience (and the same query results) whether or not the user is "online". It's not a good experience if a search "offline" (hitting PouchDB) and the same search "online" (hitting the server-side index) return different results.

I understand that having a PouchDB on the server would not scale well (using LevelDB). But with the change I made, the hard work is done by CouchDB itself. The map function runs entirely within CouchDB... PouchDB, using the http adapter, only parses the query string, generates the URLs for the queries and returns the results.

I may very well stay with Postgres FTS... I'll do tests with big datasets and see how it compares.
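
To make the division of labour concrete, here is a rough sketch of the client-side part described above: split the query string into terms and turn each term into a view request that CouchDB answers from its own index. The design doc and view names, the helper names and the naive tokenizer are assumptions made for illustration. The actual plugin goes through PouchDB's query API and Lunr's pipeline rather than building URLs by hand.

```javascript
// Naive stand-in for the real tokenizing/stemming pipeline.
function tokenize(q) {
  return q.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Build one view URL per query term. "search" is an assumed name for both
// the design document and the view.
function buildViewUrls(dbUrl, q, opts) {
  var stale = opts && opts.stale; // e.g. 'ok' or 'update_after'
  return tokenize(q).map(function (term) {
    // View keys are JSON values, so the term is JSON-encoded, then URL-encoded.
    var params = ['key=' + encodeURIComponent(JSON.stringify(term))];
    if (stale) {
      params.push('stale=' + encodeURIComponent(stale));
    }
    return dbUrl + '/_design/search/_view/search?' + params.join('&');
  });
}

var urls = buildViewUrls('http://localhost:5984/test', 'Foo bar', { stale: 'ok' });
urls.forEach(function (u) {
  console.log(u);
});
```

The Node process only builds requests and merges the rows that come back. Indexing and lookups stay in CouchDB.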

Jeff

nolanlawson commented 10 years ago

Ah okay, I understand better now how this works. I actually kinda forgot that CouchDB can load modules CJS-style within the map function. So actually, yeah, this is kinda neat.

I still think that server-side Lunr is the wrong solution, though. Consider the offline vs. online scenario you present: yes, it is a bad experience if the user gets different results while offline vs. online, but I don't believe the solution is to cripple the online experience to match the offline experience. Also, CouchDB's eventual consistency guarantees that the server-side results might occasionally be different from the client-side results, regardless of whether or not they're using the same algorithm.

That being said, if you have success vs. PG and you can write up an explanation in the README to explain the usage, then I'll consider adding it. But I'm reluctant to steer people towards using a poor man's FTS in the server when there are so many better options.

jfgirard commented 10 years ago

cool! I will update the issue once I know more about the performance and accuracy.

jfgirard commented 10 years ago

I spent some time comparing different solutions:

1. CouchDB + Node.js: a Node server using pouchdb-quick-search (http), with the map function and its libs in CouchDB.
2. My current solution, using PostgreSQL FTS fronted by a Node server.
3. The well-known CouchDB-Lucene.

I used 2 test databases:

5000 docs (small)
305 000 docs (large)

Build index: the first query, just after loading the data.
Query: one simple query.
Index size: size on disk (megabytes).
Siege: test under load with the siege command (10 concurrent queries, 6 times).

[Image: screen shot 2014-07-17 at 8 46 58 am]

Solution 1:

While a bit slower, it is still good.
With clustering coming in CouchDB 2.0, this solution should scale well: the docs are sharded and parallel couchjs processes (on multiple nodes) build the index. The 2 other solutions need custom logic to support sharding/clustering...
Much simpler server-side logic.
Search results are similar to those of the other solutions for simple queries. Obviously, PG and C-L have many more features and options.

My feeling is that for a simple, basic, full-text solution, it could work. Furthermore, both Lunr and Quick-Search may gain more features and options in the future.

Is it worth a PR?

jfgirard commented 10 years ago

In the end, I think it makes more sense to have it as a plugin. I'll try that approach in the coming days.

nolanlawson commented 10 years ago

@jfgirard That's awesome that you did all this research. :)

Although from your data, it looks like quick-search consistently uses more storage and is slower than the other options – not to mention being much more naïve. That would mean that the primary benefit of your PR is that users can use the same code on both client and server (which is not without value!).

The main thing I wouldn't be happy about is the fact that you need to hack a _design document in order to get my "design doc-less" strategy to work. (I used that strategy only because I expected this module to remain local-only.) Also I'm not sure I understand how exactly to install this on CouchDB so that I can test it.

If you could provide an update to the README to explain how to use your version, I'll consider it. I'm starting to wonder if @calvinmetcalf wasn't right and I should have just made this compatible with Cloudant search/CouchDB-Lucene, though...

jfgirard commented 10 years ago

Indeed, it's a basic solution, but it performs surprisingly well, even under load.

I don't want you to feel unhappy about your code with design docs ;-) So, I'm in the process of creating a plugin that takes care of all the design doc stuff.

In order to do that, I need to hook into the search API.

I'm not sure about the best way to do it. Right now, I use:

exports.search = utils.toPromise(function (opts, callback) {
  if (this.type() === 'http') {
    if (this._searchHttp) {
      this._searchHttp(opts, callback);
    } else {
      callback({
        error: 'http search not supported'
      });
    }
  } else {
    this._search(opts, callback);
  }
});

where _searchHttp is a function from my plugin.

I added this function to include a plugin:

exports.searchPlugin = function (obj) {
  Object.keys(obj).forEach(function (id) {
    exports[id] = obj[id];
  });
};

Thinking about it, maybe I should have used Pouchdb.plugin to include it? It's a bit strange to have a plugin on the search plugin...

I also added this to expose some functions that make it possible to build the design doc map function:

exports.searchPluginSupport = {
  getText: getText,
  isFiltered: isFiltered,
  genPersistedIndexName: genPersistedIndexName,
  toFieldBoosts: toFieldBoosts
};

See this commit for a complete list of changes: https://github.com/jfgirard/pouchdb-quick-search/commit/a181d47b7bd409bd52a6235fa292b346fb24fa95

And here is my WIP for the plugin: https://github.com/jfgirard/pouchdb-quick-search-http-plugin/blob/master/index.js

The usage is:

var Pouchdb = require('pouchdb');
var QuickSearch = require('pouchdb-quick-search');
var HttpPlugin = require('./index.js');

QuickSearch.searchPlugin(HttpPlugin);
Pouchdb.plugin(QuickSearch);

var db = new Pouchdb('http://localhost:5984/test');
db.search({q: 'bar', fields: ['foo']});

I would like to have your opinion about it before doing more work.

One cool thing about the plugin pattern is that you can choose the plugin that matches your needs. For example, a future "CouchDB-Lucene" plugin could handle an HTTP request differently by sending it to a C-L server instead of CouchDB (as mine does).

Thanks!

nolanlawson commented 10 years ago

I think you mean to reverse the order here:

PouchDB.plugin(QuickSearch);
PouchDB.plugin(HttpPlugin);

This way, your HttpPlugin can simply rename db.search to db._search and define its own db.search, as you suggested.
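
A rough sketch of that ordering, using a tiny stand-in instead of the real PouchDB so it runs on its own. FakePouch, makeHttpPlugin and the result objects are hypothetical names for illustration. The stand-in's plugin() simply copies an object's functions onto the prototype, which is assumed here to be close enough to what PouchDB.plugin does with plain objects.

```javascript
// Minimal stand-in for PouchDB, just enough to show why plugin order matters.
function FakePouch(type) {
  this._type = type;
}
FakePouch.prototype.type = function () {
  return this._type;
};
FakePouch.plugin = function (obj) {
  Object.keys(obj).forEach(function (id) {
    FakePouch.prototype[id] = obj[id];
  });
};

// Stand-in for pouchdb-quick-search: a local-only search.
var QuickSearch = {
  search: function (opts, callback) {
    callback(null, { source: 'local', q: opts.q });
  }
};

// Hypothetical HTTP plugin: keep the search() that is already installed as
// _search, then define a new search() that only takes over for http databases.
function makeHttpPlugin(Pouch) {
  var localSearch = Pouch.prototype.search;
  return {
    _search: localSearch,
    search: function (opts, callback) {
      if (this.type() === 'http') {
        // A real plugin would query CouchDB here.
        callback(null, { source: 'http', q: opts.q });
      } else {
        this._search(opts, callback);
      }
    }
  };
}

// Order matters: QuickSearch first, so the HTTP plugin has something to wrap.
FakePouch.plugin(QuickSearch);
FakePouch.plugin(makeHttpPlugin(FakePouch));

var results = [];
new FakePouch('http').search({ q: 'bar' }, function (err, res) {
  results.push(res.source);
});
new FakePouch('leveldb').search({ q: 'bar' }, function (err, res) {
  results.push(res.source);
});
console.log(results);
```

If the two plugin() calls were swapped, the HTTP plugin would capture an undefined search and local databases would have nothing to fall back to.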

Let me take some time to think about this. I'm still pretty uncomfortable with encouraging people to use this plugin on the server side, but if you create a plugin-on-a-plugin, you have essentially created a fork, and it may contribute to user confusion. (Although this is the beauty of open-source; we can disagree and still peacefully coexist. :smiley:)

Could you please open a formal PR with a single commit containing all your changes (not the plugin-on-a-plugin style, but a true PR), including a change to the README to explain how to use it? If it's not too intrusive, I will strongly consider it in order to avoid the pain of forking. Otherwise I will tell you that I disagree, and you can create your plugin-on-a-plugin. Sound good?

jfgirard commented 10 years ago

Yep, I'll create a PR for it. By the way, thanks a lot for your time and help!

nolanlawson commented 10 years ago

No prob, thanks for pushing this to the limit. :)

nolanlawson commented 9 years ago

See https://github.com/nolanlawson/pouchdb-quick-search/pull/7#issuecomment-65953905, sorry for the late response.
