olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

How to add a JSON stream to idx? #437

Closed hoogw closed 4 years ago

hoogw commented 4 years ago

I use oboe.js and stream-json.js to receive already-parsed JSON objects one at a time.

I need to create a lunr index, idx, and add one JSON object at a time (the whole document is not available until the JSON stream has ended).

I need something like this. How can I do it?

```js
var idx = lunr(function () {
  this.field('title')
  this.field('body')
  // no documents added yet
})

// somewhere else, when the stream fires an event with one parsed JSON object
streamJson.on('received-one-doc', function (doc) {
  // add one object at a time, repeating until the last object is received
  idx.add(doc)
})
```

hoogw commented 4 years ago

OK, I figured out how: I have to use the old v1.0.0 to do it.

v1.0

```js
var idx = lunr(function () {
  this.ref('id')
  this.field('text')
})

// somewhere else, when you receive one JSON node from the stream
oboe().node('features.*', function (feature) {
  idx.add(feature)
})
```

In v2.x you cannot do that: the documents must be added before the end of the configuration function.

```js
var idx = lunr(function () {
  this.ref('id')
  this.field('text')

  // At this point you must already have the whole JSON in memory. This
  // does not work with an oboe JSON stream: only one node is received at
  // a time, and you do not have the whole JSON until the stream ends.
  wholeJson.forEach(function (doc) {
    this.add(doc)
  }, this)
})
```

hoogw commented 4 years ago

Could you make v2.x support the v1.x style of adding documents?

To avoid storing the whole 1 GB of JSON in memory, I chose oboe to stream the JSON, which means only one piece of JSON is emitted at a time. I can only use v1.x to add each piece to idx as it arrives.

With v2.x, I have to buffer those pieces one by one until the stream ends, growing to 1 GB in memory, and only then can I add the whole 1 GB of JSON to idx. That does NOT work, since many users' browsers will crash at 1 GB of memory.

But v1.x works, since it uses far less memory by adding one piece of JSON at a time.
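For concreteness, here is a minimal sketch of the v2.x buffering pattern described above (the `/features.json` endpoint and the document fields are assumptions); it only works while the buffered documents fit in memory:

```js
var oboe = require('oboe')
var lunr = require('lunr')

var docs = []  // buffer grows with every streamed node, up to the full data set

oboe('/features.json')  // hypothetical endpoint
  .node('features.*', function (feature) {
    docs.push(feature)  // with v2.x we can only collect here, not index
  })
  .done(function () {
    // only once the stream has ended can the whole index be built, in one pass
    var idx = lunr(function () {
      this.ref('id')
      this.field('text')
      docs.forEach(function (doc) {
        this.add(doc)
      }, this)
    })
  })
```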

hoogw commented 4 years ago

I was able to use v1.0's idx.add(doc) one document at a time, outside of the idx definition.

But I found it too slow to build the index. Am I right?

The reason v2.0 uses an immutable index is to speed up index building. v2.0 is fast, but you have to have the whole JSON before you build the index.

With v1.0, you can add one document now and another document later, wherever you like; but each time you add a new document, the index is rebuilt, which costs huge amounts of CPU and memory.

Am I right?

olivernn commented 4 years ago

You are correct that the current version of lunr (2+) requires that all the documents are available at build time; it is not possible to incrementally build the index from a stream.

Lunr is an in-memory search index, so regardless of which version you are using, it will require at least as much memory as was required to store the documents before indexing. This is worth remembering.

It is possible to build the index and then serialise it; the serialised index can then be streamed to browsers. There are more details on this in the guides.
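As a sketch of that workflow: a lunr 2.x index serialises with `JSON.stringify` and rehydrates with `lunr.Index.load`, so the browser never has to re-index the documents (how the serialised JSON is transferred is up to you):

```js
// at build time (e.g. in Node), serialise the already-built index
var serialised = JSON.stringify(idx)

// in the browser, rehydrate it without re-indexing any documents
var loadedIdx = lunr.Index.load(JSON.parse(serialised))
var results = loadedIdx.search('solr')
```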

> The reason v2.0 uses an immutable index is to speed up index building. v2.0 is fast, but you have to have the whole JSON before you build the index. With v1.0, you can add one document now and another document later, wherever you like; but each time you add a new document, the index is rebuilt, which costs huge amounts of CPU and memory.

So, I think the current index building is probably slower than in previous versions; the benefit of the immutable index, and of having distinct build and query stages, is improved query performance as well as reduced index size. If you know all the documents that are being indexed upfront, you can create more efficient, higher-performance indexes.
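To illustrate the two distinct stages (the document contents here are made up):

```js
// build stage: every document is known upfront
var idx = lunr(function () {
  this.ref('id')
  this.field('text')
  this.add({ id: 1, text: 'a bit like Solr, but much smaller' })
  this.add({ id: 2, text: 'streaming JSON with oboe' })
})

// query stage: searches run against the immutable, pre-built index
var results = idx.search('solr')
```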