olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.96k stars 548 forks source link

TypeError When Searching a Previously Serialized Index #18

Closed DannyNemer closed 11 years ago

DannyNemer commented 11 years ago

I found the bug. You can read my explanation in the first comment below.

When I search a previously serialized index, having loaded it using lunr.Index.load(), I receive a TypeError (shown in detail below). The following are steps to reproduce this error:

First, I create an index, add a document to the index, and log the index to ensure it is working:

var index = lunr(function () { //create index
    this.field('title')
    this.ref('id')
})
index.add({id: 1, title: 'apple' }) //add a document to index
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is working correctly:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object] }, length: 1 },
  tokenStore: { root: { docs: {}, a: [Object] }, length: 1 },
  corpusTokens: { length: 1, elements: [ 'appl' ] } }

I then serialize the index and store it in my database:

user.index = index.toJSON() //serialise index and store it in the database
user.save() //save changes to database

Later, I load the previously serialised index and log the index to ensure it is working:

index = lunr.Index.load(user.index) //load previously serialised index from database
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is working correctly and is identical to before:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object] }, length: 1 },
  tokenStore: { root: { docs: {}, a: [Object] }, length: 1 },
  corpusTokens: { length: 1, elements: [ 'appl' ] } }

Again, I will add a document to the index and log the index to ensure it is working after having loaded the previously serialized index:

index.add({id: 2, title: 'banana' }) //add a second document to index
console.log(index) //log index to demonstrate proper functionality

The following is the result of the console.log(index). As you see, everything is still working correctly:

{ _fields: [ { name: 'title', boost: 1 } ],
  _ref: 'id',
  pipeline: { _stack: [ [Object], [Object] ] },
  documentStore: { store: { '1': [Object], '2': [Object] }, length: 2 },
  tokenStore: { root: { docs: {}, a: [Object], b: [Object] }, length: 2 },
  corpusTokens: { length: 2, elements: [ 'appl', 'banana' ] } }

Here is the problem. I will now attempt to search the index for a document that was added before serializing the index ({id: 1, title: 'apple' }):

index.search('ap') //attempt to search index for document added before serialising index

This is there error I receive:

TypeError: Cannot read property '0' of undefined
    at lunr.TokenStore.getNode (/home/danny/node_modules/lunr/lunr.js:1467:18)
    at lunr.TokenStore.get (/home/danny/node_modules/lunr/lunr.js:1491:15)
    at lunr.Index.documentVector (/home/danny/node_modules/lunr/lunr.js:908:30)
    at null.<anonymous> (/home/danny/node_modules/lunr/lunr.js:880:61)
    at Array.map (native)
    at lunr.SortedSet.map (/home/danny/node_modules/lunr/lunr.js:453:24)
    at lunr.Index.search (/home/danny/node_modules/lunr/lunr.js:879:6)

This is the code in its entirety:

var index = lunr(function () { //create index
    this.field('title')
    this.ref('id')
})
index.add({id: 1, title: 'apple' }) //add a document to index
console.log(index) //log index to demonstrate proper functionality

user.index = index.toJSON() //serialise index and store it to the database
user.save() //save changes to database

index = lunr.Index.load(user.index) //load previously serialised index from database
console.log(index) //log index to demonstrate proper functionality

index.add({id: 2, title: 'banana' }) //add a second document to index
console.log(index) //log index to demonstrate proper functionality

index.search('ap') //attempt to search index for document added before serialising index

Note: If, at the final line, I searched the index for 'ba', it would correctly return {id: 2, title: 'banana' }, which was added after loading the index. Also, if I pass a query that normally would not return any result, such as 'da', it would correctly not return anything, without error.

Thank you very much.

I found the bug. You can read my explanation in the first comment below.

DannyNemer commented 11 years ago

Found the bug!

As demonstrated below, when loading a serialized index, lunr.js incorrectly loads the serialized index's documentStore.

First, create an index, add a document to the index, and log the index.documentStore.store to show its correct form:

var index = lunr(function () { //create index
    this.field('title')
    this.ref('id')
})
index.add({id: 100, title: 'apple' }) //add document to index
console.log(index.documentStore.store) //log documentStore.store before serializing index

The following is the log of the index.documentStore.store in its correct form:

{ '100': { length: 1, elements: [ 'appl' ] } }

Then, serialize the index and store it to the database:

user.index = index.toJSON() //serialize index and store it in the database
user.save() //save changes to database

Load the serialized index and log the index.documentStore.store to show it in its incorrect form:

index = lunr.Index.load(user.index) //load previously serialized index from database
console.log(index.documentStore.store) //log incorrect documentStore after loading serialized index

The following is the log of the index.documentStore.store in its incorrect form:

{ '100': { length: 1, elements: { length: 1, elements: [Object] } } }

Bug: The bug occurs on line 962 of lunr.js in the lunr.Store.load() function, which is called in lunr.Index.load(). Simply, this is fixed by changing

memo[key] = lunr.SortedSet.load(serialisedData.store[key])

to

memo[key] = lunr.SortedSet.load(serialisedData.store[key].elements)
olivernn commented 11 years ago

Thanks for a very detailed bug report, it makes it much easier to try and get to the bottom of what is going wrong! Could you let me know what version of lunr you are using, both for serialising and loading the index? You can get this using lunr.version.

DannyNemer commented 11 years ago

I'm using version 0.3.0. But as you can see, I already found the bug in my follow-up comment and I submitted a pull request with the fix.

olivernn commented 11 years ago

Yep, I saw, just trying to re-produce the error locally myself.

olivernn commented 11 years ago

Ok, I've looked into this some more and I think it is an issue with how you are serialising/loading the index. The index object provides a toJSON method, this is provided so that when the index object is serialised using JSON.stringify the whole index is correctly serialised as JSON.

Using JSON.stringify will handle recursively correctly serialising each object to JSON. By directly calling toJSON on the index you are bypassing this. For most cases this doesn't matter, however in the case of the lunr.Store it's toJSON method is not calling toJSON on every object in the store.

lunr.Store.prototype.toJSON = function () {
    return {
        store: this.store,
        length: this.length
    }
}

The issue is then that index.toJSON does not return the same as JSON.parse(JSON.stringify(index)), I'm not sure this is a major issue, although I don't know your use case…

There are a couple of ways this could be fixed, both would be just trying to achieve more consistency.

DannyNemer commented 11 years ago

I see. Thank you for your explanation.

As is, index.toJSON() calls the following function:

lunr.Store.prototype.toJSON = function () {
    return {
        store: this.store,
        length: this.length
    }
}

This returns the following serialized index.documentStore:

"documentStore" : {
  "length" : 3,
  "store" : {
    "1" : {
      "length" : 1,
      "elements" : [
        "appl"
      ]
    },
    "2" : {
      "length" : 1,
      "elements" : [
        "orang"
      ]
    }
  }
}

As a result, when I use lunr.Index.load(index) later, followed by index.search('apple'), the error I explained above occurs. Am I correct to conclude that currently in lunr.js, a perviously serialized index cannot successfully be searched without receiving this error? I ask because you mentioned that this might be dependent on my use case.

For now, I have changed lunr.Store.prototype.toJSON to the following:

lunr.Store.prototype.toJSON = function () {
  return {
    store: JSON.parse(JSON.stringify(this.store)),
    length: this.length
  }
}

As you explained, using JSON.stringify correctly serializes each object in index.documentStore to JSON:

documentStore" : {
  "length" : 3,
  "store" : {
    "1" : [
      "appl"
    ],
    "2" : [
      "orang"
    ]
  }
}

As a result, index.search('apple') works as intended.

Furthermore, am I correct to conclude that the reason for your implementation of toJSON, instead of simply calling JSON.parse(JSON.stringify(index)), is to call lunr.Pipeline.warnIfFunctionNotRegistered when serializing the pipeline? Other than that and the issue discussed above, are toJSON and JSON.parse(JSON.stringify(index)) identical?

Thank you very much.

olivernn commented 11 years ago

You should absolutely be able to search within a index loaded from a previously serialised index without encountering errors!

The way to correctly serialise a lunr index is to just use JSON.stringify e.g. JSON.stringify(index). This will return a JSON string representing the index. You can then do whatever you want with this string, save it somewhere, send it to a web worker, wherever you need to start up a copy of the original index somewhere else at another time.

To re-load an index you then need to parse this json and pass it to lunr.Index.load e.g. lunr.Index.load(JSON.parse(serialisedIndex)). This will return you a new index with the exact same data as you original index, you will be able to search without any errors etc, if not there is a bug in lunr (please report it!).

To conclude, you probably shouldn't need to call toJSON directly on the index, it is provided for the internals of how JSON.stringify works. Hence my question about your use case, why do you call index.toJSON directly instead of using JSON.stringify(index)? Why do you want a plain object representation of the index rather than a JSON string?

DannyNemer commented 11 years ago

I misunderstood. The reason I have been calling index.toJSON instead of JSON.stringify(index) was because index.toJSON was documented as lunr's method for returning a representation of the index ready for serialization, and has additional functionality specific to lunr that JSON.stringify does not, such as ensuring pipeline functions are registered in lunr.Pipeline.prototype.toJSON. As a result, I was under the false impression that index.toJSON was preferred. Nevertheless, I now understand that I was mistaken, and lunr works as intended. Thank you very much for your help, as well as for your wonderful work.

olivernn commented 11 years ago

No worries, the documentation around serialisation could probably do with a little clarification, I'll see if I can make it a little clearer.