Hi
I've been playing with the serializer. Great stuff! I'm encountering a bug though, which I could not resolve.
See http://jsbin.com/upabeb/1 for my test case (check the source). Inside is a build of the lunr.js serialise branch, followed by my dataset, followed by a minimal test script. Check the console for output.
Essentially what I do is load a big dataset. Then I search for 'lady', which returns 21 results. Searching in the serialised-then-deserialised index returns an empty set. Clearly some data is lost.
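For reference, my test script boils down to roughly this (the full dataset and exact calls are in the jsbin; the round-trip API is the one from the serialise branch):

```javascript
// Condensed sketch of the repro: `documents` stands in for my dataset.
var idx = lunr(function () {
  this.field('body')
})

documents.forEach(function (doc) { idx.add(doc) })

idx.search('lady').length // => 21

// Serialise and load the index back, then repeat the same search.
var restored = lunr.Index.load(JSON.parse(JSON.stringify(idx)))
restored.search('lady').length // => 0, results are lost in the round trip
```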
I saw your test case, which essentially does the same thing, but that one runs fine. The only difference is the bigger dataset, I guess?
Also, I've been looking into ways to optimise the size of the index. Currently it compiles to 900k for me. Most of the overhead is in the `tokenStore` part. I was wondering if it's possible to quickly compute (a part of) the `tokenStore` in some way. What's your guess?
A small enhancement can be made in the `documentStore`: instead of copying the strings, we could reference the index of each string in `corpusTokens`.
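Something like this hypothetical sketch is what I have in mind (the names mirror lunr's internals, but the encoding itself is just an idea):

```javascript
// Instead of storing token strings per document, store their
// positions in the shared corpusTokens array.
var corpusTokens = ['fair', 'lady', 'london']

function packDocumentTokens (tokens) {
  return tokens.map(function (token) {
    return corpusTokens.indexOf(token) // an integer, not a copied string
  })
}

function unpackDocumentTokens (indices) {
  return indices.map(function (i) { return corpusTokens[i] })
}

packDocumentTokens(['london', 'lady']) // => [2, 1]
```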
Great stuff!
Hey, sorry for the late reply, I've been away on holiday.
That is a strange issue you are seeing, I'll take a look and see what's going on.
As for the size of the serialised index, there are probably ways to store the `tokenStore` more efficiently; for example, several characters could be combined into a single node so as to lessen the number of nodes required. I haven't looked into how achievable this is though, it's just a thought.
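Roughly, I mean something like this (a hypothetical sketch over a node shape like the `tokenStore`'s, where `children` maps characters to child nodes and `docs` marks tokens that end at a node):

```javascript
// Collapse chains of single-child nodes into one node keyed by the
// combined characters, so fewer nodes need to be serialised.
function compress (node) {
  Object.keys(node.children || {}).forEach(function (key) {
    var child = node.children[key]

    // Keep merging while the child has exactly one child of its own
    // and doesn't terminate a token (no docs attached to it).
    while (Object.keys(child.children || {}).length === 1 &&
           Object.keys(child.docs || {}).length === 0) {
      var onlyKey = Object.keys(child.children)[0]
      delete node.children[key]
      key = key + onlyKey // combine the characters into one edge
      child = child.children[onlyKey]
      node.children[key] = child
    }

    compress(child)
  })

  return node
}
```

The lookup code would of course have to understand multi-character edges, so this only sketches the serialisation saving.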
Okay, so I think the problem you were having with the serialised branch is due to the pipeline not being serialised. This worked in my example because the pipeline was empty; I assume in your example it was not.
I'm not sure of the best way to handle this, since functions cannot be serialised into JSON. I see a number of options:

1. Don't serialise the pipeline at all, and require the pipeline to be set up again manually after loading an index.
2. Give pipeline functions names and register them with lunr, so a serialised index only stores the names and the functions can be looked up again on load.

Whilst the first option is the simplest and requires the least amount of change, I think it would cause too many problems, and it's not particularly obvious. I think when you load an index it should be ready to go, without any extra set-up.
Having to register pipeline functions is a bit of an overhead; for the built-in pipeline functions this isn't a problem, but anyone adding extra pipeline functions would need to make sure they are registered before trying to load an index.
I'm leaning towards option 2, however I'm keen to get any ideas or feedback.
Hey Oliver, hope you had a good holiday :)
My preference is to use named functions too. Maybe we can have a fallback in which we supply functions encoded as strings, which will be `eval`ed by lunr. This is a security risk, but inside closed networks (intranets) the risk is low.
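As a sketch of what I mean (assuming the function source is captured with `toString`, and only for indexes from a trusted source):

```javascript
// Serialise a pipeline function as its source string, then rebuild
// it on load with eval. Only acceptable on a closed network.
var stemmer = function (token) { return token /* ... */ }

var serialised = stemmer.toString()          // "function (token) { ... }"
var restored = eval('(' + serialised + ')')  // back to a callable function

console.log(restored('lady')) // behaves like the original function
```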
@ssured Yep, I've gone with naming the pipeline functions. They have to be registered with lunr before they can be successfully serialised. The included pipeline functions are automatically registered so this will only affect people using custom pipeline functions.
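For anyone using custom pipeline functions, registration looks something like this (the function itself is just an illustration):

```javascript
// Register the custom function with lunr so it can be serialised by
// name and looked up again when the index is loaded.
var stripNumbers = function (token) {
  return token.replace(/\d/g, '')
}

lunr.Pipeline.registerFunction(stripNumbers, 'stripNumbers')

var idx = lunr(function () {
  this.field('body')
  this.pipeline.add(stripNumbers)
})
```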
Please try out the latest version with your example again; I think it should solve the problem you were having before.
I'm aiming to get a 0.3.0 release out with this feature in a week's time or so.
The serialise functionality has been released in version 0.3.0 of lunr.
An example can be seen in `example/index_builder.js`. Basically, the index can now be serialised using `JSON.stringify`. The output can then be loaded again using `lunr.Index.load(serialisedIndex)`.
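In short (the fields and document here are illustrative):

```javascript
// Build an index, serialise it, and load it back again.
var idx = lunr(function () {
  this.field('title')
  this.field('body')
})

idx.add({ id: 1, title: 'My Fair Lady', body: '...' })

var serialisedIndex = JSON.stringify(idx)

// e.g. write serialisedIndex to disk, ship it to the browser, then:
var loadedIdx = lunr.Index.load(JSON.parse(serialisedIndex))
loadedIdx.search('lady') // same results as searching idx directly
```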
This branch includes early support for dumping and loading an index. The internals may change, but the interface is likely to remain the same; this branch will be released as v0.3.0. I haven't had a chance to update the docs, I'll probably get around to releasing this properly in a week or so.
The index now supports serialisation via `JSON.stringify`. A dumped index can then be reloaded via `lunr.Index.load`.
I have updated the example, in `/example`, to use a pre-built index. The load time is significantly faster! I've also included a simple node script for pre-building the index. I imagine that something similar could be extracted into a standalone tool, but that is for another day.
Any feedback is welcome.
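The pre-building script boils down to something like this (file names and document shape here are assumptions; see `example/index_builder.js` for the real thing):

```javascript
// Build the index once in node and write it to disk, so the browser
// only has to load the pre-built JSON.
var fs = require('fs'),
    lunr = require('./lunr.js')

var idx = lunr(function () {
  this.field('title', { boost: 10 })
  this.field('body')
})

JSON.parse(fs.readFileSync('./documents.json', 'utf8')).forEach(function (doc) {
  idx.add(doc)
})

fs.writeFileSync('./index.json', JSON.stringify(idx))
```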