Prebuilt index documentation / schema

budziq commented 6 years ago

Hi,

The Rust community is currently considering adding search to our documentation sites (mostly static content generated from Markdown via https://github.com/azerupi/mdBook/).

We think lunr.js would be a good fit. The only problem is that the current instructions for prebuilding the index require a JavaScript execution environment, and we would prefer to generate the index without that additional dependency. Unfortunately, I cannot find any specification or docs for index.json that would let us generate it ourselves.

Could you point us in the right direction?

olivernn commented 6 years ago

There is a repository with a JSONSchema description of the index.json file that Lunr expects. It is marked WIP, but mostly because I still haven't gotten around to using it from any other environment yet. I'm more than happy to work with you on making sure it is straightforward to integrate with the schema; let me know how I can help.

As for implementing a backend, the basics that you will need are tokenisation and (maybe) a stemmer. In both cases I'm sure it will be possible to outperform the current JavaScript implementations (since they are constrained by having to work in a browser). The backend will also have to generate the vector space for the documents, which is probably the bulk of the work.
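As a rough illustration of the first of those steps, here is a minimal Rust sketch of tokenisation and per-field term counting. The function names `tokenize` and `term_frequencies` are hypothetical, and the splitting rule (whitespace and hyphens, then lowercasing) is a simplifying assumption rather than a port of Lunr's tokenizer or pipeline. It does no stemming and does not compute the weights that the final vectors would need.

```rust
use std::collections::BTreeMap;

/// Split text on whitespace and hyphens and lowercase each token.
/// Assumption: this is a simplification, not a port of Lunr's tokenizer.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| c.is_whitespace() || c == '-')
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Count how often each term occurs in one field of one document.
/// These raw counts are only the input to a vector space; turning them
/// into weighted vectors is a separate step that is not shown here.
fn term_frequencies(text: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for token in tokenize(text) {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}
```

A `BTreeMap` is used so that terms come out in a stable, sorted order, which makes generated output easier to diff between builds.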

Finally, it is worth trying to get a handle on the size of the index that will be generated. One of the reasons I created the schema definition was as a first step towards a binary format for the index, in the hope of reducing the size of what goes over the network. Again, I'm more than willing to do more work here if it helps with adoption in the Rust documentation.

budziq commented 6 years ago

Thanks for the help @olivernn !

We'll look into it. It may take some time before someone actually starts development, but we'll be sure to reach out if we have any questions :smile:

olivernn commented 6 years ago

I've read through the mdBook issue. One thing you might consider is getting tantivy to do the hard work of building the index, and then figuring out a way to export what it generates in the schema that Lunr expects. I'm not at all familiar with tantivy, so I don't know how feasible that is, but this approach would reduce the problem to just writing some serialisation code.

budziq commented 6 years ago

@olivernn Thanks! I'll look into tantivy and possibly reach out to them (it currently builds only with Rust nightly, but that could be fixed).

drzraf commented 6 years ago

How can I find out the documentCount for a prebuilt index? (There is no builder involved.)

olivernn commented 6 years ago

@drzraf why do you want to know the number of documents?

Depending on the implementation it might be possible to infer the number of documents since the serialised index includes the list of fields and there is a fieldVector per field per document. I say it depends on the implementation because if a document does not contain a field it might not appear in the field vectors.
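A minimal sketch of that inference, in Rust, assuming each field vector is keyed by a string of the form `fieldName/docRef` (for example `title/1`) and that those keys have already been extracted from the serialised index. The helper names `doc_ref` and `count_documents` are hypothetical, and JSON parsing is left out. As noted above, the result is only a lower bound if documents with missing fields do not appear in the field vectors.

```rust
use std::collections::HashSet;

/// Extract the document reference from a key such as "title/1".
/// Assumption: the field name comes first and contains no '/',
/// so everything after the first '/' is the document reference.
fn doc_ref(field_ref: &str) -> Option<&str> {
    field_ref.split_once('/').map(|(_field, doc)| doc)
}

/// Count the distinct document references across all field vector keys.
/// Keys without a '/' separator are skipped.
fn count_documents(field_refs: &[&str]) -> usize {
    field_refs
        .iter()
        .copied()
        .filter_map(doc_ref)
        .collect::<HashSet<_>>()
        .len()
}
```

For example, the keys `title/1`, `body/1` and `title/2` refer to two distinct documents, so `count_documents` returns 2 for them.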

drzraf commented 6 years ago

In my automated tests I either build or load the index.

One of my tests checks that the index contains more than X documents. If the index is built, I can get the count from the builder, but if it is loaded I have no way to check that the index is valid.

olivernn / lunr.js

Prebuilt index documentation / schema #299