olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License

Slow indexing performance with documents with metadata #343

Open giuliac89 opened 6 years ago

giuliac89 commented 6 years ago

Hi Oliver,

indexing performance is slow when I index lots of documents with metadata. An example of a document to be indexed is this:

[image: documenttoindex]

And I build the index like this:

    this.index = lunr(function () {
        this.pipeline.remove(lunr.trimmer);
        this.pipeline.remove(lunr.stemmer);
        this.pipeline.remove(lunr.stopWordFilter);

        this.tokenizer = customTokenizer;
        this.tokenizer.separator = /[\s,.;:/?!()]+/;

        this.ref('xmlDocId');

        if (parsedElementsForIndexing[Object.keys(parsedElementsForIndexing)[0]].content.diplomatic) {
            this.field('diplomaticText');
            this.field('interpretativeText');
        }
        else {
            this.field('content');
        }

        this.use(addXmlDocTitleMetadata, parsedElementsForIndexing);
        this.use(addXmlDocIdMetadata, parsedElementsForIndexing);
        this.use(addParagraphMetadata, parsedElementsForIndexing);
        this.use(addPageMetadata, parsedElementsForIndexing);
        this.use(addPageIdMetadata, parsedElementsForIndexing);
        this.use(addLineMetadata, parsedElementsForIndexing);
        this.use(addDocIdMetadata, parsedElementsForIndexing);
        this.use(addPositionMetadata, parsedElementsForIndexing);

        for (var i in parsedElementsForIndexing) {
            document = map(parsedElementsForIndexing[i]);
            this.add(document);
        }
    });

Is there a way to improve performance?

olivernn commented 6 years ago

How slow is the indexing performance? How many documents are you indexing? It's difficult to know for sure without knowing what the addXMetadata plugins are doing.

How much metadata are you storing against each term? Storing a lot of metadata can make the index very large.

giuliac89 commented 6 years ago

I index about 1400 documents, and for each term I store 8 metadata fields. The plugin implementation is simple:

    function addXmlDocIdMetadata(builder, parsedElementsForIndexing) {
        var pipelineFunction = function (token) {
            var docIndex = builder.documentCount - 1;

            token.metadata['xmlDocId'] = parsedElementsForIndexing[Object.keys(parsedElementsForIndexing)[docIndex]].xmlDocId;

            return token;
        };

        lunr.Pipeline.registerFunction(pipelineFunction, 'xmlDocId');
        builder.pipeline.add(pipelineFunction);
        builder.metadataWhitelist.push('xmlDocId');
    }

Indexing takes around 30 seconds.
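One thing that may be worth checking (an assumption, not something confirmed in this thread): the plugin above calls `Object.keys(parsedElementsForIndexing)` for every token, and with eight such plugins that key array is rebuilt eight times per token. A minimal sketch that computes the keys once and sets all fields in a single pipeline function follows. The names `addAllMetadata` and `fieldNames` are hypothetical, and the sketch assumes every metadata value sits directly on the parsed element under the same name as its metadata key, as `xmlDocId` does above.

```javascript
// Sketch only: `addAllMetadata` and `fieldNames` are hypothetical names, not part of lunr.
function addAllMetadata(builder, parsedElementsForIndexing, fieldNames) {
  // Compute the key list once, instead of once per token in every plugin.
  // Assumes the set of parsed elements does not change while indexing.
  var keys = Object.keys(parsedElementsForIndexing);

  var pipelineFunction = function (token) {
    // Same lookup as the original plugin: the document currently being added.
    var element = parsedElementsForIndexing[keys[builder.documentCount - 1]];

    for (var i = 0; i < fieldNames.length; i++) {
      token.metadata[fieldNames[i]] = element[fieldNames[i]];
    }

    return token;
  };

  // Register under a label, as the original plugin does. The guard only
  // exists so the sketch can also run where lunr is not loaded.
  if (typeof lunr !== 'undefined') {
    lunr.Pipeline.registerFunction(pipelineFunction, 'allMetadata');
  }

  builder.pipeline.add(pipelineFunction);
  Array.prototype.push.apply(builder.metadataWhitelist, fieldNames);
}
```

It would replace the eight `this.use(...)` calls with one, for example `this.use(addAllMetadata, parsedElementsForIndexing, ['xmlDocId'])`, with the array extended by the other field names. Whether this accounts for a meaningful share of the 30 seconds would need to be measured.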

olivernn commented 6 years ago

Does the 30 seconds include serialising the index to JSON, e.g. are you doing this:

    JSON.stringify(idx)

If you can share the data you are indexing, as well as the code to do the indexing, I can take a look.
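To tell the two costs apart, the build and the serialisation can be timed separately. A minimal sketch: `timed` is a hypothetical helper, and `buildIndex` in the usage comment stands for whatever function wraps the `lunr(function () { ... })` call and returns the index.

```javascript
// Hypothetical helper: runs fn once and reports its result and elapsed milliseconds.
function timed(fn) {
  var start = Date.now();
  var result = fn();
  return { result: result, ms: Date.now() - start };
}

// Usage (assumes buildIndex() returns the lunr index):
//   var build = timed(buildIndex);
//   var serialise = timed(function () { return JSON.stringify(build.result); });
//   console.log('build ms:', build.ms, 'serialise ms:', serialise.ms);
```

If most of the time shows up in the second measurement, the size of the stored metadata is the more likely culprit than the pipeline itself.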

giuliac89 commented 6 years ago

It doesn't include serialising. You can find the project I'm working on at https://github.com/evt-project/evt-viewer; my branch is feature/search. After setting up the environment, you have to create a data folder under the app folder and add an XML file to it. My code is in app/src/dataHandler/search/searchIndex.service.js. You can download the XML file here: https://github.com/arojascastro/soledades/blob/master/edicion/anotada/input/soledades_anotada.xml

Follow the readme file for more information. Thanks!