olivernn / lunr.js

A bit like Solr, but much smaller and not as bright
http://lunrjs.com
MIT License
8.87k stars, 546 forks

Not all tokens are being indexed #487

Open npearson72 opened 3 years ago

npearson72 commented 3 years ago

I have the following setup:

    const records = [
      {
        id: 1,
        title: 'Test 1',
        description: 'It is her couch',
        url: 'www.example.com',
        tags: 'a,b,c'
      },
      {
        id: 2,
        title: 'Test 2',
        description: "The couch is her's",
        url: 'www.sample.com',
        tags: 'x,y,z'
      }
    ];

    const idx = lunr(function () {
      this.ref('id');
      this.field('title', { boost: 100 });
      this.field('description');
      this.field('tags');
      this.field('url');

      this.pipeline.remove(lunr.stemmer);
      this.searchPipeline.remove(lunr.stemmer);

      for (const record of records) {
        const tunedRecord = {
          ...record,
          description: record.description,
          tags: record.tags.split(','),
          url: record.url.split(/\W+/)
        };

        this.add(tunedRecord);
      }
    });

The resulting invertedIndex is:

[Screenshot of the resulting invertedIndex, 2021-01-17]

Notice that her's (from record 2) was indexed, but her (from record 1) was not.

The same happens if I remove record 2. A single record with the word her does not get indexed.

Is there a trick to this or is this a bug?

FYI: I have found other similar occurrences.

ramirezmike commented 3 years ago

I ran into this too and realized what was going on

see #480

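Basically, the default indexing pipeline includes a function, stopWordFilter, that drops a set of common words (including "her", or in my case "get"). If you want those words indexed, remove that function from the indexing pipeline inside your config function (removing it from searchPipeline does not affect what gets indexed):

this.pipeline.remove(lunr.stopWordFilter)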
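
To make the behaviour concrete, here is a small dependency-free sketch of how a stop word filter in an indexing pipeline would drop "her" while letting "her's" through. It only imitates the idea and is not lunr's actual code: `STOP_WORDS`, `trimmer`, `stopWordFilter` and `indexTokens` are hypothetical names defined locally, and the word list is a tiny subset picked for this example.

```javascript
// Simplified imitation of an indexing pipeline. NOT lunr's real implementation.
// STOP_WORDS is a small, hypothetical subset chosen for illustration only.
const STOP_WORDS = new Set(['it', 'is', 'her', 'the', 'get']);

// Strip leading and trailing non-word characters; inner ones (like ') are kept.
const trimmer = (token) => token.replace(/^\W+/, '').replace(/\W+$/, '');

// Returning undefined drops the token from the index.
const stopWordFilter = (token) => (STOP_WORDS.has(token) ? undefined : token);

// Run each whitespace-separated token through every pipeline function in order.
function indexTokens(text, pipeline) {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((token) =>
      pipeline.reduce((t, fn) => (t === undefined ? t : fn(t)), token)
    )
    .filter((token) => token !== undefined && token !== '');
}

const withFilter = [trimmer, stopWordFilter];
const withoutFilter = [trimmer];

const a = indexTokens('It is her couch', withFilter);
const b = indexTokens("The couch is her's", withFilter);
const c = indexTokens('It is her couch', withoutFilter);

console.log(a); // [ 'couch' ]
console.log(b); // [ 'couch', "her's" ]
console.log(c); // [ 'it', 'is', 'her', 'couch' ]
```

In this sketch "her's" survives only because it is not an exact match for any entry in the list, which would explain why one record's token shows up in the index and the other's does not.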

cheers!

npearson72 commented 3 years ago

I think my point is that the bug is in the stopWordFilter itself.