lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Suggestion: Add the ability to index an array field #29

Closed CheloXL closed 4 years ago

CheloXL commented 4 years ago

Hi, I currently need to index all the strings found in a field that is an array. Right now I'm creating an object with several fields (item1, item2... item9), assigning the array values to each field, and then indexing that document. Of course, this approach has several limitations, but right now I don't see any other way to do it.

Could you add support to index an array of strings as value of a field? Thanks in advance!

lucaong commented 4 years ago

Hi @CheloXL, thanks for your question. If I understand correctly what you need, this feature is already implemented in MiniSearch. You can specify custom field extraction logic to handle the array field. Here's how:

// Assuming that our documents look like:
const documents = [
  { id: 1, title: 'Moby Dick', tags: ['fiction', 'whale'] },
  { id: 2, title: 'Zen and the Art of Motorcycle Maintenance', tags: ['fiction', 'zen'] },
  { id: 3, title: 'Neuromancer', tags: ['fiction', 'cyberpunk'] },
  { id: 4, title: 'Zen and the Art of Archery', tags: ['non-fiction', 'zen'] },
  // ...and more
]

// We can support array fields (tags) with a custom `extractField` function,
// for example just joining the tags by space:
let miniSearch = new MiniSearch({
  fields: ['title', 'tags'],
  extractField: (document, fieldName) => {
    const value = document[fieldName]
    // If field value is an array, join it by space
    return Array.isArray(value) ? value.join(' ') : value
  }
})

This way, the whole array will be indexed as one field. Does this meet your use case?

CheloXL commented 4 years ago

I already know that I could do so. My problem is that each entry in the array is a complete sentence, not a single word. Would that not break the order of the results? You use the field length to calculate the score, and in this case that field will vary greatly in size (from two words, the minimum for a single entry, to more than 10 or 12 words for about 4 entries).

lucaong commented 4 years ago

But if the array field was indexed as a single one, wouldn't that be the same? For example, let's assume that we have a tags field in our documents, that can contain one or more tags. Imagine that you have these three documents:

[
  { id: 1, tags: ['tomato'] },
  { id: 2, tags: ['tomato', 'banana', 'apple', 'mango', 'peach'] },
  { id: 3, tags: ['banana', 'mango'] }
]

If I search for tomato, it sounds reasonable that the result contains documents 1 and 2, with document 1 having a higher score: both match tomato, but document 1 is more specific.

As a matter of fact, the current API gives you the choice: you can either consider the whole array as a single field (and therefore weight terms by their prevalence), or index each element as a different field.

But maybe I am misunderstanding your need? If so, can you give me an example?

CheloXL commented 4 years ago

My objects would look like:

[
  { id: 1, name: "The one", akas: ["The famous", "The incredible shrinking"] },
  { id: 2, name: "Another sample", akas: ["More than an aka", "Also known as"] },
  ...
]

So, joining the akas would produce different weights than putting each entry in its own field. In the first example, the "field" value would be the famous the incredible shrinking, giving the second entry in the array less weight (by prevalence) than if it were at the start.

Or am I the one who is misunderstanding the way it works?

lucaong commented 4 years ago

Ah, no, "The incredible shrinking" would have the same importance as "The famous". The field length (in terms, not in tokens) is used to compute the score, but the position within the field does not affect the score.

For example, suppose you have a field with value "apple peach tomato banana". If you search for apple or for tomato you get the same score for this document: in both cases, 1 out of 4 terms in the field matches the search. You would get a lower score if the field had the value "apple peach tomato banana apricot mango", because now only 1 out of 6 terms matches.
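To make that arithmetic concrete, here is a toy sketch of the "1 out of N terms" reasoning. It is not MiniSearch's actual scoring formula: `matchRatio` is a hypothetical helper, and it assumes terms are simply lowercased and split on whitespace.

```javascript
// Hypothetical helper, NOT part of MiniSearch: the fraction of terms in a
// field that equal the search term. It only illustrates the "1 out of N"
// reasoning, where position plays no role but field length does.
function matchRatio(fieldValue, searchTerm) {
  const terms = fieldValue.toLowerCase().split(/\s+/).filter(Boolean)
  if (terms.length === 0) return 0
  const wanted = searchTerm.toLowerCase()
  const matches = terms.filter((term) => term === wanted).length
  return matches / terms.length
}

const shortField = 'apple peach tomato banana'
const longField = 'apple peach tomato banana apricot mango'

console.log(matchRatio(shortField, 'apple'))  // 0.25 (first position)
console.log(matchRatio(shortField, 'tomato')) // 0.25 (third position, same ratio)
console.log(matchRatio(longField, 'tomato'))  // 1 out of 6, lower than 0.25
```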

Does this clarify that?

CheloXL commented 4 years ago

Ahh... perfect then. I thought the position of the word changed the weight too. Since that's not the case, yes, I could use the extractField feature as noted above. Thanks!

lucaong commented 4 years ago

Great :)

Just to clarify, there is a difference between the two techniques, so it makes sense to choose the one you prefer. Let's say you have these documents:

[
  { id: 1, name: "The one", akas: ["The famous", "The incredible shrinking", "The well known"] },
  { id: 2, name: "The great", akas: ["The supreme", "The famous"] },
  ...
]

If you index each element of the akas array as a different field, you end up with something like this:

[
  {
    id: 1,
    name: "The one",
    akas_1: "The famous",
    akas_2: "The incredible shrinking",
    akas_3: "The well known"
  },
  {
    id: 2,
    name: "The great",
    akas_1: "The supreme",
    akas_2: "The famous",
    akas_3: null
  },
  ...
]

Therefore, if you search for famous in all fields, you will find both documents (they both match) with the same score (in both cases 1 out of 2 terms match, in the akas_1 field for document 1 and akas_2 for document 2). One drawback is that you have to decide upfront how many akas_<n> fields you can have.
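As a rough sketch, the per-element variant does not require reshaping the documents: the same `extractField` option shown earlier can expose each array entry as a virtual field. The names `MAX_AKAS` and the `akas_<n>` pattern are assumptions made for illustration, and returning an empty string for missing entries is also an assumption; it is worth checking how your MiniSearch version treats empty or missing field values.

```javascript
// Hypothetical upper bound on indexed array entries; it must be decided upfront.
const MAX_AKAS = 3

// Field names to index: ['name', 'akas_1', 'akas_2', 'akas_3']
const fields = ['name']
for (let i = 1; i <= MAX_AKAS; i++) fields.push(`akas_${i}`)

// Candidate `extractField` function: maps the virtual field `akas_<n>` to the
// n-th entry of `document.akas`, and reads every other field directly.
function extractField(document, fieldName) {
  const match = /^akas_(\d+)$/.exec(fieldName)
  if (match === null) return document[fieldName]
  const akas = Array.isArray(document.akas) ? document.akas : []
  const entry = akas[Number(match[1]) - 1]
  // Assumption: fall back to an empty string when the document has fewer entries.
  return entry === undefined ? '' : entry
}

const sample = { id: 2, name: 'The great', akas: ['The supreme', 'The famous'] }
console.log(fields.map((field) => extractField(sample, field)))
// → ['The great', 'The supreme', 'The famous', '']
```

Both values would then be passed to the constructor as `new MiniSearch({ fields, extractField })`.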

If, instead, you join all of them in one field, you end up with this:

[
  {
    id: 1,
    name: "The one",
    akas: "The famous The incredible shrinking The well known"
  },
  {
    id: 2,
    name: "The great",
    akas: "The supreme The famous"
  },
  ...
]

In this case, if you search for famous, you still get both documents, but document 2 will have a slightly higher score, as 1 out of 4 terms matches in akas, whereas document 1 matches 1 out of 8. The fact that "famous" occurs earlier in document 1 has no effect on the score, while the fact that its akas field contains more terms does.
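The term counts quoted for the two techniques can be checked with a small sketch. Again, this is only an illustration under simplifying assumptions: `matchCount` is a hypothetical helper that splits on whitespace, not the scoring MiniSearch performs.

```javascript
// Hypothetical helper, NOT MiniSearch's scoring: returns
// [number of matching terms, total number of terms] for a field value.
function matchCount(fieldValue, searchTerm) {
  const terms = fieldValue.toLowerCase().split(/\s+/).filter(Boolean)
  return [terms.filter((term) => term === searchTerm).length, terms.length]
}

const documents = [
  { id: 1, name: 'The one', akas: ['The famous', 'The incredible shrinking', 'The well known'] },
  { id: 2, name: 'The great', akas: ['The supreme', 'The famous'] }
]

for (const doc of documents) {
  // Technique 1: each entry is its own field; report the entry that matches.
  const perEntry = doc.akas
    .map((aka) => matchCount(aka, 'famous'))
    .find(([matches]) => matches > 0)
  // Technique 2: all entries joined into a single field.
  const joined = matchCount(doc.akas.join(' '), 'famous')
  console.log(doc.id, perEntry, joined)
}
// 1 [ 1, 2 ] [ 1, 8 ]
// 2 [ 1, 2 ] [ 1, 4 ]
```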

In most cases, this second approach is preferable and more flexible: the array can be as long as it needs to be, you still find every match, and document 2 is considered slightly more specific.