krisk / Fuse

Lightweight fuzzy-search, in JavaScript
https://fusejs.io/
Apache License 2.0
17.89k stars 759 forks source link

Indices end index is 1 less than expected value. #212

Closed mattmazzola closed 4 years ago

mattmazzola commented 6 years ago

I wanted to use the indices from the match object to format the matched text such as:

text: Hey
indices: [0,1]

Output: <span class="match">H</span><span>ey</span>

However, the indices returned by fuse have the start and end match on the same character.

You can see this on the Fuse.js demo page:

Simply type o into the search and it returns [0,0] for indices as it matches 'O' in the 'Old Man's War'

[
  {
    "item": {
      "title": "Old Man's War",
      "author": {
        "firstName": "John",
        "lastName": "Scalzi"
      }
    },
    "matches": [
      {
        "indices": [
          [
            0,
            0
          ]
        ],
        "key": "title"
      },
...

Expected Result: [0,1]

I feel like this would be obvious to be a bug so perhaps it's by design but can someone help me understand how it's intended to be used to highlight text this way?

All the string operations such as slice, substr and substring would use 0, 1 not 0, 0 to get the matched characters so I ended up manually adding 1 to all the end indices to make them useful and this seems odd.

JM-Mendez commented 6 years ago

I just started using this library, but I think this is because character positions are numbered by the place where the cursor is, which is right before the character. So the correct start and end position of the matched char 'o' would be [0, 0], indicating that only that character was matched. If the result said [0,1], that would imply that the next character was matched as well.

You're right that in order to highlight it you'd have to increase the end value by one. This essentially means "highlight the string starting with the character at position [0] until you reach the character at position [1]"

mattmazzola commented 6 years ago

I think this is because character positions are numbered by the place where the cursor is, which is right before the character

I don't understand how cursors are related here. I assumed (maybe incorrectly) that the library is primarily used for searching through items so your cursor would be following the search text and not related to the matching indicies which may be split up into separate groups even though search term is single string.

If you select the first character in a text block and return the selection object you get: image

Which has 0 and 1 for offsets.

I've seen other libraries report indices using the same way as Fuse where each character being index instead of the spaces between/around and I assume there is good reason, but I was more interested why. I can easily see both ways; however, one does seem more useful than the other for consumers. From perspective of library author maybe internally simplifies logic, but for consumers you pass complexity on to them to always increment the last index for any use.

That all being said, I assume it's probably too late to make breaking change for this library anyway. Perhaps in future it could be an option when you configure Fuse object to have the library output incremented end indices for you

JM-Mendez commented 6 years ago

I don't understand how cursors are related here.

I only mentioned the cursor position because that is the anchorOffset when the selection range is collapsed. It was a poorly worded explanation I suppose.


I'm only making assumptions here since I have no connection to the library besides using it. But my guess is that you're right and the main purpose of this library is not highlighting found searches. Since it uses a Bitap implementation of the Levenshtein distance we can also target spell checkers, OCR, etc.

Here's a contrived example. Suppose I wanted to fuzzy search and highlight OCR mistakes. Let's use "Old men" as the incorrect rendition, and "Old man" as the correct term. Using your proposed indices we'd end up with matches like so:

"matches": [
      {
        "indices": [
          [ 0, 5 ],
          [ 6, 7 ]
          ],
          [ 9, 10 ]
        ],

Now the consumer would have to implement weird logic to highlight the character at index 5 ("e") as the incorrect character (should be "a"). Also imagine how another dev would feel in a few months tracking down a bug wondering why indices 0-7 are matches, but only index 5 is highlighted as incorrect.

By keeping the indices to only the matching indices we get

"matches": [
      {
        "indices": [
          [ 0, 4 ],
          [ 6, 6 ]
          ],
          [ 9, 9 ]
        ],

It's now clear where the edit distance is, and makes it much easier to target incorrect characters, with the only drawback of increasing the end index in order to highlight. I don't know if this was the author's intent, but this is how I made sense of it (like you I was also confused as to why).

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days