samhh / bukubrow-webext

WebExtension for Buku
GNU General Public License v3.0
312 stars 22 forks source link

Fuzzy finding #121

Open EmilianoCostantini opened 5 years ago

EmilianoCostantini commented 5 years ago

Buku has interesting options involving search with regexps, such as:

It would be great if they could be leveraged from within Bukubrow searchbox, maybe by means of dedicated settings in the addon Preferences.

Consider for instance a bookmark having

as substring in either title or comments. Such bookmark should appear in Bukubrow search results for each of the following keyword combinations:

just like it does in the ordinary bookmark sidebar's searchbox (Ctrl-B).

samhh commented 5 years ago

Is this feature request essentially to add fuzzy matching? That's something I've wondered about.

EmilianoCostantini commented 5 years ago

Fuzzy matching would be awesome, but even more taxing than what I was thinking. My idea was deterministically calculate the keywords permutations, then add regExps. No stochasticity involved.

Say for instance the user enters in the searchbox the keywords:

That would be 4 different words, that can be ordered in 4! (that is, 24) different ways:

  1. xxxxxx yyyyyy zzzzzz wwwwww
  2. xxxxxx yyyyyy wwwwww zzzzzz
  3. xxxxxx wwwwww zzzzzz yyyyyy
  4. wwwwww yyyyyy zzzzzz xxxxxx
  5. zzzzzz yyyyyy xxxxxx wwwwww
  6. ...

(and so on and so forth, up till the 24th possible permutation.)

Each permutation should be converted to regexp:

  1. /([\s\S]*)xxxxxx([\s\S]*)yyyyyy([\s\S]*)zzzzzz([\s\S]*)wwwwww([\s\S]*)/i
  2. /([\s\S]*)xxxxxx([\s\S]*)yyyyyy([\s\S]*)wwwwww([\s\S]*)zzzzzz([\s\S]*)/i
  3. /([\s\S]*)xxxxxx([\s\S]*)wwwwww([\s\S]*)zzzzzz([\s\S]*)yyyyyy([\s\S]*)/i
  4. /([\s\S]*)wwwwww([\s\S]*)yyyyyy([\s\S]*)zzzzzz([\s\S]*)xxxxxx([\s\S]*)/i
  5. /([\s\S]*)zzzzzz([\s\S]*)yyyyyy([\s\S]*)xxxxxx([\s\S]*)wwwwww([\s\S]*)/i
  6. ...

Then each bookmark matching at least one of the 24 regexps in either title or description/comments should be added to results.

As you can check on RegExr, such a bookmark would be —for instance— one containing the substrings:

regardless both the order and the case, just as long as each and every keyword is there.


Even making it as efficient as possible, though, such kind of business logic could relevantly impact performances; therefore it could be good to make it optional by means of dedicated item/s in the addon Preferences page.

samhh commented 5 years ago

I have surprisingly little knowledge about algorithms so I'll have to do some research on what you've touched on in your comment before coming back to this ticket, but I would like to incorporate a feature like this at some point. It might be a while though, I need to prioritise performance issues with enormous databases.

EmilianoCostantini commented 5 years ago

You're doing a priceless job, really. The uncanny decision to remove the 'Description' field by Mozilla folks has caused me some serious discomfort. Buku coupled with your addon could really be a life saver for those who, like me, heavily depend on such field.

P.S. as you can see I've tried to move the feature request upstream; this seems to make more sense to me.

samhh commented 5 years ago

Ah, a heavy user of the description field, you are the perfect target audience for a question I've been pondering.

Which of these would be preferable to you?

EmilianoCostantini commented 5 years ago

The name is almost negligible to me, therefore I would choose name and description without hesitation :)

samhh commented 5 years ago

Note to self: Given that there's work ongoing to try and move the filtering logic over to the host, when that happens it may be possible to easily implement fuzzy finding for each regex-matched group using a library like this. fzf (written in Go) is a good benchmark for how intuitive the matching can be.