lucaong / minisearch

Tiny and powerful JavaScript full-text search engine for browser and Node
https://lucaong.github.io/minisearch/
MIT License

Adding single field (extending fieldlist) to document #170

Closed KeKs0r closed 2 months ago

KeKs0r commented 2 years ago

Hi Luca,

I mentioned this single use case in another ticket, where we discussed several things: https://github.com/lucaong/minisearch/issues/106

I wanted to create a dedicated issue for this specific case. I am powering a filter/search of a table with MiniSearch, and users can add and remove columns from that table. Recreating the index is quite a performance hit on that page, and I am trying to make reindexing faster when the data changes. One of these scenarios is adding a column. Currently I am creating a completely new index just to add or remove a single field.

It would be great if there were an API to add or remove a single field from the index. Even doing it for every individual document, with the current value of the field, would be fine.

lucaong commented 2 years ago

Hi @KeKs0r, thanks for opening the issue. Let me think about it and come back to you.

lucaong commented 2 years ago

In short, the main challenge is that MiniSearch internally uses short document IDs for various efficiency reasons. Currently, it does not keep a mapping from original ID to short ID (only the opposite way around), so it doesn't know how to add a field to an existing document identified by ID.
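To illustrate the one-way mapping described above, here is a minimal sketch. It is not MiniSearch's actual internals: `ShortIdIndex`, `tokenize`, `documentIds`, and `postings` are hypothetical names chosen for the example. The index hands out a numeric short ID per document and only remembers how to get from the short ID back to the original one.

```javascript
// Hypothetical, heavily simplified index; not MiniSearch's real data structures.
class ShortIdIndex {
  constructor(fields) {
    this.fields = fields;         // names of the fields to index
    this.nextShortId = 0;         // counter used to hand out short IDs
    this.documentIds = new Map(); // short ID -> original ID (one direction only)
    this.postings = new Map();    // term -> Set of short IDs
  }

  add(doc) {
    const shortId = this.nextShortId++;
    this.documentIds.set(shortId, doc.id);
    for (const field of this.fields) {
      for (const term of tokenize(doc[field])) {
        if (!this.postings.has(term)) this.postings.set(term, new Set());
        this.postings.get(term).add(shortId);
      }
    }
  }

  // Search results are translated back to original IDs at the very end.
  search(term) {
    const shortIds = this.postings.get(term.toLowerCase()) ?? new Set();
    return [...shortIds].map((shortId) => this.documentIds.get(shortId));
  }
}

// Naive tokenizer, assumed for illustration only.
function tokenize(value) {
  return String(value ?? '').toLowerCase().split(/\W+/).filter(Boolean);
}

const idx = new ShortIdIndex(['title']);
idx.add({ id: 'row-17', title: 'Quarterly report' });
idx.add({ id: 'row-42', title: 'Annual report' });
const reportHits = idx.search('report'); // ['row-17', 'row-42']
```

Going from a short ID to `'row-17'` is a single map lookup, but nothing here answers "which short ID belongs to `'row-17'`?", which is what adding a field to an already indexed document would need.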

This can probably be implemented, making it possible to add or remove some fields to/from an existing document, but it should be done in a way that does not make the data structures much larger.

Note that updating in place, either a whole document or just a field, will not be possible. One has to first remove the old document (or field, if we implement this), then add the new one. While it can be cumbersome for the application developer to keep the old document around so it can be deleted, the alternative is worse: MiniSearch would have to copy each indexed field in each document (just referencing won't work, because the document can change in place), or at least keep the list of processed tokens for each indexed field, so it can de-index them upon removal/update. This would make MiniSearch use a lot more memory and be slower to index for everyone, even those who don't use this feature. It is also possible for application developers to implement this more efficiently, even though it is a bit cumbersome, by implementing some "copy on write" mechanism on the documents.
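The remove-then-add flow and the "copy on write" idea can be sketched as follows. This is an illustration under assumptions, not a MiniSearch API: `TermIndex`, `CowStore`, and `tokenizeText` are hypothetical, and the index is reduced to a bare term-to-IDs map. The key point is that removal re-tokenizes the document it receives, so it has to be given the version that was originally indexed.

```javascript
// Hypothetical sketch; the names below are illustrative, not MiniSearch APIs.
const tokenizeText = (value) =>
  String(value ?? '').toLowerCase().split(/\W+/).filter(Boolean);

class TermIndex {
  constructor(fields) {
    this.fields = fields;
    this.postings = new Map(); // term -> Set of document IDs
  }
  add(doc) {
    for (const field of this.fields) {
      for (const term of tokenizeText(doc[field])) {
        if (!this.postings.has(term)) this.postings.set(term, new Set());
        this.postings.get(term).add(doc.id);
      }
    }
  }
  // Removal re-tokenizes the document it is given, so it must receive the
  // version that was originally indexed, not the edited one.
  remove(doc) {
    for (const field of this.fields) {
      for (const term of tokenizeText(doc[field])) {
        const ids = this.postings.get(term);
        if (!ids) continue;
        ids.delete(doc.id);
        if (ids.size === 0) this.postings.delete(term);
      }
    }
  }
  search(term) {
    return [...(this.postings.get(term.toLowerCase()) ?? [])];
  }
}

// Copy-on-write store: every update creates a new frozen object, so the
// previously indexed version is still intact when it has to be removed.
// The copy is shallow, which is enough for flat rows like table records.
class CowStore {
  constructor(index) {
    this.index = index;
    this.indexed = new Map(); // document ID -> frozen snapshot currently in the index
  }
  insert(doc) {
    const snapshot = Object.freeze({ ...doc });
    this.indexed.set(snapshot.id, snapshot);
    this.index.add(snapshot);
  }
  update(id, changes) {
    const previous = this.indexed.get(id);
    if (!previous) throw new Error(`Unknown document ${id}`);
    const next = Object.freeze({ ...previous, ...changes });
    this.index.remove(previous); // de-index the old version first
    this.index.add(next);        // then index the new one
    this.indexed.set(id, next);
    return next;
  }
}

const store = new CowStore(new TermIndex(['name']));
store.insert({ id: 1, name: 'alpha' });
store.update(1, { name: 'beta' });
const oldHits = store.index.search('alpha'); // []
const newHits = store.index.search('beta');  // [1]
```

Because snapshots are frozen and replaced rather than mutated, the application always holds exactly the object the index saw, without the index itself having to copy fields or store token lists.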

KeKs0r commented 2 years ago

I think keeping the "old" document around is not an issue in this case. This is already the case for "data changes" anyway. There are two kinds of changes with regard to columns:

  1. Adding a column: purely additive, and maybe we can find a way to just extend the index with some values for existing document IDs. Here we just have the issue you described with the original ID -> short ID mapping.
  2. Removing a column: maybe this does not have the issue with the ID mapping, because we want to remove all values for the column. So we don't need to know the original ID, just whether an index entry came from a certain field. (I am not sure whether that is possible.)

Is the original ID -> short ID function a hash, or is it random? Also, if the index has the short ID -> original ID mapping, maybe it's fine to do a full lookup of IDs only for this use case, even if that is not as efficient. This way, it would not affect memory consumption for all other use cases.
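The two column operations and the "full lookup" idea can be sketched like this. It is an assumption-laden illustration, not how MiniSearch implements or plans to implement the feature: `FieldAwareIndex`, `addField`, `removeField`, and `splitTerms` are hypothetical, and scoring statistics are omitted. Adding a column builds a temporary reverse map with one pass over the short ID -> original ID mapping, so no extra memory is held afterwards; removing a column only needs to know which field each posting came from.

```javascript
// Hypothetical sketch of field-level add/remove; not MiniSearch's implementation.
const splitTerms = (value) =>
  String(value ?? '').toLowerCase().split(/\W+/).filter(Boolean);

class FieldAwareIndex {
  constructor() {
    this.nextShortId = 0;
    this.documentIds = new Map(); // short ID -> original ID only
    this.postings = new Map();    // term -> Map(field name -> Set of short IDs)
  }

  indexField(shortId, field, value) {
    for (const term of splitTerms(value)) {
      if (!this.postings.has(term)) this.postings.set(term, new Map());
      const byField = this.postings.get(term);
      if (!byField.has(field)) byField.set(field, new Set());
      byField.get(field).add(shortId);
    }
  }

  add(doc, fields) {
    const shortId = this.nextShortId++;
    this.documentIds.set(shortId, doc.id);
    for (const field of fields) this.indexField(shortId, field, doc[field]);
  }

  // Adding a column: build a temporary reverse map with one full pass over
  // documentIds, use it for the whole batch, then let it be garbage collected.
  addField(field, docs) {
    const shortIdByOriginal = new Map();
    for (const [shortId, originalId] of this.documentIds) {
      shortIdByOriginal.set(originalId, shortId);
    }
    for (const doc of docs) {
      const shortId = shortIdByOriginal.get(doc.id);
      if (shortId !== undefined) this.indexField(shortId, field, doc[field]);
    }
  }

  // Removing a column needs no ID mapping: drop the field's postings everywhere.
  removeField(field) {
    for (const [term, byField] of this.postings) {
      byField.delete(field);
      if (byField.size === 0) this.postings.delete(term);
    }
  }

  search(term) {
    const byField = this.postings.get(term.toLowerCase()) ?? new Map();
    const found = new Set();
    for (const shortIds of byField.values()) {
      for (const shortId of shortIds) found.add(this.documentIds.get(shortId));
    }
    return [...found];
  }
}

const rows = [
  { id: 'a', name: 'Ada', city: 'London' },
  { id: 'b', name: 'Bob', city: 'Berlin' },
];
const table = new FieldAwareIndex();
rows.forEach((row) => table.add(row, ['name']));
const beforeAdd = table.search('berlin');   // []
table.addField('city', rows);
const afterAdd = table.search('berlin');    // ['b']
table.removeField('city');
const afterRemove = table.search('berlin'); // []
```

The reverse map costs one linear pass per column change rather than one per document, and the permanent data structures stay the same size for users who never add or remove fields.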

lucaong commented 1 year ago

The new version v6.0.0-beta.1 includes changes that would make this feature possible in the near future (at least adding a field). I will consider this for 6.1.0.