jeancroy / FuzzySearch

:mag: Fast autocomplete suggestion engine using approximate string matching

Caching index #21

Open · nhaberl opened this issue 5 years ago

nhaberl commented 5 years ago

I would like to store the index in local storage once it has been built for a large number of entries. Is there a public API for retrieving / storing it?
And are there any known issues when compressing the index with ez?

jeancroy commented 5 years ago

Hi, this is an interesting direction; I'll try to assist you in achieving it. This is not supported out of the box, but we can probably build it.

If you'd like, I'll introduce some members that are useful to keep in mind while serializing the index.

Original collection

this.source: [original]

We interact with this in a read-only fashion, unless the user explicitly asks to update it; see this.add(item, true).
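For instance, a minimal hedged sketch (here `fs` stands for an already-constructed FuzzySearch instance and the item literal is made up):

```js
// Hypothetical usage sketch: `fs` is an already-constructed FuzzySearch instance.
// Searching reads fs.source but never mutates it.
const results = fs.search("hello");

// Adding an item after construction; per the note above, passing `true` as the
// second argument asks the library to also update fs.source.
fs.add({ id: 42, title: "Hello World" }, true);
```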

Processing cache

this.index: [{ item: [original], fields: [[[]]] }]
this.keys: []

Despite the name, this is more of a cache than an index data structure.

fields is a jagged array.
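To make that shape concrete, here is a hypothetical illustration; the exact nesting is my assumption based on the fields: [[[]]] shape above, and the item and keys are made up:

```js
// Hypothetical entry of this.index, for an item indexed on keys ["title", "tags"].
// Strings are split into lowercase words; a key that yields several strings
// (such as an array field) contributes several word arrays, hence the jagged shape.
const entry = {
  item: { title: "Hello World", tags: ["fuzzy search", "autocomplete"] }, // pointer to the original
  fields: [
    [["hello", "world"]],                   // from "title"
    [["fuzzy", "search"], ["autocomplete"]] // from "tags"
  ],
};
```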

Dynamic update

this.options.identify_item  // Given an original item, return a unique id
index_map: {},              // JSON map of unique id to a position in this.index[]
nb_indexed: 0,              // can be initialized to this.index.length

These are only needed to dynamically add items after the fact, so they are safe to ignore, although I'd still keep nb_indexed set to a proper value.
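A hedged sketch of how these members relate (`fs` is a FuzzySearch instance; the callback and values are made up):

```js
// Hypothetical: derive a unique id from each original item.
fs.options.identify_item = (item) => item.id;

// index_map maps that unique id to a position in fs.index[].
// After indexing two items with ids 42 and 43 it might look like:
//   fs.index_map = { "42": 0, "43": 1 };

// nb_indexed should mirror the number of indexed entries:
fs.nb_indexed = fs.index.length;
```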

Speed-Up data structure (Optional)

this.options.use_index_store // Enable the data structure
this.store: {},     

The format of this data structure is a dictionary from sub-sequences of indexed words
to the indices of entries that contain the word. It should serialize very well.

{
  "w":   [1, 2, 4, 5, 7, 8, 9, 13],
  "wo":  [1, 2, 5, 7, 9],
  "wor": [1, 3, 7, 13],
  "wrd": [1, 3, 7],
  "wrs": [1, 3, 7, 11],
  "wds": [1, 3, 7, 9]
}

Solution A

Honestly, the best bang for the buck would be to not serialize this.index (it's relatively fast to rebuild), and to basically rebuild the object on page load from this.source with the store turned off.

At a later point, I'd import the pre-computed store and switch the store option on; it should just work, provided you processed the source in the same order.
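A minimal sketch of Solution A, assuming a FuzzySearch constructor that accepts source, keys and the use_index_store flag mentioned above (adjust to however you construct your instance); treat it as a starting point rather than a tested recipe:

```js
// --- First build (e.g. offline or on first visit): store enabled ------------
const fs = new FuzzySearch({
  source: documents,                 // your original collection
  keys: ["title", "body"],           // hypothetical keys
  use_index_store: true,
});
localStorage.setItem("fz_store", JSON.stringify(fs.store));

// --- Later page loads: rebuild cheaply with the store off, then graft it ----
const fs2 = new FuzzySearch({
  source: documents,                 // must be processed in the same order as before
  keys: ["title", "body"],
  use_index_store: false,
});
fs2.store = JSON.parse(localStorage.getItem("fz_store"));
fs2.options.use_index_store = true;  // switch the option back on
```

Keep in mind that localStorage has a fairly small quota (commonly a few MB), so a very large store may need IndexedDB or compression instead.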

Solution B

If the above is still too slow, I'd serialize all of the above, perhaps without the pointer to the Indexed.Item, serializing an Indexed.ItemId instead. The pointer can then be repaired when you re-hydrate the object.
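A hedged sketch of Solution B; serializeIndex / rehydrateIndex are hypothetical helpers, the entry shape follows the member list above, and swapping the item pointer for its id is the Indexed.ItemId idea (this assumes identify_item is configured):

```js
// Serialize: replace each entry's item pointer by its unique id.
function serializeIndex(fs) {
  return JSON.stringify({
    index: fs.index.map((entry) => ({
      itemId: fs.options.identify_item(entry.item), // id instead of the object pointer
      fields: entry.fields,
    })),
    keys: fs.keys,
    index_map: fs.index_map,
    nb_indexed: fs.nb_indexed,
    store: fs.store,
  });
}

// Re-hydrate: repair the pointers by looking the ids up in the freshly loaded source.
function rehydrateIndex(fs, json, source) {
  const data = JSON.parse(json);
  const byId = new Map(source.map((item) => [fs.options.identify_item(item), item]));
  fs.index = data.index.map((entry) => ({
    item: byId.get(entry.itemId),
    fields: entry.fields,
  }));
  fs.keys = data.keys;
  fs.index_map = data.index_map;
  fs.nb_indexed = data.nb_indexed;
  fs.store = data.store;
}
```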

o0101 commented 2 years ago

@jeancroy I am also interested in serializing, because I am using your wonderful fz-search library to add typo-tolerant search to a large document collection (web pages for a personal archive).

For example, say I have a document about "Meghan Markle" in my archive, but I do not know her name is spelled "Meghan" and instead spell it "Megan". Most other JS document search solutions will not really work with such a typo, and I will be told my archive contains no such documents. This is sad, and not a good experience!

So I went looking for a fuzzy search library, but after trying one (fuzzysort, and examining its derivatives) I saw performance problems and was not happy with it.

Then I discovered your (in my view, underappreciated) gem. So far your library has been great (just GREAT!), but I am interested in serializing it. To my eyes, you are engaging in all sorts of black magic voodoo inside your code (highly intelligent stuff) and I can't figure out how to serialize it myself. Nor do I really wish to try, as it will take time and I will probably fuck it up. And I don't really have the time to do that; I need to focus on other things.

I have a general-purpose extended serialization library (https://github.com/i5ik/weird-json), but I have not implemented support for circular references yet, so I don't think I could just plug it in there... I know you must be super busy, and this is definitely not your job, but I would LOVE (❤️) it if you could implement this.

jeancroy commented 2 years ago

OK, so I wrote this some time ago; I'll try to use your recent experience to judge things.

1) Fuzzy search is full-text search. It's a bit unfortunate, as it implies storing all documents in memory. I still make some effort to be efficient.

2) There is a simple data structure (the store) that I use to select a few elements to do the full-text search on. Basically, it's a minimum-quality heuristic/requirement before doing more computation.

For very large projects it could make sense for that step to occur in a SQL database.
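To illustrate the idea only (a conceptual sketch, not the library's actual selection logic; deriveKey and scoreEntry are hypothetical placeholders):

```js
// Conceptual pre-filter: the store maps keys derived from indexed words to
// positions of entries, so only a small candidate set goes through the
// expensive fuzzy scoring.
function searchWithPrefilter(fs, query, deriveKey, scoreEntry) {
  const candidates = fs.store[deriveKey(query)] || []; // e.g. [1, 3, 7, 13]
  return candidates
    .map((pos) => ({ entry: fs.index[pos], score: scoreEntry(fs.index[pos], query) }))
    .filter((result) => result.score > 0)
    .sort((a, b) => b.score - a.score);
}
```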


There are a few things I do before any search.

this.index: basically I reformat the data you made searchable into a standard shape for easy looping; I also split strings into words and keep the lowercase form. It's more of a preparation than an index, really.

this.store: an actual index data structure.


The actual index (this.store) is a dictionary of "string -> array of int", so it should be trivial to serialize.

The int in question is a position index into the entries mentioned above.

Additionally, there's this.index_map: a dictionary of typeof(id) -> int. This is used by add() to prevent repeated addition of the same item.


All in all, there are no circular references if your original dataset contains none.

"Pointer to original object", this could be an integer position in this.source, it would avoid duplication in serialization. On the flip side it would be harder to maintain if there's such a thing as removal.

jeancroy commented 2 years ago

@i5ik I'd like some timing information on how long it takes you to initialize a fuzzy search object, and maybe how many entries / how many fields per entry / how many characters of text per entry on average.