kerighan / eldar

Boolean text search in Python
MIT License

Support for creating an index incrementally #20

Closed: distbit0 closed this issue 1 year ago

distbit0 commented 2 years ago

I have several thousand articles stored as text files that I would like to index, but loading them all into RAM as a documents list and then indexing them is not feasible because of their total size.

Could you please add support for adding documents to an index incrementally, so that not all indexed documents need to be in RAM simultaneously? I'd like to be able to loop through the documents and add them to the index one at a time.

Many thx
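
For illustration, the request above amounts to something like the following minimal sketch. It is not eldar's API: `IncrementalIndex`, `add`, `search_all`, and `index_directory` are hypothetical names, and the tokenizer is a plain `\w+` regex chosen for brevity. Only the postings (term to document ids) stay in memory; each document's text is dropped as soon as it has been indexed.

```python
import re
from collections import defaultdict
from pathlib import Path

TOKEN_RE = re.compile(r"\w+")


class IncrementalIndex:
    """Toy inverted index built one document at a time (hypothetical, not eldar)."""

    def __init__(self):
        # term -> set of ids of the documents that contain the term
        self.postings = defaultdict(set)
        self.num_docs = 0

    def add(self, text):
        """Index one document and return its id; the text itself is not kept."""
        doc_id = self.num_docs
        self.num_docs += 1
        for term in set(TOKEN_RE.findall(text.lower())):
            self.postings[term].add(doc_id)
        return doc_id

    def search_all(self, *terms):
        """Return the ids of documents that contain every given term (AND)."""
        sets = [self.postings.get(term.lower(), set()) for term in terms]
        return set.intersection(*sets) if sets else set()


def index_directory(index, directory, pattern="*.txt"):
    """Add each matching file to the index, holding one file in RAM at a time."""
    id_to_path = {}
    for path in sorted(Path(directory).glob(pattern)):
        doc_id = index.add(path.read_text(encoding="utf-8"))
        id_to_path[doc_id] = path
    return id_to_path
```

The dictionary returned by `index_directory` maps each document id back to its file, so a search hit can be re-read from disk on demand. Note that the postings themselves still have to fit in RAM in this sketch.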

distbit0 commented 2 years ago

If this were implemented, Boyer-Moore would be a much less important feature (at least for me) :)

kerighan commented 1 year ago

It's non-trivial to incrementally add new documents given the current implementation. I would have to create an index with on-disk data structures, allowing for larger-than-RAM corpora. I have the technology to do it, but it is not yet mature enough. I can look into it as soon as I have more time (around mid-January).
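
As a rough idea of what an on-disk index could look like, here is a sketch that stores postings in SQLite through Python's standard `sqlite3` module. This is an assumption made for illustration only, not the data structure kerighan has in mind; `open_index`, `add_document`, and `docs_with_all` are hypothetical names, and the table layout is one of many possible choices.

```python
import re
import sqlite3

TOKEN_RE = re.compile(r"\w+")


def open_index(db_path):
    """Open (or create) a SQLite file holding one row per (term, document) pair."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "term TEXT NOT NULL, doc_id TEXT NOT NULL, "
        "PRIMARY KEY (term, doc_id)) WITHOUT ROWID"
    )
    return conn


def add_document(conn, doc_id, text):
    """Add one document's distinct terms to the on-disk postings."""
    terms = set(TOKEN_RE.findall(text.lower()))
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR IGNORE INTO postings (term, doc_id) VALUES (?, ?)",
            [(term, doc_id) for term in terms],
        )


def docs_with_all(conn, terms):
    """Return the ids of documents that contain every given term (AND)."""
    terms = sorted({term.lower() for term in terms})
    if not terms:
        return set()
    placeholders = ", ".join("?" for _ in terms)
    rows = conn.execute(
        f"SELECT doc_id FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(*) = ?",
        (*terms, len(terms)),
    )
    return {row[0] for row in rows}
```

Because each (term, doc_id) pair is unique, counting the matched rows per document tells how many of the query terms it contains. Neither the corpus nor the postings need to fit in memory, at the cost of slower lookups than an in-memory index.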

distbit0 commented 1 year ago

Btw, I no longer really need to index my docs: by specifying ignore_accent=False, I was able to speed up search by about 30x. The unicode decode step was taking roughly 99% of the time it took to search my docs, and eldar only runs that step when ignore_accent is set to True (which it is by default). @kerighan
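
Without an index, the corpus can still be searched without holding every article in RAM by streaming the files through a matcher. The sketch below assumes one article per `.txt` file; `search_files` and the `matches` predicate are hypothetical names. Any callable that takes a string and returns a boolean should work as the predicate, which presumably includes a compiled eldar query created with ignore_accent=False.

```python
from pathlib import Path


def search_files(directory, matches, pattern="*.txt"):
    """Yield the paths of files whose text satisfies `matches`, one file at a time."""
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text(encoding="utf-8")
        if matches(text):
            yield path


def contains_gandalf(text):
    """Stand-in predicate for the example; swap in a real query object as needed."""
    return "gandalf" in text.lower()
```

Since `search_files` is a generator, only the file currently being tested is in memory, and results can be consumed as they are found.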