Closed (distbit0 closed this issue 1 year ago)
If this were implemented, Boyer-Moore would be a much less important feature (at least for me) :)
It's non-trivial to add new documents incrementally given the current implementation. I would have to create an index backed by on-disk data structures, which would allow for a larger-than-RAM corpus. I have the technology to do it, but it is not yet mature enough. I can look into it as soon as I have more time (around mid-January).
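To make the idea of an on-disk index concrete, here is a minimal sketch using SQLite from the Python standard library. It is not eldar's implementation or the technology mentioned above; the function names (`open_index`, `add_document`, `lookup`) and the table layout are hypothetical, and it only handles single-term lookups rather than boolean queries. The point is that only one document's text needs to be in memory at a time, while the postings live on disk.

```python
import re
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a tiny inverted index stored in SQLite.

    Hypothetical helper: pass a file path to keep the index on disk;
    the ":memory:" default is only convenient for quick experiments.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings ("
        "term TEXT NOT NULL, doc_id TEXT NOT NULL, "
        "PRIMARY KEY (term, doc_id))"
    )
    return conn


def add_document(conn, doc_id, text):
    """Index one document; only this document's text is held in memory."""
    terms = set(re.findall(r"\w+", text.lower()))
    conn.executemany(
        "INSERT OR IGNORE INTO postings (term, doc_id) VALUES (?, ?)",
        [(term, doc_id) for term in terms],
    )
    conn.commit()


def lookup(conn, term):
    """Return the sorted ids of documents that contain `term`."""
    rows = conn.execute(
        "SELECT doc_id FROM postings WHERE term = ? ORDER BY doc_id",
        (term.lower(),),
    )
    return [row[0] for row in rows]
```

Documents could then be added in a loop, one file at a time, with `add_document(conn, filename, text)`. Supporting full boolean queries on top of such postings would take more work than this sketch shows.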
Btw I no longer really need to index my docs: I found that by specifying ignore_accent=False, I was able to make searches about 30x faster. The unicode decode step was taking roughly 99% of the time it took to search my docs, and eldar only runs that step when ignore_accent is set to True (which it is by default). @kerighan
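For readers wondering why an accent-handling flag can dominate search time, here is a rough stand-in written with the standard library's `unicodedata` module. It is not eldar's actual code; `strip_accents` and `prepare` are hypothetical names, and only the `ignore_accent` flag mirrors the parameter discussed above. It shows that accent stripping is an extra pass over every character of every document, which is skipped entirely when the flag is off.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks after decomposing characters (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def prepare(text, ignore_accent=True):
    """Hypothetical preprocessing step applied to a document before matching.

    With ignore_accent=False the text is returned untouched, so the
    per-character normalization cost is never paid.
    """
    if ignore_accent:
        return strip_accents(text)
    return text
```

The trade-off is that with `ignore_accent=False` a query for "cafe" would no longer match a document containing "café".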
I have several thousand articles stored as text files that I would like to index, but loading them all into RAM as a documents[] list and then indexing them is not feasible because of their size.
Could you please add support for adding documents to an index incrementally, so that not all indexed documents need to be in RAM simultaneously? I'd like to be able to loop through the documents and add each one to the index, one at a time.
Many thx
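Until incremental indexing is available, one possible workaround is to skip the index and stream the files from disk, checking each one as it is read. The sketch below uses only the standard library; `iter_documents` and `matching_files` are hypothetical helpers, and `matches` is any callable that takes a string and returns a truthy value (a plain substring check in the usage example; whether an eldar query object can be passed directly is an assumption to verify against its documentation).

```python
from pathlib import Path


def iter_documents(folder, pattern="*.txt"):
    """Yield (path, text) pairs one file at a time, in sorted order."""
    for path in sorted(Path(folder).glob(pattern)):
        # Only the current file's contents are held in memory.
        yield path, path.read_text(encoding="utf-8", errors="replace")


def matching_files(folder, matches):
    """Return the names of files whose text satisfies `matches`."""
    return [path.name for path, text in iter_documents(folder) if matches(text)]
```

For example, `matching_files("articles", lambda text: "frodo" in text.lower())` would return the names of the matching files in a hypothetical `articles` folder. This rereads every file on each search, so it trades search speed for a small memory footprint.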