Parallel indexing - Githubissues

traverseda commented 4 years ago

Right now LCARS is limited to a single search indexer. Here are the problems with the existing solutions.

AsyncWriter: quickly ends up caching every enqueued document in memory, using up all the RAM.
BufferedWriter: has database corruption issues.
writer(procs=8): can't daemonize code from inside one of our worker daemons.
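
The writer(procs=8) problem looks like the standard restriction in Python's multiprocessing module that a daemonic process may not start child processes. The thread doesn't say how the worker daemons are launched, so treat this as a guess: the sketch below assumes they are multiprocessing daemon processes, and the names _worker and nested_start_result are made up for illustration. It uses only the standard library, and the "fork" start method, so it is for Linux/macOS only.

```python
import multiprocessing as mp


def _worker(conn):
    """Runs inside a daemonic process and tries to start a child of its own."""
    child = mp.Process(target=print, args=("never runs",))
    try:
        # Process.start() asserts that the current process is not daemonic.
        child.start()
    except AssertionError as exc:
        conn.send(str(exc))
    else:
        # Only reachable if assertions are disabled (python -O).
        child.join()
        conn.send("child started")
    finally:
        conn.close()


def nested_start_result(timeout=10):
    """Start a daemon worker and return what happened when it tried to fork."""
    ctx = mp.get_context("fork")  # assumption: a fork-capable platform
    parent_end, child_end = ctx.Pipe()
    proc = ctx.Process(target=_worker, args=(child_end,), daemon=True)
    proc.start()
    child_end.close()  # the parent only reads
    message = parent_end.recv() if parent_end.poll(timeout) else "no reply"
    proc.join()
    return message


if __name__ == "__main__":
    print(nested_start_result())
```

On a normal CPython run this should print the "daemonic processes are not allowed to have children" assertion message, which would explain why a multiprocess writer can't be started from inside a worker.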

It's a tricky thing to solve in a distributed system. Of course, if you're just indexing HTML, the only things we get out of being a distributed system are a way of making RPC calls and a way of scheduling calls for the future. The overhead from the SQLite task-queue will quickly erode any performance benefit from using a task-queue.

I like using a task-queue: it keeps things very simple, since we don't need real RPC mechanisms and we can scale easily. But the overhead from SQLite is ~40% (according to py-spy) for lightweight HTML documents, the kind that will probably make up the majority of the content most people index.
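
To make the per-task cost concrete: with a SQLite-backed queue every task costs at least one INSERT plus commit on the way in and one DELETE plus commit on the way out, so tiny documents pay that price once each. One possible mitigation, which this thread does not propose, is to put several documents into a single task. The sketch below is a toy queue, not the project's actual task-queue; the table layout and the names open_queue, enqueue_batched and dequeue are hypothetical.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Create a tiny SQLite-backed FIFO task queue (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue_batched(conn, documents, batch_size):
    """Store documents as tasks holding up to batch_size documents each.

    Returns the number of tasks (and therefore commits) that were needed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    tasks = 0
    for start in range(0, len(documents), batch_size):
        batch = documents[start:start + batch_size]
        conn.execute(
            "INSERT INTO tasks (payload) VALUES (?)", (json.dumps(batch),)
        )
        conn.commit()  # one commit per task, not per document
        tasks += 1
    return tasks


def dequeue(conn):
    """Pop the oldest task and return its documents, or None if empty."""
    row = conn.execute(
        "SELECT id, payload FROM tasks ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute("DELETE FROM tasks WHERE id = ?", (row[0],))
    conn.commit()
    return json.loads(row[1])
```

With batch_size=1 this is the one-task-per-document case; larger batches trade scheduling granularity for fewer round trips through SQLite. Whether that wins back a meaningful part of the ~40% would have to be measured with py-spy again.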

traverseda commented 4 years ago

I have a branch, pwriter, where I've started to implement this. Unfortunately it requires more knowledge of the Whoosh backend than I actually have. I'm basing it on the multiproc writer, but I'm having a hard time figuring out how things are actually structured.

traverseda commented 3 years ago

No longer relevant since the move to SQLite.

traverseda / iiab-searchServices

Parallel indexing #5