traverseda / iiab-searchServices

Search services I'm writing for my personal data-archive/internet-in-a-box
GNU Affero General Public License v3.0

Parallel indexing #5

Closed traverseda closed 3 years ago

traverseda commented 4 years ago

Right now LCARS is limited to a single search indexer. Here are the problems with the existing solutions.

This is tricky to solve in a distributed system. If we're only indexing HTML, the only things we get out of being a distributed system are a way of making RPC calls and a way of scheduling calls for the future. Meanwhile, the overhead of the sqlite task-queue quickly erodes any performance benefit from using a task-queue at all.

I like using a task-queue: it keeps things very simple, since we don't need a real RPC mechanism and we can scale easily. But the sqlite overhead is ~40% (according to py-spy) for lightweight HTML documents, the kind that will probably make up the majority of the content most people index.
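For anyone who wants to see the effect without running py-spy, here is a rough sketch of how the queue's share of wall time could be measured. It is not the LCARS task-queue: the `tasks` table, `index_html`, and `measure_queue_overhead` are hypothetical names made up for illustration, the "indexing" is just word extraction with the standard library, and the number it prints depends on the machine and the documents, so it should not be expected to reproduce the ~40% figure.

```python
import sqlite3
import time
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects whitespace-separated words from the text nodes of a document."""

    def __init__(self):
        super().__init__()
        self.words = []

    def handle_data(self, data):
        self.words.extend(data.split())


def index_html(html):
    """Stand-in for real indexing (assumption): count the words in the HTML."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return len(parser.words)


def measure_queue_overhead(docs):
    """Run docs through a toy sqlite-backed queue, one task at a time.

    Returns (total_words_indexed, fraction_of_time_spent_in_queue_operations).
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, payload TEXT)")
    db.executemany("INSERT INTO tasks (payload) VALUES (?)", [(d,) for d in docs])
    db.commit()

    queue_time = 0.0
    work_time = 0.0
    indexed = 0
    while True:
        # Queue overhead: fetch the next pending task.
        start = time.perf_counter()
        row = db.execute(
            "SELECT id, payload FROM tasks ORDER BY id LIMIT 1"
        ).fetchone()
        queue_time += time.perf_counter() - start
        if row is None:
            break

        # Actual work: "index" the document.
        start = time.perf_counter()
        indexed += index_html(row[1])
        work_time += time.perf_counter() - start

        # Queue overhead: mark the task done by deleting it.
        start = time.perf_counter()
        db.execute("DELETE FROM tasks WHERE id = ?", (row[0],))
        db.commit()
        queue_time += time.perf_counter() - start

    db.close()
    total = queue_time + work_time
    fraction = queue_time / total if total > 0 else 0.0
    return indexed, fraction


if __name__ == "__main__":
    sample = ["<p>hello world %d</p>" % i for i in range(1000)]
    words, share = measure_queue_overhead(sample)
    print(f"indexed {words} words, queue share of time: {share:.0%}")
```

The point of the sketch is only the shape of the problem: when each document takes very little time to process, the fixed per-task cost of fetching and acknowledging it becomes a large share of the total.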

traverseda commented 4 years ago

I have a branch, pwriter, where I've started to implement this. Unfortunately it requires more knowledge of the Whoosh backend than I actually have. I'm basing it on Whoosh's multiproc writer, but I'm having a hard time figuring out how things are actually structured.

traverseda commented 3 years ago

No longer relevant since the move to sqlite.