traverseda / iiab-searchServices

Search services I'm writing for my personal data-archive/internet-in-a-box
GNU Affero General Public License v3.0

Technology Choices #3

Open tim-moody opened 4 years ago

tim-moody commented 4 years ago

It looks like you settled on whoosh, which leads me to wonder what other choices you considered. For example, Xapian is used by Kiwix and seems to have much more recent and active development. Whoosh was used by the original IIAB OSM module and I remember having package issues, and the latest release is from 2016, but maybe you have had a better experience.

tim-moody commented 4 years ago

Went back and read the readme and I guess you somewhat covered this. You found whoosh a lot faster than xapian at indexing, but were somewhat unclear on which is faster on retrieval. If there is a choice we should favor retrieval as indexing is only done once for our mostly static content.

traverseda commented 4 years ago

So far no package issues with whoosh. It seems to just install and work, and being a pure Python package I expect that to continue for the foreseeable future. In comparison, xapian relies on a bunch of C code that is less portable, and you need to install it via your distro's package manager (there aren't pip packages for xapian). Whoosh might not be getting significant updates, but to a certain extent it's also complete.

I've looked into a number of search backends. Originally this did use xapian; the big thing I ran into was that I ended up having multiple sources of truth. I had a much harder time trying to control what xapian actually stored vs. what it just indexed. When I was using xapian I found that I had to keep a separate database for things like "when was the last time this was indexed".

Xapian and whoosh seem to be pretty comparable as far as search speed is concerned, at least for medium-sized datasets.

But yeah, it's a tough choice. I'm left feeling like if I put in a bunch more effort I could get better results with xapian (faster search and lower memory usage), but so far I've had a pretty difficult time working with it.

Another thing I noticed is that whoosh seems to actually produce better search results, although of course that's subjective and hard to quantify. It does seem to have a more flexible ranking system, and I think the defaults it uses are better. Once again though that's the kind of thing where maybe I can get similar results with xapian if I put in a bunch of effort.

There is a library with pluggable search-engine backends I could use, so you can swap out the underlying engines. Unfortunately it seems pretty deeply integrated with django, which brings in a whole bunch of its own problems. It also has some fairly leaky abstractions and generally requires a lot more sysadmin effort.

I guess my defense is that whoosh is not that much worse performance-wise, but it's a lot better documented and easier to work with.

tim-moody commented 4 years ago

Another question: How much overhead does Flask add to the search server?

traverseda commented 4 years ago

Basically none. Responding to HTTP requests requires very little to begin with, and Flask is about as minimal as they come.

traverseda commented 4 years ago

I'm pretty sure whoosh is not "almost as fast" outside of contrived cases. It's still a lot easier to work with, but....

traverseda commented 3 years ago

You were 100% right, whoosh was too slow. I've moved master to sqlite's full text search, which works pretty well and doesn't require any OS packages, meaning my goal of being able to install it 100% through pip continues to work. For raw html I was getting ~250 pages/second indexed. I'll try running the test again and see if it still works.
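
As a rough sketch of what indexing and querying with sqlite full text search can look like from Python's standard library: this assumes the FTS5 extension is compiled into the local SQLite build (common, but not guaranteed), and the `pages` table, its columns, and the helper names are made up for illustration rather than taken from this repo.

```python
import sqlite3

def build_index(conn, pages):
    """Create a hypothetical FTS5 table and index (url, body) pairs."""
    # url is stored but not tokenized; only body is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url UNINDEXED, body)"
    )
    conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", pages)
    conn.commit()

def search(conn, query, limit=10):
    """Return URLs of matching pages, best match first (FTS5's built-in rank)."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE pages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
build_index(conn, [
    ("/wiki/water", "Clean water and sanitation basics"),
    ("/wiki/solar", "Solar power for off-grid schools"),
])
results = search(conn, "solar")
print(results)
```

Because the stored columns and the index live in the same database file, there is only one source of truth to keep consistent.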

Thanks to sqlite we can merge two different databases. Haven't gotten around to that but it means we should be able to pretty easily distribute pre-indexed content.
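
One way the merge could work, as an untested-in-this-repo sketch: attach the pre-indexed database and copy its rows into the local one with `ATTACH DATABASE` and `INSERT ... SELECT`. The `pages` schema, file names, and function names below are assumptions for illustration, and FTS5 is assumed to be available. Note that this re-tokenizes the copied text on insert rather than splicing the raw index structures together.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema shared by every index file.
SCHEMA = "CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url UNINDEXED, body)"

def make_index(path, pages):
    """Build a standalone index file from (url, body) pairs."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO pages (url, body) VALUES (?, ?)", pages)
    conn.commit()
    conn.close()

def merge_index(target_path, source_path):
    """Copy every row of the source index into the target index."""
    conn = sqlite3.connect(target_path)
    conn.execute(SCHEMA)
    conn.execute("ATTACH DATABASE ? AS src", (source_path,))
    conn.execute(
        "INSERT INTO main.pages (url, body) SELECT url, body FROM src.pages"
    )
    conn.commit()  # DETACH is not allowed inside an open transaction
    conn.execute("DETACH DATABASE src")
    conn.close()

with tempfile.TemporaryDirectory() as tmp:
    local = os.path.join(tmp, "local.db")
    shipped = os.path.join(tmp, "shipped.db")
    make_index(local, [("/local/notes", "personal notes about gardening")])
    make_index(shipped, [("/wiki/solar", "solar power for schools")])
    merge_index(local, shipped)

    conn = sqlite3.connect(local)
    merged_urls = sorted(row[0] for row in conn.execute("SELECT url FROM pages"))
    conn.close()
```

A real version would also need to decide what to do when the same URL appears in both files; this sketch simply keeps both rows.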

I'm very happy with sqlite full-text-search. All the other options have come with significant drawbacks, but so far sqlite seems perfect (well, I'd like it if it supported transparent compression). Hopefully now that I've got that solved I'll find some more time to work on this.

tim-moody commented 3 years ago

Sounds like real progress. I especially like the merger of precalculated search indices. It should be possible to extract the zim index with some work.