traverseda / iiab-searchServices

Search services I'm writing for my personal data-archive/internet-in-a-box
GNU Affero General Public License v3.0

Proposed Trial #4

Open tim-moody opened 4 years ago

tim-moody commented 4 years ago

2 1/2 years ago I scraped the CDC web site. Of course I couldn't include its search capability because I don't have the back end. This would be a great working exercise as the site is very extensive and yet easily installable.

@traverseda you can use the Admin Console to download and install it and then run your indexing and searching on the result. We can then integrate that with the en-cdc module which is mostly static.

traverseda commented 4 years ago

I've been trying it out on about 4TB of mixed epubs and PDFs. So far there are some annoying stability issues, but the results themselves are pretty good.

I'll give that a shot. I'll need to actually install IIAB first.

traverseda commented 4 years ago

Well, my distributed crawling is vastly outstripping whoosh's ability to actually index the content. I spent a fair bit of time optimizing the HTML text extractor so it's very fast, and it's running on a task queue meant for dealing with 100MB+ PDF files. I probably shouldn't be putting the URL deduplicator in the same single queue as the whoosh indexer, but I don't want to spawn too many subprocesses.

We'll have to talk about what kind of defaults would work best for IIAB, and the kind of content it's bringing in, and I'll see if I can get whoosh running a bit faster.
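
For anyone following along, here is a rough sketch of the queue split described above. It is not the code in this repo: the function names, the URL normalization rule, and the `commit_batch` callback (a stand-in for whatever writes to the whoosh index) are all hypothetical, and it uses a thread plus a bounded `queue.Queue` rather than the task queue and subprocesses mentioned above. The idea is only that deduplication happens before anything reaches the indexing queue, and that the bounded queue makes the crawler wait when the indexer falls behind.

```python
import queue
import threading

SENTINEL = None  # marks the end of the crawl stream


def normalize(url):
    """Rough normalization (an assumption): drop the fragment and trailing slashes."""
    return url.split("#", 1)[0].rstrip("/")


def crawl_stage(urls, out_q):
    """Deduplicate URLs before they ever reach the indexing queue."""
    seen = set()
    for url in urls:
        key = normalize(url)
        if key in seen:
            continue
        seen.add(key)
        # put() blocks while the queue is full, so a slow indexer
        # throttles the crawler instead of letting work pile up.
        out_q.put(key)
    out_q.put(SENTINEL)


def index_stage(in_q, commit_batch, batch_size):
    """Single consumer: group items so the index is committed once per batch."""
    batch = []
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            commit_batch(batch)
            batch = []
    if batch:
        commit_batch(batch)  # flush the final partial batch


def run_pipeline(urls, commit_batch, max_pending=2, batch_size=3):
    """Run the crawler in the calling thread and the indexer in one worker thread."""
    q = queue.Queue(maxsize=max_pending)
    indexer = threading.Thread(target=index_stage, args=(q, commit_batch, batch_size))
    indexer.start()
    crawl_stage(urls, q)
    indexer.join()
```

In a real setup `commit_batch` would add each document to the index and commit once per batch. Reasonable values for `max_pending` and `batch_size` would depend on the content IIAB is bringing in, which is the defaults question above.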