traverseda / iiab-searchServices

Search services I'm writing for my personal data-archive/internet-in-a-box
GNU Affero General Public License v3.0

Indexing and search target should be local file #6

Open tim-moody opened 4 years ago

tim-moody commented 4 years ago

The usage doc shows the command lcars-cli index http://someurl, which suggests we are indexing an external site. I was expecting an index of all the (HTML) files under a local directory, for example /library/www/html/modules/en-cdc/, which is rendered by the server as /modules/en-cdc/. Are you expecting that such a content module would be indexed via http://localhost/modules/en-cdc/? Or even http://box.lan/modules/en-cdc/? Does the index expect to include the host name in the search results?

This might make some sense for kiwix or kalite, but these will be very hard to spider, the first because there is no index that will visit every page and the latter possibly because any indexing is computed in javascript.

btw, what is lcars?

traverseda commented 4 years ago

LCARS is the Library Computer Access and Retrieval System from Star Trek. It's a placeholder name, as I didn't want to be too presumptuous by taking the IIAB name.

Are you expecting that such a content module would be indexed via http://localhost/modules/en-cdc/?

Yeah, that is the intent. There's enough non-HTML content that I think just ensuring everything gets indexed via the web is for the best; we can't exactly index a kiwix ZIM file directly. The canonical representation of the URL is something I've thought about. I think the ideal solution would be to save it as http://localhost/whatever, then in the template, when we render a link, strip the http://localhost part so it just links to /whatever. That's a problem if you're indexing from more than one machine, though, as localhost will point to different locations...
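A minimal sketch of that idea (the helper names here are just illustrative, not part of the project):

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "http://localhost"  # assumption: the host used at index time

def canonical_url(url: str) -> str:
    """Store every page under a single canonical host so the index stays consistent."""
    parts = urlsplit(url)
    return urlunsplit(("http", "localhost", parts.path, parts.query, ""))

def display_url(stored_url: str) -> str:
    """When rendering a result, drop the canonical host so the link is relative."""
    if stored_url.startswith(CANONICAL_HOST):
        return stored_url[len(CANONICAL_HOST):] or "/"
    return stored_url

# e.g. display_url("http://localhost/modules/en-cdc/index.html") -> "/modules/en-cdc/index.html"
```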

the first because there is no index that will visit every page

I mean any page that is linked to would be grabbed eventually. It's less than ideal; if there were a way to list all the links in a ZIM file we could use that as the seed for our indexer.
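For what it's worth, something like zim-tools' zimdump could probably produce that seed list. A rough sketch, with the assumption (not verified here) that "zimdump list <file>" prints one entry path per line, and with a hypothetical base_url mapping:

```python
import subprocess

def zim_seed_urls(zim_path: str, base_url: str) -> list[str]:
    # Assumption: `zimdump list <file>` (from zim-tools) prints entry paths one per line.
    out = subprocess.run(["zimdump", "list", zim_path],
                         capture_output=True, text=True, check=True)
    return [f"{base_url.rstrip('/')}/{path.lstrip('/')}"
            for path in out.stdout.splitlines() if path]

# e.g. zim_seed_urls("/library/zims/wikipedia_en_all.zim",
#                    "http://localhost/kiwix/wikipedia_en_all")  # hypothetical paths
```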

latter possibly because any indexing is computed in javascript.

Making the indexer able to process JavaScript is not out of the question, although I do worry about the memory constraints, since it would basically involve running a full browser.


I've hit a bit of a roadblock performance-wise. I've been trying to adapt the multiprocessing-based indexer to a distributed task queue, as there are some issues running it inside of a daemonized process. I should have some more time to work on that coming up pretty soon, but it would mean we'd index things at ~40 pages/second, which makes stuff like indexing Wikipedia more doable.
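Purely as an illustration of the task-queue direction (the project hasn't committed to a particular queue; Celery with a Redis broker is just an example, and add_to_index is a placeholder for the real Whoosh writer):

```python
# tasks.py -- minimal sketch of moving the indexer onto a distributed task queue.
import requests
from celery import Celery

app = Celery("lcars_indexer", broker="redis://localhost:6379/0")

@app.task(rate_limit="40/s")   # roughly the ~40 pages/second target mentioned above
def index_page(url: str) -> None:
    html = requests.get(url, timeout=30).text
    add_to_index(url, html)

def add_to_index(url: str, html: str) -> None:
    # Placeholder so the sketch is self-contained; the real project would
    # extract text from the HTML and feed it to its search index here.
    print(f"indexed {url} ({len(html)} bytes)")
```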

tim-moody commented 4 years ago

the first because there is no index that will visit every page

I mean any page that is linked to would be grabbed eventually. It's less than ideal; if there were a way to list all the links in a ZIM file we could use that as the seed for our indexer.

Possible, but not guaranteed. Have a look at http://iiab.me/kiwix/wikipedia_en_all_maxi_2018-10/A/User:Stephane_(Kiwix)_Landing.html. Wikis rely heavily on search to find pages. Also, the ZIM has already been indexed; I would try to get the index out of the file rather than spidering 5M pages.

For modules, which will benefit the most from indexing, I would run off the file system. It should be a lot faster; that is what kiwix does when creating the ZIM.
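A sketch of what running off the file system could look like for modules, assuming the usual IIAB layout where /library/www/html/modules/<name>/ is served as /modules/<name>/:

```python
import os

DOC_ROOT = "/library/www/html"   # assumption: standard IIAB document root

def module_pages(module_name: str):
    """Yield (web_path, file_path) pairs for every HTML file under a module."""
    module_dir = os.path.join(DOC_ROOT, "modules", module_name)
    for root, _dirs, files in os.walk(module_dir):
        for name in files:
            if name.endswith((".html", ".htm")):
                file_path = os.path.join(root, name)
                web_path = file_path[len(DOC_ROOT):]   # e.g. /modules/en-cdc/index.html
                yield web_path, file_path

# e.g. for web_path, file_path in module_pages("en-cdc"):
#          index_document(web_path, open(file_path).read())  # hypothetical indexer call
```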

traverseda commented 4 years ago

I'm not sure that nginx serving the HTML would add any significant overhead; we're mostly either CPU-bound (generating the search index) or IO-bound (saving the search index). Reading from a canonical web source has a number of advantages, though. For one thing, we can't guarantee a file on disk is actually accessible from the web interface. Imagine we point the indexer at a folder with an .htaccess file: we don't expect the indexer to process the .htaccess file, but that file will affect how the page is actually served to the user. I think it's a lot more elegant to index the actual content as it is available to the user, since it removes an entire, very large class of edge cases and introduces very little overhead for static file serving. It also lets us do distributed crawling, which honestly isn't a big benefit for IIAB, but we can merge two unrelated search indexes together, so we could prepare an index for a dataset ahead of time and then people could merge it in.
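For what it's worth, a sketch of that "merge a prepared index in later" idea using Whoosh's add_reader, assuming both indexes were built with the same schema (the directory paths are hypothetical):

```python
from whoosh import index

def merge_index(dest_dir: str, src_dir: str) -> None:
    """Merge every document from the index in src_dir into the index in dest_dir."""
    dest = index.open_dir(dest_dir)
    src = index.open_dir(src_dir)
    writer = dest.writer()
    writer.add_reader(src.reader())   # copies the source documents into the destination
    writer.commit()

# e.g. merge_index("/library/search-index", "/media/usb/prepared-wikipedia-index")
```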

Also, honestly, I'm pretty much just using the task queue as an easy IPC mechanism, since we need some way to interact with long-running tasks; being able to work as part of a cluster is just an added benefit.

I don't think it's possible to directly extract a ZIM file's Xapian index into Whoosh; they use very different stemming functions, so there's a fundamental incompatibility in how they store the data. To use the existing index we'd essentially need to make a "meta" search engine, one that combines the results from different search backends and somehow ranks them. I find this approach generally isn't very good: if there are any issues returning results from a sub-index the whole thing tends to slow to a crawl, and some search systems are worse than others, so you're always bound to wait for the slowest search engine in your meta-search pool.

That kind of meta-search approach is possible, but I've never seen it done well, and I'm skeptical it's possible to do it well unless you impose some very strict requirements on the clients.

tim-moody commented 4 years ago

I think it's a lot more elegant to index the actual content as it is available to the user,

Then you should use headless Chrome to get the pages to index so JS gets a chance to run.

I don't think it's possible to directly extract a ZIM file's Xapian index into Whoosh

That should be a red flag. I expect the time required to index 5M articles on a target system to be considerable.

we can merge two unrelated search indexes together

I think this is very important. I'm not sure why comparative ranking is not a problem with this approach but is with meta search. Empirically, merging kiwix search results across multiple zims is indeed very slow.

so we could prepare an index for a dataset ahead of time and then people could merge it in.

Who is the 'we' and where are these datasets stored? Modules tend to be pretty static, so this is a good approach. ZIMs are produced more frequently.

btw, kiwix used to produce an external index. It might be possible to get them to do so again. I think part of the downside was that the index was actually a directory of many files, so packaging was an issue.

traverseda commented 4 years ago

Then you should use headless Chrome to get the pages to index so JS gets a chance to run.

I've done that in the past; good to know that it's a high-priority feature.

I'm not sure why comparative ranking is not a problem with this approach but is with meta search.

It's a latency/data-format issue. The search backends tokenize and stem keywords: if you searched for "theming", say because you were trying to find information on CSS themes, the engine reduces that word to a base form. Whoosh stems the word "themes" to "theme", while Xapian (with the default strategy) keeps it as "themes". This is why we can't just merge a Xapian and a Whoosh database into one, even if we could write the appropriate shim code: the stemming strategies are different enough that if you searched for "themes" under Whoosh you wouldn't find any of the Xapian documents that mention the word. That's the most basic example of that kind of issue, but there are going to be a huge number of edge cases due to minor differences in tokenization strategy.
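The Whoosh side of this is easy to see directly with its StemmingAnalyzer (the Xapian behaviour described above depends on how the ZIM's index was built, so it isn't reproduced here):

```python
from whoosh.analysis import StemmingAnalyzer

analyzer = StemmingAnalyzer()
print([token.text for token in analyzer("Themes theming themed")])
# -> ['theme', 'theme', 'theme']  (Whoosh lowercases and reduces all three to the stem "theme")
```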

That means that in order to get good rankings we need to send the search query to each backend, get the first few pages of results, and re-index every one of those documents. There are other ways to mux the results from different search engines together, but they tend to produce worse results.
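Very roughly, that re-ranking step could look like the sketch below: pull candidate documents from each backend, re-index them into a throwaway in-memory Whoosh index, and let a single ranking function score them. The (url, text) candidate list is hypothetical glue, not an existing API.

```python
from whoosh.fields import Schema, TEXT, ID
from whoosh.filedb.filestore import RamStorage
from whoosh.qparser import QueryParser

SCHEMA = Schema(url=ID(stored=True), content=TEXT)

def rerank(query: str, candidates: list[tuple[str, str]], limit: int = 20) -> list[str]:
    """candidates: (url, text) pairs pulled from each backend's first few pages of results."""
    ix = RamStorage().create_index(SCHEMA)   # throwaway in-memory index
    writer = ix.writer()
    for url, text in candidates:
        writer.add_document(url=url, content=text)
    writer.commit()
    with ix.searcher() as searcher:
        q = QueryParser("content", SCHEMA).parse(query)
        return [hit["url"] for hit in searcher.search(q, limit=limit)]
```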

That should be a red flag. I expect the time required to index 5M articles on a target system to be considerable.

I really don't think a meta-search approach is going to work. We can create distributed search indexes in advance, but we'd need a place to store them. Even if we presume that we manage to figure out some clean way to merge in results from the pre-indexed ZIM files, we still need to index a bunch of other content, right? How do we deal with that?

traverseda commented 4 years ago

Worst case, I'm imagining weeks of indexing if there's a lot of static content. My current performance target is to get to the point where you can index Wikipedia in around a week on a Raspberry Pi. I think that's achievable, but I haven't had time to really dig into the filestore backend and how to efficiently merge segments together. I should have more info soon.

The headless Chromium support is a lot easier, and was always something I was planning. I'll prioritize that.
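A sketch of the headless-browser fetch, using Playwright here purely as one way of driving headless Chromium (not necessarily what the project will settle on; assumes pip install playwright and playwright install chromium have been run):

```python
from playwright.sync_api import sync_playwright

def rendered_html(url: str) -> str:
    """Fetch a page with headless Chromium so JavaScript-generated content is indexable."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for JS-driven requests to settle
        html = page.content()
        browser.close()
    return html

# e.g. rendered_html("http://localhost/modules/en-cdc/")  # hypothetical local module URL
```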