Thanks @mitra42! As we try to figure this out, can you take a quick look at @traverseda's suggestions on the similar ticket below?
I actually did implement an OpenSearchDescription file during my testing. It's very much on my radar.
I don't think it's really practical to aggregate search from a bunch of different providers. That either requires all the search engines to use the same ranking algorithm or requires a bunch (more) data duplication.
I think it's a lot easier to set up one system, and have hooks hosts can use to add/remove pages from the index.
I took a quick look at that ticket, which sounds like it would cause massive data duplication to invert every possible web page into an index held centrally. It also assumes that what you are indexing is web pages, rather than the content underlying them ...
I agree that ranking when merging is a challenge, so I'm not saying I've got the solution. But pushing the actual search INTO the apps so they can explore their own data structures / databases seems the right way to do it.
I am still looking for a good text indexing library in javascript. I can pull the text out of our metadata, but it's complicated to pull out the words, remove common words, handle stemming, plurals etc., and throw it all into an index. If anyone knows a good one, I'd be happy to try it out.
> I am still looking for a good text indexing library in javascript. I can pull the text out of our metadata, but it's complicated to [...]
I'm really not a fan of javascript, but you might start with lunr for your text simplification step. Take a look at the text processing section. Of course that "search engine" is very simple, but it could provide the text-processing segment for a better search engine.
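For reference, basic usage of lunr looks something like this (just a minimal sketch; the documents and field names here are made up):

```js
// Minimal lunr sketch: build an in-memory index over some documents and run a query.
// lunr handles tokenizing, stop-word removal, and stemming internally.
const lunr = require('lunr');

const docs = [
  { id: 'doc1', title: 'Offline Wikipedia', body: 'A compressed snapshot of wikipedia articles' },
  { id: 'doc2', title: 'Internet Archive mirror', body: 'Cached items pulled from the archive' }
];

const idx = lunr(function () {
  this.ref('id');       // unique reference field
  this.field('title');  // fields to index
  this.field('body');
  docs.forEach(d => this.add(d));
});

console.log(idx.search('wikipedia')); // => [{ ref: 'doc1', score: ..., ... }]
```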
Of course just javascript isn't enough, since your approach would require every app to provide its own search indexing? Wouldn't that imply a new version for every app/data-set? That sounds very challenging.
Take a look at meilisearch instead?
> which sounds like it would cause massive data duplication to invert every possible web page into an index held centrally.
Sort of. That's a first pass at the problem, which does involve data duplication, but there are ways to get around that while still taking the same approach. Data-duplication is, to a certain extent, a fight against latency. There are three main "levels" of data-duplication in my mental model, provided so we can make sure we're on the same page.
Level 1: Tools like ripgrep can search/tokenize text on the fly, with no search index. They can be shockingly fast, but are still limited by disk IO. Searching through an 80GB wikipedia file will take as long as using dd to copy it, or longer. We're talking several minutes here.
Level 2: We tokenize text, simplify it into root words, and de-duplicate it (a toy sketch of this step follows below). We can't reconstruct the original document from this tokenized text, since (among other things) it doesn't store the order the words appeared in; it just throws all the words into a bag. This is essentially how any serious search engine works, and is the minimum amount of data duplication we can get away with.
Level 3: The same as the previous level, but we also store the article text alongside the "bag of words". This allows us to quickly present a summary of the article, with pertinent phrases highlighted. If you're willing to accept higher latency in returning search results, you could instead extract the text from the document again every time you show the search result for that page. For local web pages with zero network latency that's not too bad. This brings it back in line with level 1 duplication.
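To make the level-2 idea concrete, here's a toy sketch of the bag-of-words step (not my actual server code; a real engine does proper stemming, stop-word handling, and scoring):

```js
// Toy "level 2" sketch: tokenize, crudely normalize, and build an inverted index
// mapping each term to the ids of documents containing it. Word order is discarded.
const STOP_WORDS = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in']);

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(w => w && !STOP_WORDS.has(w));
}

function buildIndex(docs) {
  const index = new Map(); // term -> Set of doc ids
  for (const { id, text } of docs) {
    for (const term of new Set(tokenize(text))) { // de-duplicate per document
      if (!index.has(term)) index.set(term, new Set());
      index.get(term).add(id);
    }
  }
  return index;
}

const index = buildIndex([
  { id: 'p1', text: 'The quick brown fox' },
  { id: 'p2', text: 'Quick searches over offline content' }
]);
console.log(index.get('quick')); // Set { 'p1', 'p2' } -- the original word order is gone
```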
Right now my search server is pure level 2, and relies on spidering the HTML content. My intent was to provide progressive enhancements: instead of being spidered, an app could send a request to the search engine telling it what to index. Instead of storing the full article text you could store a link to where the article text can be found, pull it from that URL, and highlight it.
Providing an OpenSearch endpoint for each app is a lot of duplication of effort.
I like that meilisearch is not java, and rust seems to be an up-and-comer.
@tim-moody The memory usage should be pretty reasonable, so I'm hopeful. The api is a whole lot simpler than xapian/solr too. Just discovered it today, so I haven't had any time to do any real testing.
Also easy to compile/deploy.
It's worth noting that en_wikipedia_all_nopic is only ~36G compressed at the time of this writing. That's the largest text dataset we're likely to encounter, so our worst case for data-duplication is not actually all that much, compared to app multimedia content, presuming our index is compressed. Multimedia is always going to take up the vast majority of the storage space.
I'm using filesystem level compression on the search index, but your results may vary.
I still have three big issues with this approach:
a) I'm not convinced that duplicating all the text is viable - it's a reasonable presumption that most sites will be multimedia dominated, but that won't necessarily be true for all sites. I could be wrong about this one. Wikipedia is a good example - adding another 36GB is non-trivial in a constrained environment.
b) Any app is likely to have to have its own search engine anyway, so we are potentially triplicating data if we don't make use of it. I bet Kiwix, Mediawiki, Kolibri etc. all have to do this. I have two - one that works when online, and I'll have one (to be built) when offline. Opening my online tool to opensearch took me only a couple of hours, and I bet that would be similar for any other app, and we are likely to have to do it to make the apps work on OLIP anyway - which you'll note was the point of this issue :-)
c) Spidering is not going to work for a lot of sites. If you try to spider the internetarchive, for example, you'll end up spidering the entire Internet Archive (60 petabytes), since if the box is online we'll spider everything.
I strongly believe this needs to be app specific, some kind of API. OpenSearch is one viable option, but if you really insist on duplicating the data, then at worst let's collaborate on an API so that an app can feed the data to your search engine, both bulk (one time at startup) and incrementally as we pull items from the net and cache them.
@tim-moody have you tried learning Rust yet? I've gone through the process, and it's one of the most idiosyncratic languages I've come across. Its one advantage seems to be that it runs efficiently in browsers.
lunr looks like the right approach, I should be able to feed it our metadata files directly and hook the search to opensearch easily, or to an API should one get adopted.
Being javascript helps a LOT because we can drop it straight into our server, and index in the crawler and/or as we write the cache while the user is browsing.
@mitra42 beware of memory usage with lunr. We'd like to avoid keeping the entire index in memory. Maybe lunr+lmdb? Presuming there are node bindings for lmdb?
So it looks like lunr might not be the right solution - https://github.com/olivernn/lunr.js/issues/426 suggests people have requested DB storage but it's not been forthcoming. It probably wouldn't be hard for someone who knew lunr to use leveldb for the storage, but it might be easier to keep hunting for an alternative javascript text indexer/searcher.
Also https://github.com/olivernn/lunr.js/issues/306 reports the same issues with memory.
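For what it's worth, a lunr index can be serialized and reloaded (sketch below, using a plain file, though the string could equally go into leveldb/lmdb), but that only solves persistence - lunr.Index.load still pulls the whole index back into memory before you can query it, which is exactly the limitation those issues describe:

```js
// Sketch: persist a lunr index and load it back. This avoids rebuilding the index
// on every start, but the loaded index still lives entirely in memory, so it does
// not solve the memory problem discussed above.
const fs = require('fs');
const lunr = require('lunr');

const idx = lunr(function () {
  this.ref('id');
  this.field('text');
  this.add({ id: '1', text: 'some cached page text' });
});

// Serialize to disk (could equally be stored as a value in leveldb/lmdb).
fs.writeFileSync('index.json', JSON.stringify(idx));

// Later: reload -- note the entire index is parsed back into memory.
const loaded = lunr.Index.load(JSON.parse(fs.readFileSync('index.json', 'utf8')));
console.log(loaded.search('cached'));
```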
> let's collaborate on an API so that an app can feed the data to your search engine,
Absolutely. The first step, in my opinion, would be figuring out a data structure for articles. Here's a first draft. Let's take a look at some of the fields. Everything is optional except id, though of course you need to include some of these fields if you want the article to actually be searchable.
id: A unique id for this article, a lot of the time this can just be a URL but when dealing with WARC files that store multiple versions of the same URL we will need to take a slightly more nuanced approach. Adding two documents with the same ID replaces the older one.
title: The header tag that returns with each entry.
real_url: The actual url to the content, if it differs from what we display in our results.
url: The url to show in the search results. If the content originally came from wikipedia it should show wikipedia.org/somearticle.
summary: The html document to display when we show a summary. We strip the text out of the html and store that.
summary_nostore: Like summary, but don't save the resulting document. Must be used with text_url if you want to show any kind of summary.
text_url: A url returning an HTML representation of a page. We use the same text-extraction as we do for the summary. Of course you can also return the text representation of the page, and that saves us a step.
data_url: A url that returns a json-encoded "mask" that can cover any of these fields for display. Allows for more dynamism in the data display, as it's loading data at display-time.
display_type: If the article is a video, image, or audio file, we can specify a custom display_type which tells the search UI to render the result differently.
search_boost: Request that this result be more/less relevant than it normally would be. I'll try to find some appropriate ranges for this. Indexes of source code should probably be less relevant that indexes of wikipedia/stackoverflow, as an example. Indexes of user-created content should probably be the most important.
You can of course include custom fields and, along with display_type, show them in search results. Meilisearch might support faceting in the future, and those custom fields should then be usable for faceting.
Does that make sense for an initial data type specification? Presumably there would be a url endpoint somewhere where you can call that accepts those data types, and another that lets you delete an article by ID. Also a command line utility that lets you GC dangling articles, articles that don't exist any more but which weren't cleaned up properly.
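To make the shape concrete, here's roughly what submitting one article might look like (field names from the list above; the endpoint path, port, and values are hypothetical, not anything that exists yet):

```js
// Hypothetical: POST one article record to the search service.
// Only `id` is required; the other fields come from the list above.
const article = {
  id: 'https://en.wikipedia.org/wiki/Offline_Wikipedia',
  title: 'Offline Wikipedia',
  url: 'wikipedia.org/wiki/Offline_Wikipedia',           // shown in search results
  real_url: 'http://box.lan/kiwix/A/Offline_Wikipedia',  // where the content actually lives
  text_url: 'http://box.lan/kiwix/A/Offline_Wikipedia',  // used for summary/text extraction
  display_type: 'article',
  search_boost: 1.0
};

// `/api/articles` is an assumed endpoint name for illustration only.
fetch('http://localhost:8080/api/articles', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(article)
}).then(r => console.log(r.status));
```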
Note this issue is getting kinda crowded; it's now got the issues of
Are there any good off-the-shelf opensearch aggregators? I can't find any. It looks like it would be a whole project in and of itself, and that there might be some patent issues.
Moved API to its own issue.
a) That patent expires this month. b) The wikipedia article is just plain wrong - I used a search aggregator in about 1992 that searched multiple databases and returned those results in a common UI; it was called "WAIS" (and was developed by the Internet Archive's founder). c) I don't know, but the legal case in 2012 might have narrowed its focus; text-based search aggregation certainly predated that patent by a long time.
I don't know what OLIP is using for opensearch aggregation.
Honestly I'm not too worried about the patent for a small, open-source, non-commercial/charitable project like this. What is OLIP?
Another platform like IIAB - developed by Bibliothèques Sans Frontières. The name comes from OffLine Internet Platform, and it's one of the results of the Offline Internet Consortium (@holta and I will both be at their meeting next week). It's docker-based so likely to be on larger boxes rather than the RPis that IIAB is optimized for.
Well, I am doing this to scratch my own itch, but I should be able to implement more concrete OpenSearch functionality than just the auto-discovery. I'm still somewhat skeptical that an OpenSearch aggregator will work well (especially as far as latency is concerned), and I'm worried about the amount of work required to implement OpenSearch functionality in each sub-app.
That being said, there's no reason an OpenSearch aggregator can't just use my search service for things that don't provide an OpenSearch endpoint.
I just can't imagine that kind of approach would work performantly or return very good results, though I'd love to be proven wrong.
As part of making the Offline Internet Archive available on OLIP as well as IIAB (and Rachel and standalone), we've implemented the opensearch spec, though we're having problems testing on OLIP currently.
Would it make sense for IIAB to implement opensearch as well ?
It looks to me like there is a UI (search box) that composes a bunch of queries that go to plugins on apps like internetarchive; the UI then collects those results to return a common response.
opensearch wasn't hard to implement from the plugin side (the plugin never has to parse XML). It might be a bit harder from the common server, as it probably has to parse XML and then turn it into HTML.
I'm not sure if there are any competitors to OpenSearch (since I think Google supports OpenSearch). Obviously XML sucks (technical term) but I haven't seen a JSON equivalent. It would be trivial, from the app side, to return a semantically equivalent JSON structure if that made the implementation of the common part easier.
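For illustration, such a JSON structure might look something like this (just a sketch mirroring the usual OpenSearch response elements - totalResults, startIndex, itemsPerPage - plus per-item title/link/description; not an agreed format):

```js
// Sketch of a JSON response an app could return instead of OpenSearch XML.
// The field names are illustrative, not a proposed standard.
const exampleResponse = {
  query: 'offline wikipedia',
  totalResults: 2,
  startIndex: 0,
  itemsPerPage: 10,
  items: [
    {
      title: 'Offline Wikipedia',
      link: 'http://box.lan/kiwix/A/Offline_Wikipedia',
      description: 'A compressed snapshot of wikipedia articles...'
    },
    {
      title: 'Internet Archive mirror',
      link: 'http://box.lan/archive/details/example',
      description: 'Cached items pulled from the archive...'
    }
  ]
};
```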