searchmysite / searchmysite.net

searchmysite.net is an open source search engine and search as a service
GNU Affero General Public License v3.0

Indexing: Index wikipedia #10

Closed: m-i-l closed this issue 2 years ago

m-i-l commented 3 years ago

A user submitted https://en.wikipedia.org/ via Quick Add. As per my rejection note (which you can see by trying to resubmit) I would love to index wikipedia, but it would require custom dev and likely an infra upgrade.

The big advantage of including wikipedia would be that it would turn searchmysite.net from a niche search into a more general search, and therefore give the site more "stickiness". It wouldn't be a departure from the original philosophy, which is (among other things) to index just the "good stuff", to penalise pages with adverts, and to focus on personal and independent websites at first (I think wikipedia still falls under the category of "independent website").

However, given the 6M+ English pages, and 20M+ pages in other languages, spidering it via the normal approach would not be a good idea. Indeed, the page at https://en.wikipedia.org/wiki/Wikipedia:Database_download even says "Please do not use a web crawler to download large numbers of articles." A better idea would be to periodically download the database, and have a custom indexer for that database. tblIndexedDomains could have a column added for indexer type. It may require some Solr schema changes too in order to get the most out of it. It would have to be listed as not owner verified, and of course an exception made to the 50-page limit for non-owner-verified sites.
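To make that concrete, here is a rough sketch of what such a custom indexer's ingest loop might look like, streaming the standard pages-articles export rather than crawling. index_page() is a hypothetical stand-in for whatever submits documents to Solr; everything else is standard library.

```python
# Streams <page> elements out of the multi-GB MediaWiki XML dump without
# loading it into memory. index_page() is hypothetical.
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # from dumps.wikimedia.org

def _local(tag):
    # Strip the MediaWiki export namespace, e.g. "{...}title" -> "title"
    return tag.rsplit("}", 1)[-1]

def iter_articles(path):
    """Yield (title, wikitext) for main-namespace pages, streaming the dump."""
    with bz2.open(path, "rb") as f:
        for _event, elem in ET.iterparse(f, events=("end",)):
            if _local(elem.tag) != "page":
                continue
            fields = {_local(child.tag): child for child in elem}
            ns, title = fields.get("ns"), fields.get("title")
            if ns is not None and ns.text == "0" and title is not None:
                text = ""
                revision = fields.get("revision")
                if revision is not None:
                    for child in revision:
                        if _local(child.tag) == "text":
                            text = child.text or ""
                yield title.text, text
            elem.clear()  # essential to keep memory flat over 6M+ pages

for title, text in iter_articles(DUMP_PATH):
    index_page(title, text)  # hypothetical hook into the existing Solr indexer
```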

Not a trivial undertaking, and it would almost certainly require a CPU, memory, and disk upgrade for the production server, i.e. increased running costs. But not completely out of the question either.

ScootRay commented 3 years ago

It may just be me, but I'm not sure it's the best thing to do. Anyone can go to Wikipedia to dig up stuff and it would seem redundant. My biggest concern is having to wade through wiki material when I don't want to in the first place. It's very hard to find highly relevant and highly focused search engines that cover specific areas, so Wikipedia may dilute that value.

Just my humble thoughts : )

Ray

m-i-l commented 3 years ago

Thanks for your feedback.

Search results on searchmysite.net are grouped by domain, so if indexing wikipedia, the "worst" that could happen (from the user's perspective) is that there's one extra group of results for every search query (and from my perspective the worst that could happen is that it doubles the running costs).
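For anyone curious how that grouping works, it is essentially Solr result grouping. A hedged sketch, assuming a core named content and a domain field, which may not match the actual searchmysite schema:

```python
# Domain-grouped search via Solr's result grouping parameters.
import requests

params = {
    "q": "some search terms",
    "group": "true",           # one group of results per domain...
    "group.field": "domain",   # ...so wikipedia.org adds at most one group
    "group.limit": 3,          # and only a few pages within that group
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/content/select", params=params)
for group in resp.json()["grouped"]["domain"]["groups"]:
    print(group["groupValue"], "->", len(group["doclist"]["docs"]), "results")
```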

I do still think it could make it a search engine people would be prepared to use on a slightly more frequent basis, e.g. instead of looking something up on wikipedia, look it up on here to get the wikipedia link and see if anyone has written anything interesting about the topic.

And I still like the idea of ultimately turning it into a more general purpose search engine, with the crucial differentiator that it only searches the useful and interesting parts of the web, potentially still under the "personal and independent websites" categories (although given the amount of time I've been able to spend on it recently that's probably a fair way off).

If it sounds like I'm trying to talk myself into doing this, I have to admit it is partly because indexing wikipedia is also simply an itch I'd like to scratch :-) I could always remove it if it wasn't useful.

BTW, some people have talked about liking the idea of a search where you can suppress results from certain domains, so that could be an option, although it may require user profiles to be more useful, and that's something I'm trying to avoid.
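One way that suppression could work without user profiles is to pass excluded domains per request and turn them into a negative Solr filter query. Purely illustrative; search() is a hypothetical helper, and the endpoint and field names are assumptions:

```python
# Per-request domain suppression as a negative Solr filter query.
import requests

def search(q, exclude_domains=()):
    params = {"q": q, "wt": "json"}
    if exclude_domains:
        # e.g. fq=-domain:(wikipedia.org OR example.com)
        params["fq"] = "-domain:(%s)" % " OR ".join(exclude_domains)
    return requests.get("http://localhost:8983/solr/content/select",
                        params=params).json()

results = search("static site generators", exclude_domains=["wikipedia.org"])
```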

m-i-l commented 2 years ago

- Written and deployed the bulk import scripts.
- Added an indexing_type column to tblIndexedDomains to allow for different forms of indexing.
- Set indexing_type to 'spider/default' for everything indexed at the moment, and updated the indexing and management scripts accordingly.
- Moved wikipedia.org from tblExcludeDomains to tblIndexedDomains, with an indexing_type of 'bulkimport/wikipedia'.
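The database side of those changes might look something like the following (a hedged sketch: connection details, exact DDL, and column types are assumptions; only the table, column, and value names come from the list above):

```python
# Sketch of the schema change and data moves, using psycopg2 against the
# project's PostgreSQL database. Connection string is assumed.
import psycopg2

conn = psycopg2.connect("dbname=searchmysitedb")
with conn, conn.cursor() as cur:
    # New column to distinguish spidered sites from bulk-imported ones
    cur.execute("ALTER TABLE tblIndexedDomains "
                "ADD COLUMN indexing_type TEXT DEFAULT 'spider/default'")
    # Everything indexed so far keeps the default spider behaviour
    cur.execute("UPDATE tblIndexedDomains SET indexing_type = 'spider/default'")
    # Move wikipedia.org from the exclude list to the indexed list
    cur.execute("DELETE FROM tblExcludeDomains WHERE domain = 'wikipedia.org'")
    cur.execute("INSERT INTO tblIndexedDomains (domain, indexing_type) "
                "VALUES ('wikipedia.org', 'bulkimport/wikipedia')")
```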

Also increased the Solr Java memory, decreased the number of concurrent sites that can be indexed (to reduce memory use), and removed a CPU- and memory-intensive clause from the boost (relevancy tuning).

Now that wikipedia.org is in tblIndexedDomains, sites being indexed or reindexed will have wikipedia links included in their indexed_outlinks, so when wikipedia itself is indexed it can determine the correct indexed_inlinks (and indexed_inlink_domains etc.) for the PageRank-like relevancy tuning. It'll take 28 days for all sites to be reindexed naturally, i.e. without a forced reindex.
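Illustratively, the inlink derivation could be as simple as querying other documents' indexed_outlinks for the page URL at (re)index time. Core, endpoint, and field names here are assumptions based on the description above:

```python
# Derive a Wikipedia page's inlinking domains from other sites' outlinks.
import requests

SOLR_SELECT = "http://localhost:8983/solr/content/select"

def inlink_domains(page_url):
    """Return the distinct domains whose indexed pages link to page_url."""
    params = {
        "q": 'indexed_outlinks:"%s"' % page_url,
        "fl": "domain",
        "rows": 1000,
        "wt": "json",
    }
    docs = requests.get(SOLR_SELECT, params=params).json()["response"]["docs"]
    # The size of this set feeds the PageRank-like boost: pages linked to
    # from more distinct domains rank higher.
    return sorted({doc["domain"] for doc in docs})

print(inlink_domains("https://en.wikipedia.org/wiki/Solr"))
```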

ScootRay commented 2 years ago

Sounds good. Glad you are still working on this, great software, Thanks!

m-i-l commented 2 years ago

> Sounds good. Glad you are still working on this, great software, Thanks!

@ScootRay Many thanks for your support. It is always great to hear from users, especially when it is positive feedback.

m-i-l commented 2 years ago

The Wikipedia indexing script itself is automated, checking which Wikipedia export was used for the last import and whether a new one is available. However, it requires around 150GB of storage while it is running, which is a lot to be paying for when unused, so I'm increasing storage while it runs and decreasing it afterwards, and that is not something I have automated. So the script is run manually rather than via a scheduled job.
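A minimal sketch of that "is there a new export?" check, assuming the timestamp of the last imported dump is kept in a local file and compared against the Last-Modified header of the latest dump. run_bulk_import() is a hypothetical stand-in for the import itself; the real script's bookkeeping may differ:

```python
# Check whether a newer Wikipedia export is available than the one last imported.
import requests
from pathlib import Path

LATEST = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"
STATE = Path("last_wikipedia_import.txt")

def new_export_timestamp():
    """Return the dump's Last-Modified value if it differs from the last import."""
    remote = requests.head(LATEST, allow_redirects=True).headers["Last-Modified"]
    last = STATE.read_text().strip() if STATE.exists() else ""
    return remote if remote != last else None

stamp = new_export_timestamp()
if stamp:
    run_bulk_import()        # hypothetical: the bulk import described above
    STATE.write_text(stamp)  # only record the dump once it has been imported
```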

I hope to write a blog post with more information on how it all works in the next 2-3 weeks. Until then, closing this as complete.

m-i-l commented 2 years ago

Blog entry with plenty of further details at https://blog.searchmysite.net/posts/searchmysite.net-now-with-added-wikipedia-goodness/.