ipfs / distributed-wikipedia-mirror

Putting Wikipedia Snapshots on IPFS
https://github.com/ipfs/distributed-wikipedia-mirror#readme

How to search for articles? #76

Open SeanPedersen opened 3 years ago

SeanPedersen commented 3 years ago

I am wondering how to use ipns://en.wikipedia-on-ipfs.org/wiki/ effectively? I see no option to search for an article. How am I supposed to find the content I am looking for?

mburns commented 3 years ago

Search isn't directly supported, but theoretically could be in the future.

One option is to search for site:en.wikipedia-on-ipfs.org SEARCH TERMS in your preferred search engine to discover new pages.

lidel commented 3 years ago

We have some prior art in #44. The code is 4 years old, but it could be a good starting point if someone has bandwidth to help with this.

lidel commented 3 years ago

In case someone wants to pick this up before I have spare bandwidth: simply reuse the existing UI from mobile Wikipedia (https://en.m.wikipedia.org/), which already has subtle branding and a search box.


The hamburger menu could be replaced with our icon, and clicking it would jump to the footer explaining the mirror project.

RuiNtD commented 3 years ago

Both Google and DDG have methods of adding a custom website search bar to your website:

https://cse.google.com/
https://duckduckgo.com/search_box

I tested both, and unfortunately the results I'm getting with DDG are all 404 errors because it appends .html to the URLs. If you want to try both out, here are some links:

https://cse.google.com/cse?cx=230751f5750677644
https://duckduckgo.com/search.html?site=en.wikipedia-on-ipfs.org&prefill=Search%20Wikipedia%20on%20IPFS

EDIT: It's also worth noting that both engines show ads above the actual results. On DDG it is possible to remove ads (and branding) with URL params, but that is against the ToS except for personal use.

ngbrown commented 3 years ago

There has recently been some work on hosting a full-text search engine in WebAssembly for very large data sets. This was directly influenced by IPFS's hosting of Wikipedia.

The key feature is that the client pulls only the data it needs from the static index to execute the search. For example, a full-text search on a 14 GByte index takes 2 seconds and needs to download only ~1.5 MByte of the index.
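To make the "pull only what you need" idea concrete, here is a minimal sketch of how a client might turn the byte offsets a query touches into a small number of HTTP Range requests. It assumes the static index is served over plain HTTP with Range support and read in fixed-size pages; PAGE_SIZE, pages_to_ranges and range_header are hypothetical names for illustration, not taken from Tantivy or the pull request.

```python
from typing import Iterable, List, Tuple

PAGE_SIZE = 4096  # hypothetical page size; the real index layout may differ


def pages_to_ranges(byte_offsets: Iterable[int],
                    page_size: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """Map touched byte offsets to merged, page-aligned (start, end) ranges.

    Both ends are inclusive, matching the HTTP Range header convention.
    """
    pages = sorted({offset // page_size for offset in byte_offsets})
    ranges: List[Tuple[int, int]] = []
    for page in pages:
        start = page * page_size
        end = start + page_size - 1
        if ranges and ranges[-1][1] + 1 == start:
            # Adjacent page: extend the previous range instead of
            # issuing a separate request.
            ranges[-1] = (ranges[-1][0], end)
        else:
            ranges.append((start, end))
    return ranges


def range_header(start: int, end: int) -> str:
    """Format the value of an HTTP Range header for an inclusive byte range."""
    return f"bytes={start}-{end}"


# Example: a query that touches four offsets needs only two requests.
for start, end in pages_to_ranges([10, 5000, 4100, 1_000_000]):
    print(range_header(start, end))
```

Merging adjacent pages keeps the request count low, which matters on high-latency storage where each round trip is expensive.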

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust. It was initially designed to run on a server, but Rust can run in WebAssembly. There is a pull request https://github.com/tantivy-search/tantivy/pull/1067 that adapts Tantivy to running fully in WebAssembly on the client side.

See the pull request for a demo using the Wikipedia dataset.

fulmicoton commented 3 years ago

(tantivy creator/maintainer and quickwit CEO) quickwit (https://github.com/quickwit-inc/quickwit) aims precisely at allowing client-side search on distant, high-latency storage. We are in the process of open-sourcing our code under the AGPL license. Once this is done, we'd be happy to help.

unbeatable-101 commented 3 years ago

Speaking of which, what license is the code of distributed-wikipedia-mirror under? If it isn't GPLv3 too, it won't be able to use quickwit.

johnsonjsyuen commented 3 years ago

This shows how it can be done with a static SQLite database that serves as the index; SQLite supports full-text search. (Linked post: "Sqlite static hosted")
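As a rough illustration of SQLite's full-text search, the sketch below builds a tiny in-memory index and queries it. It assumes the SQLite build bundled with Python has the FTS5 extension enabled (common, but not guaranteed); the table name, sample rows and helper functions are made up for the example and are not part of the linked post.

```python
import sqlite3
from typing import Iterable, List, Tuple

# Made-up sample rows standing in for article titles and text.
SAMPLE = [
    ("InterPlanetary File System",
     "IPFS is a protocol for storing and sharing data in a distributed file system."),
    ("SQLite",
     "SQLite is a database engine that supports full-text search."),
    ("WebAssembly",
     "WebAssembly is a portable binary instruction format."),
]


def build_index(rows: Iterable[Tuple[str, str]]) -> sqlite3.Connection:
    """Create an in-memory FTS5 table and load (title, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE articles USING fts5(title, body)")
    conn.executemany("INSERT INTO articles (title, body) VALUES (?, ?)", rows)
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> List[str]:
    """Return titles of matching articles, best match first."""
    cursor = conn.execute(
        "SELECT title FROM articles WHERE articles MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]


conn = build_index(SAMPLE)
print(search(conn, "database"))
```

In a statically hosted setup the database would be a file fetched on demand rather than an in-memory table, but the query side looks the same.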

DmitriyShepelev commented 2 years ago

In Brave Browser, you can create a shortcut that prefixes whatever you type after activating it, which can be used to search for IPFS Wikipedia pages. Go to "Settings > Search engine > Manage search engines and site search > Add", which opens a dialog box for adding a search engine. For example, to use Brave's search engine to search for IPFS Wikipedia pages, enter https://search.brave.com/search?q=site%3Aen.wikipedia-on-ipfs.org %s as the URL, with %s in place of the query (and whatever you want for the Search engine and Shortcut fields).
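The same site-restricted query can also be built in a script. This is a small sketch using only the Python standard library; site_search_url is a hypothetical helper, and the endpoint and site default to the values mentioned in this thread.

```python
from urllib.parse import urlencode


def site_search_url(terms: str,
                    site: str = "en.wikipedia-on-ipfs.org",
                    engine: str = "https://search.brave.com/search") -> str:
    """Build a search URL whose query is restricted to one site.

    The query has the form "site:<site> <terms>", URL-encoded into the
    q parameter in the same way a browser fills in a %s placeholder.
    """
    query = f"site:{site} {terms}"
    return f"{engine}?{urlencode({'q': query})}"


print(site_search_url("alan turing"))
```

Any engine that understands the site: operator and takes the query in a q parameter should work by passing a different engine value.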