ipfs-inactive / archives

[ARCHIVED] Repo to coordinate archival efforts with IPFS
https://awesome.ipfs.io/datasets

Wikipedia #20

Open davidar opened 9 years ago

davidar commented 9 years ago

In terms of being able to view this on the web, I'm tempted to push Pandoc through a Haskell-to-JS compiler like Haste.

CC: @jbenet

rht commented 9 years ago

In this case, why does the xml -> html have to be done client-side?

In the archiver's machine

get-dump dump/  # using any of the tools in https://meta.wikimedia.org/wiki/Data_dumps/Download_tools, there is one with rsync
dump2html -r dump/
ipfs add -r dump/ # and ipns it

(although yes, it'd be much more convenient to just use pandoc as a universal markup viewer)

davidar commented 9 years ago

That's also a possibility, but more time-consuming and inflexible.

On Thu, 17 Sep 2015 11:29 rht notifications@github.com wrote:

In this case, why does the xml -> html have to be done client-side?

In the archiver's machine

get-dump dump/  # using any of the tool in https://meta.wikimedia.org/wiki/Data_dumps/Download_tools, there is one with rsync
dump2html -r dump/
ipfs add -r dump/ # and ipns it

(although yes it'd be much convenient to just use pandoc as a universal markup viewer)

David A Roberts https://davidar.io

DataWraith commented 9 years ago

I actually started on this a while ago, but then thought it would be silly for a single person to attempt this and stopped, but now that I see this issue, I think it might not have been such a bad idea:

I've been experimenting with using a 15GiB (compressed and without images) dump of the English Wikipedia and extracting HTML files using gozim and wget. This gave me a folder full of HTML pages that interlink nicely using relative links.

It took a couple of hours to extract every page reachable from 'Internet' within 2 hops, which amounted to about 1% of the articles in the dump, so it would take at least a week to create HTML pages for the entire dump. And since these HTML files are uncompressed, I'm not sure I have enough disk space available to do the complete dump, but I could repeat my initial trial and make it available in IPFS.

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates. If it is decided that this way of doing it might not be such a bad idea, it might be possible to alter gozim to embed such license information. Or maybe we can simply put a LICENSE-file in the top-most directory.

davidar commented 9 years ago

@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Or maybe we can simply put a LICENSE-file in the top-most directory.

Definitely. See #25

DataWraith commented 9 years ago

@DataWraith Just had a look at the gozim demo, looks really cool. In the short term, this does seem like the best option (apologies for my terse reply earlier @rht :). Would it also be possible to do client-side search with something like https://github.com/cebe/js-search ?

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files. However, resource usage on the client may or may not be prohibitively large.

I'm not sure I have enough disk space available to do the complete dump

If you can give me a script, and an estimate of the storage requirements, I can run this on one of the storage nodes for you :)

There is no real script. It's literally:

  1. gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
  2. wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

This will crawl everything reachable from 'Internet'. It may be possible to directly crawl the index of pages itself, but I haven't tried that yet.

You probably need to wrap gozimhttpd in a while loop, because it tends to crash once in a while. As for storage requirements: the 60,000 articles I extracted take up 5GiB of storage, so a full dump of the 5,000,000 articles in the dump is probably on the order of 500GiB.
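The storage figure is a linear extrapolation from the extracted sample. A minimal sketch of that arithmetic in Python, assuming the average article size in the sample is representative of the whole dump (the function name is made up for illustration):

```python
def extrapolate_size_gib(sample_articles: int, sample_gib: float, total_articles: int) -> float:
    """Scale the sample's on-disk size linearly to the full article count."""
    return sample_gib / sample_articles * total_articles


# Figures from the comment: 60,000 articles in 5 GiB, about 5,000,000 articles overall.
estimate = extrapolate_size_gib(60_000, 5.0, 5_000_000)
print(f"{estimate:.0f} GiB")  # prints "417 GiB", i.e. on the order of 500 GiB
```

The result is only as good as the sample: pages reachable from a single popular article are likely larger than the average page.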

One problem I see with this approach, is that the Creative Commons License requires attribution, which is not embedded in the HTML files gozim creates.

Are you sure? I can see:

This article is issued from Wikipedia. The text is available under the Creative Commons Attribution/Share Alike; additional terms may apply for the media files.

in the footer of http://scaleway.nobugware.com/zim/A/Wikipedia.html

Hm. Maybe that's because they are using a different dump, or a newer version of gozim (though the latter seems unlikely); the pages I extracted don't have that footer.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.

davidar commented 9 years ago

I'm no JavaScript expert, but I don't see why not. We could pre-compile a search index and store it alongside the static files.

For context, this is what @brewsterkahle uses for his IPFS-hosted blog

However, resource usage on the client may or may not be prohibitively large.

Yeah, that was my concern too. If so, it might have to wait until #8

There is no real script. It's literally:

gozimhttpd -path <wikipedia-dump> -port 8080 -mmap
wget -e robots=off -m -k http://localhost:8080/zim/A/Internet.html

Too easy

a full dump of the 5,000,000 articles in the dump is probably on the order of 500GiB.

Ok, we'll have to wait until we get some more storage then.

I'm currently running ipfs add on the pages I have extracted, to get a proof-of-concept going. It's inserting the pages alphabetically, but it tends to crash around the 'D's, with an unhelpful 'killed' message. Possibly ran out of memory.

Thanks. Ping me on http://chat.ipfs.io to help debug.

DataWraith commented 9 years ago

Short progress update: I'm now feeding files to ipfs add in batches of 25, which seems to have solved the memory issue for now. I hope that feeding in the files piecemeal will prevent the crash that occurs when adding the entire directory at once. I'll probably be able to try adding the entire thing again tomorrow.
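A rough sketch of the batching idea in Python. It only builds the command lines and runs nothing; the helper names and the example file names are hypothetical, and passing several paths to a single `ipfs add` call is assumed to behave as in current releases:

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batches(paths: Iterable[str], size: int = 25) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` paths."""
    it = iter(paths)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def add_commands(paths: Iterable[str], size: int = 25) -> List[List[str]]:
    """Build one `ipfs add` argument list per batch; nothing is executed here."""
    return [["ipfs", "add", *chunk] for chunk in batches(paths, size)]


# Hypothetical file names, for illustration only.
commands = add_commands([f"A/page{i}.html" for i in range(60)])
print(len(commands))  # 60 files in batches of 25 give 3 commands
```

Each argument list could then be handed to `subprocess.run`, one batch at a time, so that a failure affects only the batch in flight.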

I also took another look at gozim. It is relatively easy to extract the HTML-files without going through wget first -- should've thought of that before coming up with the wget-scheme. That way we won't miss any articles; I'll have to do more research on redirects though.

Quick & dirty dumping program here.

DataWraith commented 9 years ago

I had no luck getting ipfs add to ingest the HTML files; pre-adding the files in batches didn't do anything. ipfs (without the daemon running) consumed enough RAM to fill a 100GB swap file and then crashed with an error, runtime: out of memory. A script I wrote to add files one by one using the object patch subcommand was too slow, taking 3 to 5 seconds for a single page, so I abandoned that approach.

There are two related issues describing problems with ipfs add. I'll try again once those are resolved.

davidar commented 9 years ago

@DataWraith Hmm, that's no good :confused:. For the moment, could you tar/zip all the files together and add that?

CC: @whyrusleeping

DataWraith commented 9 years ago

Hi.

I've decided to delete the trial-files obtained using wget and go all out and try to actually dump the entire most-recent English Wikipedia snapshot (with images) with my program. It's currently in the 'D's (1.3 million articles done) and I estimate it will finish in another 60 to 70 hours. I'll try adding the dump using the undocumented ipfs tar add, which did not seem to blow up memory-wise in the small trial I did. Not sure why that would be different from the normal ipfs add, but apparently it is. If that still fails, I'll run the tar-archive through lrzip and upload that.

My initial estimate of space required was off, because the article sample I obtained using wget did not contain the small stub articles, of which there are many. The 1.3 million articles I have now add up to 40GiB, so, assuming that the distribution of article sizes is not skewed, we are looking at an overall size of about 160GiB plus maybe another 40GiB for the images. In addition, I'm using btrfs to store the dump, and its built-in compression support halves the actual amount of data stored, so size should not be a problem.

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/
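The "better part of a year" figure follows from simple arithmetic. A small Python sketch, assuming roughly 5,000,000 files (the article count mentioned earlier in the thread, ignoring images) and the 3 to 5 seconds per file quoted above:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def days_to_add(files: int, seconds_per_file: float) -> float:
    """Total wall-clock days when files are added strictly one after another."""
    return files * seconds_per_file / SECONDS_PER_DAY


# Assumed inputs: 5,000,000 files at 3 and 5 seconds each.
low = days_to_add(5_000_000, 3)
high = days_to_add(5_000_000, 5)
print(f"{low:.0f} to {high:.0f} days")  # prints "174 to 289 days"
```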

davidar commented 9 years ago

@DataWraith Awesome, can't wait to see it :)

Edit: ipfs tar add is not much faster than the custom script I had cobbled together earlier. At 3 to 5 seconds per file, it'd take the better part of a year to add the entire dump. :/

@whyrusleeping Please make ipfs add faster :pray:

rht commented 9 years ago

@whyrusleeping

For scale (foo/ is 11 MB, 10 files of 1.1 MB each):

It appears that cp doesn't have an explicit call to fsync in its implementation https://github.com/coreutils/coreutils/search?utf8=%E2%9C%93&q=fsync. (I think it's fine to not have explicit sync call?)

whyrusleeping commented 9 years ago

@davidar @rht okay, I'll make that top priority after UDT and ipns land.

rht commented 9 years ago

(git does explicit sync https://github.com/git/git/blob/master/pack-write.c#L277 edit: but only on pack updates)

rht commented 9 years ago

@davidar I get your point, which either means 1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated piece of paper" (more of what a "document"/"book" should be). Though it is currently slow (e.g. pandoc pdf to html << (or maybe ~) pdf.js << browser plugin for pdf).

As for the client-side search, it works for small sites, but for huge sites (Wikipedia?), transporting the index files to the client seems to be too much.

rht commented 9 years ago

I wonder if some of the critical operations should be offloaded to FPGA.

davidar commented 9 years ago

1. "if someone can put the kernel on the browser, why not pandoc", or 2. "we need to be able to do more than just viewing static simulated piece of paper" (more of what a "document"/"book" should be).

Uh oh, which side of this argument am I on now? #25 @jbenet

with the client-side search, it works for small sites, but for huge sites (wikipedia?), transporting the index files to the client seems to be too much.

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.
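To make the trie idea concrete, here is a toy Python sketch. It is not the IPLD API: `BlockStore` is a hypothetical in-memory stand-in for a content-addressed store, the node layout is invented for illustration, and the page names are made up. It shows that a lookup only fetches the nodes along the path of the query term:

```python
import hashlib
import json


class BlockStore:
    """Toy content-addressed store standing in for IPFS/IPLD (an assumption)."""

    def __init__(self):
        self.blocks = {}
        self.fetches = 0

    def put(self, node: dict) -> str:
        data = json.dumps(node, sort_keys=True).encode()
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data
        return key

    def get(self, key: str) -> dict:
        self.fetches += 1  # counts how many blocks a client would download
        return json.loads(self.blocks[key])


def build_index(store: BlockStore, postings: dict) -> str:
    """Build a character trie from term -> pages and store it; return the root hash."""
    root = {"children": {}, "pages": []}
    for term, pages in postings.items():
        node = root
        for ch in term:
            node = node["children"].setdefault(ch, {"children": {}, "pages": []})
        node["pages"] = sorted(pages)

    def store_node(node: dict) -> str:
        # Children are stored first so the parent can link to them by hash.
        links = {ch: store_node(child) for ch, child in node["children"].items()}
        return store.put({"children": links, "pages": node["pages"]})

    return store_node(root)


def lookup(store: BlockStore, root_hash: str, term: str) -> list:
    """Walk the trie one character at a time, fetching one block per step."""
    node = store.get(root_hash)
    for ch in term:
        child = node["children"].get(ch)
        if child is None:
            return []
        node = store.get(child)
    return node["pages"]


store = BlockStore()
root = build_index(store, {
    "ipfs": ["IPFS.html"],
    "ipld": ["IPLD.html"],
    "wiki": ["Wikipedia.html", "Wiki.html"],
})
store.fetches = 0
result = lookup(store, root, "ipld")
print(result, store.fetches, len(store.blocks))
```

In this example the query touches 5 of the 11 stored blocks. A real index would use wider nodes than one character per level, but the access pattern is the same.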

rht commented 9 years ago

The idea is that you'd encode the index as a trie and dump it into IPLD, so the client would only have to download small parts of the index to answer a query.

And this can be repurposed for any 'pre-computed' stuff, not just search indexes? e.g. (content sorted/filtered by paramX, or entire sql queries https://github.com/ipfs/ipfs/issues/82?)

davidar commented 9 years ago

@rht yes, I would think so, I don't see any reason why it wouldn't be possible to build a SQL database format on top of IPLD (albeit non-trivial)

davidar commented 9 years ago

@rht looks like someone already beat me to it: http://markup.rocks

rht commented 9 years ago

@davidar by a few months. Very useful to know that it is fast. Currently imagining the possibilities.

Also, found this http://git.kernel.org/cgit/git/git.git/tree/Documentation/config.txt#n693:

This is a total waste of time and effort on a filesystem that orders data writes properly, but can be useful for filesystems that do not use journalling (traditional UNIX filesystems) or that only journal metadata and not file contents (OS X's HFS+, or Linux ext3 with "data=writeback").

@whyrusleeping disable fsync by default and add a config flag to enable it? (wanted to close the gap with git, which is still 2 orders of magnitude away).
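For a feel of what an fsync per file costs, here is a small Python sketch that writes the same files with and without an explicit sync. It is not how go-ipfs writes blocks; the function name, file names, and sizes are arbitrary, and the timing gap depends heavily on the disk and filesystem:

```python
import os
import tempfile
import time


def write_files(directory: str, count: int, size: int, sync: bool) -> float:
    """Write `count` files of `size` bytes; optionally fsync each one. Returns seconds taken."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(directory, f"block{i}")
        with open(path, "wb") as f:
            f.write(payload)
            if sync:
                f.flush()
                os.fsync(f.fileno())  # force the data to stable storage
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    with_sync = write_files(tmp, 50, 4096, sync=True)
    without_sync = write_files(tmp, 50, 4096, sync=False)
    print(f"fsync per file: {with_sync:.4f}s, no fsync: {without_sync:.4f}s")
```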

davidar commented 9 years ago

Very useful to know that it is fast.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well. The FP Complete IDE is also written in a subset of Haskell.

Currently imagining the possibilities.

Something like the ipfs markdown viewer but using pandoc would be cool.

davidar commented 9 years ago

IPFS-hosted version of markup.rocks: https://ipfs.io/ipfs/QmSyfirfxBbgh8sZPzy4yyMQjHgzKX7iQeXG9Zet4VYk9P/

rht commented 9 years ago

@davidar saw it, neat. i.e. it's a pandoc but without the huge GHC stuff, cabal-install ritual, etc. It's a pandoc.

Yeah, Haskell is high-level enough that it tends to compile to JS reasonably well.

But so do Python, Ruby, ... You mean sane type system? https://github.com/faylang/fay/wiki says fay doesn't have GHC's STM, concurrency--which is fine.

This has nice things like:

Additionally, because all Fay code is Haskell code, certain modules can be shared between the ‘native’ Haskell and ‘web’ Haskell, most interestingly the types module of your project. This enables two things: The enforced (by GHC) coherence of client-side and server-side data types. The transparent serializing and deserializing of data types between these two entities (e.g. over AJAX).

(haven't actually looked at a minimalist typed :lambda: calculus metacircular evaluator (the one people write (or chant) every day for the untyped ones))

davidar commented 9 years ago

... You mean sane type system?

Yeah, I meant of the languages with a strong enough type system to be able to produce optimised code

davidar commented 9 years ago

Also see this simple but awesome wiki editor by @jamescarlyle

jamescarlyle commented 9 years ago

The source of the v.basic wiki editor referenced by David is at https://github.com/jamescarlyle/ipfs-wiki

rht commented 9 years ago

How to make this work? the text I typed didn't show up.

jamescarlyle commented 9 years ago

@rht, sorry, I posted it without any public testing. I've added the briefest of READMEs to the GH repo - specifically, "There is a current dependency on a local daemon listening on port 5001 (this is the default port for the IPFS daemon), in order to both fetch content and save changes. This means that the IPFS gateway used to serve the js also needs to use the same protocol, i.e. http rather than https." So running a daemon and serving locally should be fine. Will get to running via a public gateway in due course; sorry about that.

rht commented 9 years ago

Yes I did use local daemon and the link http://localhost:8080/ipfs/Qmb2ymoF197UWEaDiAcHZFQKcj2nMmPTs2xRgVCb2nerdx/#/QmeKV1ptkqEeishBTW7twN5ij4VesAB6M7EEnGh5YdjUPf/TitlePage.

I must be missing something here.

DataWraith commented 9 years ago

Hi all,

the Wikipedia dump is finished. I packed it into a single .tar-file weighing in at 176GB, which lrzip then compressed down to 42GB. My internet connection, while decent, will still take its time to upload that much data; I'll edit this post with a Dropbox link to the file once the upload is done.

Edit: Dropbox link: https://www.dropbox.com/s/7ut0g1mdbwuq393/wikipedia_en_all_2015-05.tar.lrz?dl=0

jbenet commented 9 years ago

Maybe should be put directly to one of our storage nodes with scp.

davidar commented 9 years ago

Maybe should be put directly to one of our storage nodes with scp.

@DataWraith let me know if there's anything I can do to help with this

CC: @lgierth

rht commented 9 years ago

(ic, for the ipfs-wiki, I was blocked by cors...)

davidar commented 9 years ago

@DataWraith Awesome, downloading now :)

davidar commented 9 years ago

@DataWraith And now it's on IPFS :balloon:

@whyrusleeping Looking forward to ipfs add being fast enough to handle the extracted version ;)

DataWraith commented 9 years ago

@davidar Awesome!

whyrusleeping commented 9 years ago

@davidar it's very high on my todo list.

davidar commented 9 years ago

@whyrusleeping :heart:

rht commented 8 years ago

This can proceed with https://github.com/ipfs/go-ipfs/pull/1964 + https://github.com/ipfs/go-ipfs/pull/1973 merged (pending @jbenet's CR). nosync is still not sufficient.

davidar commented 8 years ago

@rht that's awesome :). Are you also testing perf on spinning disks (not just SSDs)? It seems to be the random access latency that really kills perf

Edit: also make sure the test files are created in a random order (not in lexicographical order)

rht commented 8 years ago

The first reduces the number of operations needed (including disk io), so will make add on HDD faster. For the second, channel iterators in golang have been reported to be slow (but I'm not sure of their direct impact on disk io), so it should make add on HDD faster.

jbenet commented 8 years ago

on it! (cr)

DataWraith commented 8 years ago

I'm trying out those pull requests on the Wikipedia dump right now. ipfs tar add still crashed with an out-of-memory error, but plain ipfs add -r -H -p . is chugging along nicely. It's been running for almost 12 hours now, so hopefully it's not going to crash.

It has added the articles starting with numbers, and is now working on the articles starting with A, so it'll be a while until the whole dump is processed.

jbenet commented 8 years ago

@DataWraith thanks, good to hear -- btw, dev0.4.0 has many interesting perf upgrades, with flags like --no-sync which should make it much faster.

dignifiedquire commented 8 years ago

ipfs add is much faster in 0.4. Maybe we can revisit this and try to set up a script to constantly update the mirrored version in ipfs.

eminence commented 8 years ago

Instead of working with the massive Wikipedia, I've been playing with the smaller, but still sizable Wikispecies project. It has 439,460 articles, and is about 4.5 GB on disk.

I've imported the static HTML dumps from the Kiwix openzim dump files. The dump to disk took less than 10 minutes, and the import into ipfs (with ipfs040 with Datastore.NoSync: true) took about 3 or 4 hours.

It's browsable on my local gateway, but I've not been able to get the site to load on the ipfs public gateways. Can any of you try?

http://localhost:8120/ipfs/QmbZp1H1mCbVSiD2K8xpFFhzRGoLJTU6E4keY9WQpyuxP1/A/index.htm

(edit Jan 14th -- after upgrading my nodes to master branch, I stopped running my dev040 node, so this hash is no longer available. Stay tuned for updates)

davidar commented 8 years ago

I've not been able to get the site to load on the ipfs public gateways

Same :/

eminence commented 8 years ago

Ok, here is my next iteration on this project:

http://v04x.ipfs.io/ipfs/QmV6H1quZ4VwzaaoY1zDxmrZEtXMTN1WLJHpPWY627dYVJ/A/20/8f/Main_Page.html

This is also an IPFS-hosted version of Wikispecies, but with one major change:

Instead of having every article in one massive folder, each article has been partitioned into sub-folders based on the hash of the filename. For articles, there are two levels of hashing, and for images there is one level of hashing.

The goal of this is to reduce the number of links in the A/ and I/m nodes, since they appeared to be too large to load via the public IPFS gateways. I think in this regard, this has been successful.
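A minimal Python sketch of this kind of partitioning. The actual tool's hash function and directory naming are not stated in this thread, so MD5 and two-hex-character directory names are assumptions (the two-character directories in the link above suggest something similar), and `shard_path` and `logo.png` are hypothetical names:

```python
import hashlib
from pathlib import PurePosixPath


def shard_path(namespace: str, filename: str, levels: int) -> str:
    """Place `filename` under `levels` nested directories derived from its hash."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    # Each level uses the next two hex characters, so at most 256 entries per directory level.
    parts = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return str(PurePosixPath(namespace, *parts, filename))


# Two levels for articles, one level for images, as described above.
article = shard_path("A", "Main_Page.html", levels=2)
image = shard_path("I/m", "logo.png", levels=1)
print(article)
print(image)
```

Because the directory is derived from the file name alone, a link rewriter can compute the target path of any article without consulting a lookup table.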

However, there still seem to be some issues. As I browse around the Main_Page.html link (see above), sometimes the page will load quickly and instantly. Other times, images will be missing, the page will load slowly, or maybe even not at all. This is true even for pages that I've visited already (and thus should be in the gateway's cache)

I can't really tell what's going on here. Running ipfs refs on these hashes from another node of mine works pretty flawlessly. So I conclude the problem might not be with my node. But I'm not sure what other debugging tricks I can use to get to the bottom of this. I think this is a fairly important issue to resolve.

Finally, here are the two tools I wrote in the process of working on this:

zim dumping takes a few minutes, wiki_rewriting takes less than an hour, and ipfs add -r probably took a few hours. In all cases, I appear to be disk-I/O bound.

whyrusleeping commented 8 years ago

@eminence this is great! It also further emphasizes the fact that we need to figure out directory sharding. I'll think on this today and see what I come up with.

Keep up the good work :)