ipfs / distributed-wikipedia-mirror

Putting Wikipedia Snapshots on IPFS
https://github.com/ipfs/distributed-wikipedia-mirror#readme

Set up collaborative pinning clusters #68

Closed lidel closed 3 years ago

lidel commented 4 years ago

ipfs-cluster 0.12.0 shipped collaborative clusters, and there is a website with some demo datasets:

Collaborative clusters are public IPFS Clusters that anyone can join to help replicating and re-distributing content on the IPFS network. – https://collab.ipfscluster.io
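For reference, joining such a cluster as a follower boils down to a single command. A minimal sketch, assuming a matching ipfs-cluster-follow release (the cluster name and config source below are placeholders; the real ones are listed on collab.ipfscluster.io):

# join a collaborative cluster as an untrusted follower (name/config source are placeholders)
ipfs-cluster-follow wikipedia run --init collab.ipfscluster.io
# in another terminal: see what the follower is tracking
ipfs-cluster-follow wikipedia list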

TODO

lidel commented 4 years ago

@hsanjuan are you able to provide some thoughts on this?

My hope is to automate things so the collaborative cluster gets updated every time snapshot-hashes.yml in master of this repo changes (we can change the file format; it's not parsed by anything atm), but a semi-automatic process could be OK for now too.

RubenKelevra commented 4 years ago

@lidel it's pretty simple. If you have a machine with the data pinned in IPFS, you can run ipfs-cluster alongside it.

You need ipfs-cluster-service and ipfs-cluster-ctl.

The cluster needs to be initialized:

ipfs-cluster-service init --consensus crdt

then the cluster service needs to run continuously as a daemon:

ipfs-cluster-service daemon

Note that this machine is by default the only machine able to change the data in the cluster (trusted peer). You can add other trusted peers later.
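As a rough illustration (the exact service.json layout may differ between ipfs-cluster releases, and the peer IDs below are placeholders), the trusted peers of a crdt cluster are listed in the configuration:

# ~/.ipfs-cluster/service.json (excerpt, assumed layout)
#   "consensus": {
#     "crdt": {
#       "trusted_peers": [ "12D3KooW...peerA", "12D3KooW...peerB" ]
#     }
#   }
# restart the daemon afterwards so the change takes effect
ipfs-cluster-service daemon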

You can now go ahead and pin a CID to the cluster:

ipfs-cluster-ctl pin add [command options] <CID|Path>

There are command options available which define, for example, how many copies of the data should be kept in the cluster.

I think in the end we should shard the data in the cluster, meaning that the content is split into multiple pins, and these pins get a minimum and maximum number of replicas.
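A minimal sketch of what such a pin could look like (the CID, name and replication numbers are placeholders):

# keep between 3 and 10 copies of this pin in the cluster
ipfs-cluster-ctl pin add --name wikipedia_tr --replication-min 3 --replication-max 10 <CID>
# check where the cluster allocated it
ipfs-cluster-ctl status <CID>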

RubenKelevra commented 4 years ago

When we add the current data, we should change the chunker to rabin, which allows for diff-like updates between versions.

We can also set the --expire-in flag to unpin old versions of pages and files automatically, like after a year? If someone wants to archive the older versions, this can be done by pinning them in IPFS.

In the future we could modify the pages a bit and offer a bit of history, like adding a link to the previous version of the file. When we update to a new version, we just add a static link which leads to the version before.

When the cluster has automatically unpinned a file, it remains available until the last IPFS node has unpinned it and cleared it out of its cache.

The most efficient way to update pins is to add the new version with IPFS and then trigger an update of the cluster with ipfs-cluster-ctl pin update. This pushes the update to the same cluster nodes which had the old file pinned before, which reduces the amount of data necessary to store the new file, since each node already has the old version and only needs the difference.
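A sketch of that workflow under the assumptions above (directory, names and CIDs are placeholders; flag spellings should be checked against the installed versions):

# add the new snapshot with the rabin chunker so unchanged blocks deduplicate against the old one
ipfs add -r --chunker=rabin ./snapshot_dir            # prints <newCID>
# first version: pin with an automatic expiry (8760h is roughly one year)
ipfs-cluster-ctl pin add --name wikipedia_en --expire-in 8760h <oldCID>
# later versions: reuse the allocations of the old pin so peers only fetch the difference
ipfs-cluster-ctl pin update <oldCID> <newCID>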


We just have to consider that certain files or pages will not be updated for a long time.

RubenKelevra commented 4 years ago

We should better wait a while with the cluster setup, until the development of IPFS has advanced a bit:

https://github.com/ipfs/go-ipfs/issues/6841#issuecomment-576811851

If we get a bunch of people together to be part of the project, it's best to already have all the features we want at that point, so that people don't have to upgrade and we don't have to restore the already published information.

@lidel hope you share the opinion :)

lordcirth commented 4 years ago

go-ipfs 0.5.0 is going to have a lot of changes in general, so waiting at least for that is probably sensible. There's also the upcoming switch to Badger 2.0, which could make migrating individual nodes difficult, but doesn't affect the network as a whole.

https://github.com/ipfs/go-ipfs/issues/6776

RubenKelevra commented 4 years ago

@lordcirth can Badger 2.0 be compacted when running an ipfs-cluster? The cluster has to unpin older versions after a while for storage reasons, but everyone can still pin the old versions themselves and keep them alive in the network.

(that's my current mindset - others might think differently about this!)

lordcirth commented 4 years ago

You mean garbage collection? I expect it should behave pretty much like Badger 1.0 did from the perspective of the user, and have all the same features.

RubenKelevra commented 4 years ago

Garbage collection

Well, no, I mean the compaction of Badger itself underneath IPFS. For v1 there were some issues reported, for example this one:

https://github.com/salesforce/sloop/issues/90

hsanjuan commented 4 years ago

@lidel since these are BIG (how big btw?), I'll need to deploy a big machine. Then we can automate adding the pin from the CI on this repo. I'll let you know

RubenKelevra commented 4 years ago

@hsanjuan we could add each page/picture as individual pin with a minimum and maximum amount of replications. This way not every cluster member needs to hold the full amount of data, and each new pin is stored on the cluster members with the most free space available.

The en version of 2017 is just shy of 1 TB (I don't know the exact number, but it's quite big). If you consider the growth rate and that we want to hold the 13 languages with the most articles (see https://github.com/ipfs/distributed-wikipedia-mirror/issues/63 ), we need considerably more storage than a single machine can easily hold (servers with a large number of hard drive slots get expensive fast).

hsanjuan commented 4 years ago

@hsanjuan we could add each page/picture as individual pin with a minimum and maximum amount of replications.

Unfortunately, on a collaborative cluster this poses a lot of problems: participants are not trusted to actually be holding the content, not stable etc...

RubenKelevra commented 4 years ago

@hsanjuan we could add each page/picture as individual pin with a minimum and maximum amount of replications.

Unfortunately, on a collaborative cluster this poses a lot of problems: participants are not trusted to actually be holding the content, not stable etc...

True. But I expect that people who are willing to share space and traffic for such a project will do this because they want the project to succeed.

The cluster will always try to replicate more copies if fewer than pin-max are available, so if some machines come and go, there's no issue at all. (As far as I understand this functionality.)

Especially if we list some minimum requirements, I think many people will join such a project, as long as we don't require something like 20 TB of storage.

This will greatly increase the distributed character of this project and would fit exactly what Wikipedia wants to achieve: everyone can participate.

Suggestions for minimum requirements:
- constant internet connection
- 24/7 runtime
- minimum bandwidth of x Mbit/s
- unlimited traffic

Additionally we need to organize the data pinned in the cluster in files/folders anyway, to make it accessible via paths rather than CIDs.

If you'd like to replicate a full copy, you're still able to do so: just locally pin the root folder of the language you'd like to cache.

hsanjuan commented 4 years ago

My idea is to have a separate cluster per language to start with. Many languages are small enough for anyone to keep a copy.

RubenKelevra commented 4 years ago

My idea is to have a separate cluster per language to start with. Many languages are small enough for anyone to keep a copy.

This might result in poor availability for the languages of less developed countries.

I would prefer a single setup. This way we reduce the setup and maintenance time, and we can add additional languages once we have completed testing the system, without major changes on the follower systems.

Additionally, the same media files are often used in different language versions. By creating just a single cluster we can avoid storing them more than once.

hsanjuan commented 4 years ago

Additionally, the same media files are often used in different language versions. By creating just a single cluster we can avoid storing them more than once.

Ok, another question here is, if we pinned each object individually, how many objects would that be?

RubenKelevra commented 4 years ago

One per page, one per category, one per image, I guess.

Doesn't really matter, we're currently doing the same thing with the static mirrors.

The difference is that each file will be pinned individually instead of by reference, to be able to run 'ipfs-cluster-ctl pin update' to push a new version of an article to the same nodes that stored it before, for efficient deduplication.

Best case scenario would be to get the latest edits from Wikipedia's special page, generate the new versions, and push them to the cluster. Or, if we want fewer updates, just get the list and process it after 24 h.

This way we can merge article edits that have been made within 24 h, to reduce the update rate.

The new folder CID can be published to IPNS after we have completed our list of updates.

Old versions can be marked with a pin timeout, to allow access to older versions for a while (depending on how much space we have).

lordcirth commented 4 years ago

How well will ipfs-cluster handle millions of separate pins? Has this been tested?

RubenKelevra commented 4 years ago

@lordcirth we're here to test this, I guess.

I'm currently working on a different cluster; after that work is finished, I was thinking about looking into setting this one up.

hsanjuan commented 4 years ago

While cluster might break with millions of individual pins (particularly in crdt mode), ipfs probably has its own share of troubles with that too. Therefore I'm a bit reticent...

RubenKelevra commented 4 years ago

@hsanjuan well, how could we otherwise get this to work? 🤔

I mean, IPFS can hold the current mirrored version, which is also quite a lot of files.

I just thought about using the inline feature of IPFS and cranking the limit up, so small files don't need separate referenced blocks.
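Something along these lines, with the limit value just as an example:

# store very small files directly inside their CID instead of as separate blocks
ipfs add --inline --inline-limit=256 small_file.txt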

hsanjuan commented 4 years ago

@RubenKelevra by having a single root hash pinned

RubenKelevra commented 4 years ago

@RubenKelevra by having a single root hash pinned

So we need to modify an MFS tree when stuff is updated and then pin the new root recursively?

Wouldn't this lead to a single pin which is over a terabyte, which needs to be held by every cluster member completely? 🤔
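Roughly like this, I assume (paths and CIDs are placeholders):

# replace the old article in the MFS tree with the updated one
ipfs files rm /wiki/A/Some_Article.html
ipfs files cp /ipfs/<newArticleCID> /wiki/A/Some_Article.html
# read the new root of the tree...
ipfs files stat --hash /wiki                          # prints <newRootCID>
# ...and move the cluster pin from the old root to the new one
ipfs-cluster-ctl pin update <oldRootCID> <newRootCID>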

hsanjuan commented 4 years ago

which needs to be held by every cluster member completely?

Well yes. But it would work. The other ways will likely not work. Nevertheless, it would be awesome to have a look at the numbers and at what the most compact way to add Wikipedia would be.

RubenKelevra commented 4 years ago

How about running a CRC32 over the article names/file names and using the first 4 numbers to make subfolders.

This way we chunk the data into ~65k pieces.

If we're talking about 10 TB each of those pins would end up using 160 MB.

Much better to spread around.

We probably still want to limit the concurrent pins to 1 or 2, to avoid extremely long pin downloads to cluster members.
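A small sketch of the bucketing idea (bash, using the POSIX cksum CRC as a stand-in; 4 hex digits give 16^4 = 65,536 buckets, which matches the ~160 MB per pin mentioned above):

name="Albert_Einstein"                                # example article name
crc=$(printf '%s' "$name" | cksum | cut -d' ' -f1)    # CRC as a decimal number
shard=$(printf '%04x' $(( crc % 65536 )))             # fold into 65,536 buckets, 4 hex digits
echo "/wiki/$shard/$name"                             # -> /wiki/<shard>/Albert_Einstein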

hsanjuan commented 4 years ago

How about running a CRC32 over the article names/file names and using the first 4 numbers to make subfolders.

This way we chunk the data into ~65k pieces.

If we're talking about 10 TB each of those pins would end up using 160 MB.

Much better to spread around.

We probably still want to limit the concurrent pins to 1 or 2, to avoid extremely long pin downloads to cluster members.

Yeah, that sounds better. Still, a cluster with a positive replication factor ("only replicate in X places") is easily abusable: someone can report a lot of free space, get the content allocated to them, and then perhaps not pin it at all. Unlike Filecoin, cluster/ipfs cannot ensure the content is actually present in the places that claim to have it.

If you however have a level of trust in the participants (and they are stable enough), then that problem disappears.

RubenKelevra commented 4 years ago

How about running a CRC32 over the article names/file names and using the first 4 numbers to make subfolders. This way we chunk the data into ~65k pieces. If we're talking about 10 TB each of those pins would end up using 160 MB. Much better to spread around. We probably still want to limit the concurrent pins to 1 or 2, to avoid extremely long pin downloads to cluster members.

Yeah, that sounds better. Still, a cluster with a positive replication factor ("only replicate in X places") is easily abusable: someone can report a lot of free space, get the content allocated to them, and then perhaps not pin it at all. Unlike Filecoin, cluster/ipfs cannot ensure the content is actually present in the places that claim to have it.

If you however have a level of trust in the participants (and they are stable enough), then that problem disappears.

I've got an idea for how to solve this without THAT much effort. Should I open a ticket on the ipfs-cluster repo?

Edit: I've written the idea down in a ticket. Hope you like it, @hsanjuan.

My ideas tend to be somewhat lengthy... it's hard for me to be very precise AND short in English, since it's not my native language. Hope you don't mind.

https://github.com/ipfs/ipfs-cluster/issues/1004

RubenKelevra commented 4 years ago

which needs to be held by every cluster member completely?

Well yes. But it would work. The other ways will likely not work. Nevertheless, it would be awesome to have a look at the numbers and at what the most compact way to add Wikipedia would be.

@hsanjuan

I've looked again into this topic.

IPFS can store git objects; how about using this ability?

The most compact form would be to store the raw text of the article plus the true Title plus the URL.

This way we could not only preserve the articles but also the complete history including the authors (which is in my opinion worth the additional work - to give proper credit to them).

This would also allow us to update the mirror with a minimal amount of changes since we're only adding some git objects to IPFS.


To allow for easier access, I would recommend also storing the latest version of the articles as plain text files in IPFS. This way a browser accessing the IPFS mirror would only need to understand a text resource instead of accessing a git repository.


The question comes to mind: how do we render the wiki text then? Well, we could use the abandoned JavaScript-based editor/parser from the MediaWiki team and fork it so that it runs directly in a browser.

https://github.com/wikimedia/parsoid-jsapi

This way the page would be rendered by the browser and the data would be fetched as a text file from IPFS.

This is similar to a what-you-see-is-what-you-get editor for Wikipedia, so the browser should be able to handle the conversion.


Additionally, there are media files. I'm not sure if you're aware, but there's a new JPG standard on the way, which might be very interesting for this project.

It's called JPEG-XL. JPEG-XL can store JPG files natively and compress them further without any loss in quality (since the recompression is reversible). Demo here.

But much more interesting: we could transcode the JPG/PNG/GIF files from Wikipedia to JPEG-XL in the future. This would allow us to store just one big file, but fetch only its first small parts to get smaller resolutions for embedding in the articles. It supports RAW files natively up to 32 bit per channel, alpha channels, CMYK and even crazier stuff. It can also store animations efficiently and decode them at a lower resolution from a partial download. The format can store lossy and lossless, depending on your needs, and creates much smaller files than JPG/WebM/GIF/PNG in all cases.

Since JPEG-XL is backed by Google, I expect it will land in Chromium/Chrome pretty soon.

Using JPEG-XL would reduce our storage requirements in the cluster to a degree that saving all images used in the articles in native resolution (or at least a large resolution like 4K/8K) would come within reach.

This way pictures could be clicked to enlarge them to the largest resolution we store, while we use the same file (but just the first part of it) to render the smaller size of the picture in the article.
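For what it's worth, a sketch of the lossless JPEG recompression with the reference libjxl tools (file names are placeholders; flag defaults may differ between libjxl releases):

# recompress an existing JPEG into JPEG-XL without quality loss (reversible transcode)
cjxl photo.jpg photo.jxl --lossless_jpeg=1
# the original JPEG bytes can be reconstructed from the .jxl if ever needed
djxl photo.jxl photo_restored.jpg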

All right, opinions? :)

Best regards

Ruben

Edit: forgot the link to parsoid-jsapi

lidel commented 3 years ago

Alright folks, let's revisit this now that we have updated Turkish (#60) and English is WIP.

@hsanjuan added Wikipedia to https://collab.ipfscluster.io. For now it only has the Turkish one, but when I manage to generate the latest English snapshot, the list will grow.

I suggest we solidify some policies around this. Here is my proposal:

Does this sound sensible? Anything I've missed/misrepresented?

hsanjuan commented 3 years ago

It sounds sensible. We can re-adjust policies later if we decide so.

To smooth things out, we could adopt a 1-month transition period, where both old and new versions are pinned

How much difference would there be between snapshots? It may be cheap to just leave old ones pinned for a while. As a policy, 1 month sounds good though.

lidel commented 3 years ago

How much difference would there be between snapshots?

Sadly I have no metrics on dedup of unpacked data.

lidel commented 3 years ago

Closing this, as we now have a Wikipedia section at https://collab.ipfscluster.io. If we need more, feel free to open a new issue.

@hsanjuan we recently updated en and added ar, ru and my. Do you mind updating the size on that page, so people don't get surprised by the size being way more than the currently listed ~20 GB? :pray: