ipfs / apps

Coordinating writing apps on top of ipfs, and their concerns.

Wikidata Viewer #27

Open hobofan opened 8 years ago

hobofan commented 8 years ago

Wikidata contains a vast amount of structured, semantic data that can be useful for other IPFS apps. If someone wants to create a close clone of Wikipedia, this would also be a required project, since the backbone of the current Wikipedia is built on Wikidata (related to #17).

For now I have 2 main goals for this project:

Progress so far:

Thoughts:

Kubuxu commented 8 years ago

Currently you won't be able to add 18 million files into one directory, but @Magik6k and I are currently adding https://cdnjs.com/ (about 22GB) to IPFS. The difference is that it has a directory tree; as long as you have fewer than about a thousand files or directories in any one directory, you should be good. Also, we've found a few performance bugs and are working on resolving them.

Also remember that adding files to IPFS means copying them, so you will need double the storage capacity of the dump.

Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.
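The "fewer than about a thousand entries per directory" rule of thumb above can be sketched as a path-mapping function. This is only an illustration: the `shard_path` name, the two-level layout, and the `.json` suffix are assumptions, not the scheme any project in this thread actually uses.

```python
def shard_path(entity_id: str, per_dir: int = 1000) -> str:
    """Map a Wikidata ID such as 'Q42' to a nested relative path.

    Hypothetical layout: <prefix>/<middle>/<leaf>/<id>.json
    """
    prefix, number = entity_id[0], int(entity_id[1:])
    # Each leaf directory holds at most `per_dir` entity files.
    leaf = number // per_dir
    # Each middle directory holds at most `per_dir` leaf directories.
    middle = leaf // per_dir
    return f"{prefix}/{middle}/{leaf}/{entity_id}.json"


if __name__ == "__main__":
    for example in ("Q42", "Q1234567", "Q18000000"):
        print(example, "->", shard_path(example))
```

With `per_dir=1000`, two levels cover up to a billion numeric IDs while keeping every directory at or below a thousand entries, apart from the small top-level prefix directories.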

hobofan commented 8 years ago

Also remember that adding files to IPFS means copying them, so you will need double the storage capacity of the dump.

I planned for that and attached a 150GB volume for IPFS.

Adding it to IPFS will take even longer, so I don't think that task is feasible on a C1.

Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing the dump in the second week?

Kubuxu commented 8 years ago

Does the time needed improve after an initial add? So would it drop from e.g. 12hr to 1hr when processing the dump in the second week?

Yes, it should, as it then doesn't have to save the files to disk or publish them to the DHT (for the most part, as they will be the same).

Most importantly, this has to be resolved: https://github.com/ipfs/go-ipfs/issues/2823, as you will run out of memory otherwise.

hobofan commented 8 years ago

Most importantly, this has to be resolved: ipfs/go-ipfs#2823, as you will run out of memory otherwise.

I guess that can be worked around by adding files in smaller batches and then restarting the server? (I assume this is a server bug and not a CLI bug?)

So with that workaround and some layered directory scheme, it should be possible to get it at least somewhat working? I am not in a rush since this is a weekend project, so it would even be okay for me if the initial add takes the whole week :sweat_smile:

Edit: As for performance, would there be any benefit to using master compared to 0.4.2?
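The batch-and-restart workaround proposed above could look roughly like the following sketch. The `add` and `restart` callables are hypothetical placeholders supplied by the caller; nothing here talks to IPFS directly, and whether restarting actually avoids the memory growth is exactly the open question in this comment.

```python
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[str] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def add_in_batches(
    dirs: Iterable[str],
    size: int,
    add: Callable[[str], None],
    restart: Callable[[], None],
) -> int:
    """Add directories batch by batch, restarting the daemon after each batch.

    Returns the number of batches processed.
    """
    count = 0
    for batch in batches(dirs, size):
        for directory in batch:
            add(directory)
        # Restart between batches so daemon memory use starts from scratch.
        restart()
        count += 1
    return count
```

In practice `add` might wrap something like `subprocess.run(["ipfs", "add", "-r", "-q", path], check=True)`, and `restart` would depend on how the daemon is supervised on the machine in question; both are assumptions about the setup rather than a tested recipe.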

hobofan commented 7 years ago

Had a bit of time to get back to the project. After trying to split up the files with a bash script and standard tools, I ended up writing a small Rust program that splits up the large weekly dump and places the entities in sharded directories: https://github.com/hobofan/wikidata-split
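For readers who want a feel for the splitting step, here is a minimal Python sketch. It is not a port of wikidata-split. It assumes the weekly JSON dump is a single JSON array with one entity object per line, each line ending in a comma, and the `iter_entities` name is made up for this example.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_entities(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (entity_id, compact_json) pairs from dump lines.

    Assumes one entity per line inside a top-level JSON array.
    """
    for line in lines:
        # Drop surrounding whitespace and the trailing array separator.
        line = line.strip().rstrip(",")
        if line in ("", "[", "]"):
            continue  # skip the array brackets and blank lines
        entity = json.loads(line)
        yield entity["id"], json.dumps(entity, separators=(",", ":"))


if __name__ == "__main__":
    sample = [
        "[",
        '{"id":"Q1","type":"item"},',
        '{"id":"P31","type":"property"}',
        "]",
    ]
    for entity_id, body in iter_entities(sample):
        print(entity_id, len(body))
```

Processing the dump line by line keeps memory use flat regardless of dump size; each yielded pair can then be written to whatever sharded path the layout calls for.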

I am now starting to add the entities to IPFS and think I am experiencing https://github.com/ipfs/go-ipfs/issues/2828. After adding about 50MB, the bandwidth stats show TotalIn: 7.3GB and TotalOut: 1.6GB.
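To put the figures above in proportion, a small calculation helps. The sketch below assumes decimal units (1 GB = 10^9 bytes); the `to_bytes` and `overhead` helpers are made up for this example and only understand the short size strings quoted in this comment.

```python
UNITS = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "B": 1}


def to_bytes(text: str) -> float:
    """Parse a size such as '7.3GB' into bytes (decimal units assumed)."""
    text = text.strip()
    # Check the longer suffixes before the bare 'B'.
    for unit in ("kB", "MB", "GB", "B"):
        if text.endswith(unit):
            return float(text[: -len(unit)]) * UNITS[unit]
    raise ValueError(f"unrecognised size: {text!r}")


def overhead(transferred: str, added: str) -> float:
    """Return how many times more data was transferred than added."""
    return to_bytes(transferred) / to_bytes(added)


if __name__ == "__main__":
    print("in :", round(overhead("7.3GB", "50MB")))
    print("out:", round(overhead("1.6GB", "50MB")))
```

Under that assumption, incoming traffic is on the order of 146 times the data added and outgoing traffic about 32 times, which is the kind of disproportion the linked issue is about.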

jbenet commented 7 years ago

Try with go-ipfs master. Should be down by a large factor. (Sent by email in reply to Maximilian Goisser's comment of Thu, Aug 25, 2016, 18:22.)

hobofan commented 7 years ago

As a weekend project, I learned some react.js and made a basic viewer for this project: https://github.com/hobofan/ipfs-wikidata-ui

It still has a lot of rough edges (see the open issues), but since I might not get much time over the next few weeks, I wanted to put it out there. Anybody reading this, feel free to join in! :wink:

As for progress on adding the dataset, I am at 3.89 GB / 76.56 GB, with the add process dying every ~1.5GB. I might be hitting https://github.com/ipfs/go-ipfs/issues/2823 there (see https://github.com/ipfs/go-ipfs/issues/2823#issuecomment-242989271). I should also mention that I am no longer on the Scaleway C1 mentioned in the first comment, but have switched to a 32GB quad-core i7 root server (https://www.hetzner.de/ot/hosting/produkte_rootserver/ex40).
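With the add process dying every ~1.5GB of a 76.56 GB dataset, a full run implies roughly 50 restarts, so being able to resume matters. One possible approach, sketched here under the assumption that the dataset is added one shard directory at a time, is to log each finished directory to a checkpoint file. The `resume_add` name, the checkpoint format, and the `add` callable are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def resume_add(
    dirs: Iterable[str],
    done_file: Path,
    add: Callable[[str], None],
) -> List[str]:
    """Add every directory not yet listed in `done_file`.

    Each directory is appended to `done_file` right after it is added,
    so a crashed run can be restarted without redoing finished work.
    """
    done = set(done_file.read_text().splitlines()) if done_file.exists() else set()
    added: List[str] = []
    with done_file.open("a") as log:
        for directory in dirs:
            if directory in done:
                continue
            add(directory)  # may raise; nothing is logged in that case
            log.write(directory + "\n")
            log.flush()
            added.append(directory)
    return added
```

Calling the function again with the same checkpoint file after a crash skips everything already logged and continues with the first unfinished directory.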

hobofan commented 7 years ago

The first complete publish of the dataset is finished! I am now tracking those at https://github.com/hobofan/wikidata-split/issues/2. The next step on the dataset side is to do the whole thing again with the current dump, to see how long the diff takes to publish and judge whether that's maintainable or not (and to automate as much as possible).
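Before re-publishing, it may help to estimate how much actually changed between two dumps, since unchanged entities should not need to be stored or announced again. The sketch below compares per-entity SHA-256 digests. That is a stand-in chosen for illustration and not how IPFS derives its own hashes, and the `digest_map` and `diff_dumps` names are hypothetical.

```python
import hashlib
from typing import Dict, Set, Tuple


def digest_map(entities: Dict[str, str]) -> Dict[str, str]:
    """Map each entity ID to the SHA-256 hex digest of its serialized body."""
    return {
        entity_id: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for entity_id, body in entities.items()
    }


def diff_dumps(
    old: Dict[str, str], new: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (added, removed, changed) entity IDs between two dumps."""
    old_digests, new_digests = digest_map(old), digest_map(new)
    added = set(new_digests) - set(old_digests)
    removed = set(old_digests) - set(new_digests)
    changed = {
        entity_id
        for entity_id in set(old_digests) & set(new_digests)
        if old_digests[entity_id] != new_digests[entity_id]
    }
    return added, removed, changed
```

For a full dump the digests would be computed while streaming and stored on disk rather than held in two in-memory dictionaries; the dictionaries here just keep the example short.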