gnosygnu / xowa

xowa offline wiki application

Download all original images for English Wikipedia (was "commons.wikipedia size?") #58

Closed: Ope30 closed this issue 8 years ago

Ope30 commented 8 years ago

I'm about to download commons.wikimedia.org, but I need to know the exact file size. I've heard it's apparently 22 TB; is that true? If so, I don't have the required amount of space to download it. I hope it's way less than that.

gnosygnu commented 8 years ago

Hi! XOWA doesn't download the images for commons.wikimedia.org. It only downloads the wikitext: https://dumps.wikimedia.org/commonswiki/20160601/ . This will give you a functioning wiki of commons.wikimedia.org so that you can click around from page to page. Just no images. Note that the download is still large: roughly 25 GB of wikitext plus 30 GB of categories.
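If you want to check the exact numbers yourself before committing to the download, you can add up the advertised file sizes for a dump date. Here is a rough Python sketch; it assumes the dump directory publishes a dumpstatus.json file (newer dump dates do, older ones may only have the HTML index), so treat the exact path as an assumption:

```python
# Rough sketch: add up the advertised file sizes for a Wikimedia dump date
# before downloading anything. Assumes the directory exposes dumpstatus.json;
# older dump dates may not have it.
import requests

DUMP_STATUS_URL = "https://dumps.wikimedia.org/commonswiki/20160601/dumpstatus.json"

status = requests.get(DUMP_STATUS_URL, timeout=30).json()

total_bytes = 0
for job in status.get("jobs", {}).values():          # each dump job (articles, categories, ...)
    for file_info in job.get("files", {}).values():  # each file the job produced
        total_bytes += file_info.get("size", 0)

print(f"Advertised dump size: {total_bytes / 1024 ** 3:.1f} GiB")
```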

If you really want the full commons.wikimedia.org image dump, you can download it from archive.org. See https://en.wikipedia.org/wiki/User:Emijrp/Wikipedia_Archive#Image_tarballs . Note that the actual total is approximately 34 TB.

Here are some other details:

Hope this helps! Let me know if there is anything else.

Ope30 commented 8 years ago

Thanks for your answer, I appreciate it.

By the way, is there a way to get only the images Wikipedia is using from commons.wikimedia.org, at full quality? When you click on an image on Wikipedia, you usually get about half the size of the original. In short, I just want to download the images Wikipedia is actually using from commons.wikimedia.org, not the whole site, and at the original resolution.

gnosygnu commented 8 years ago

Unfortunately, Wikimedia doesn't really provide a way. They used to have per-wiki image dumps but stopped around 2014. If you want, you can explore your.org.

I've thought about having XOWA do it, but haven't gotten around to it. The demand is low (you're the 3rd person who's asked) and it requires a lot of bandwidth.
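That said, if you only need the originals for a handful of articles, the public MediaWiki API will tell you the full-size URL of every file a page uses, and you could script the downloads yourself. Here is a rough Python sketch (the article title and User-Agent string are placeholders, and continuation for very image-heavy pages is omitted):

```python
# Hedged sketch: list the original-resolution URLs of every image used on one
# English Wikipedia article, via the public MediaWiki API. Placeholder title
# and User-Agent; API continuation is omitted for brevity.
import requests

API = "https://en.wikipedia.org/w/api.php"
HEADERS = {"User-Agent": "image-survey-sketch/0.1 (example only)"}

def original_image_urls(page_title):
    params = {
        "action": "query",
        "format": "json",
        "titles": page_title,
        "generator": "images",   # every File: page used by the article
        "gimlimit": "max",
        "prop": "imageinfo",
        "iiprop": "url|size",    # "url" is the full-size original
    }
    data = requests.get(API, params=params, headers=HEADERS, timeout=30).json()
    for page in data.get("query", {}).get("pages", {}).values():
        for info in page.get("imageinfo", []):
            yield info["url"], info.get("size", 0)

for url, size in original_image_urls("Albert Einstein"):
    print(f"{size:>12}  {url}")
```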

Otherwise, your best bet might be a project called WP-Mirror: http://www.nongnu.org/wp-mirror/ . It attempts to recreate the entire Wikipedia site on your machine. However, it's still in beta, and it takes a long time to run (I think 30+ days).

Hope this helps.

Ope30 commented 8 years ago

I think I'll leave it the way it is for now. I would rather wait until you implement this kind of feature, if you ever do. That would be so awesome, though. I mainly use your program because of its images. I just love it. I really hope you'll find a way to implement this in the future.

gnosygnu commented 8 years ago

Cool. Thanks for the feedback.

One other thing I forgot to mention: users seem to want to download less, not more. I've been told that 90 GB for English Wikipedia is already too much, so you can understand how the incentive to provide 2.4+ TB of originals diminishes.

I'm working on a new download tool for wikis. It will try to standardize the images in each database (database 1 will always hold the images with ids 1, 12, 15, 20; database 2 the images with ids 2, 14, 16, 22; etc.). If that goes well, I may create "original" databases. It would be a nice thing for me to have as well. But again, the priority will probably be on the low side, at least until I can get a more reliable file-distribution setup (file server; BitTorrent). Uploading 2.4 TB to archive.org will be very painful.
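To illustrate what I mean by "standardize" (a toy sketch only, not XOWA's actual code): the idea is that an image's id alone decides which numbered database it lands in, so every build agrees on the same layout.

```python
# Toy illustration of a standardized image-id -> database mapping: the same
# image always lands in the same numbered database, no matter which wiki build
# is being generated. The shard count and rule are made up for this sketch.
NUM_DATABASES = 8

def database_for(image_id: int) -> int:
    # Any stable function of the id works; modulo is the simplest to show.
    return (image_id % NUM_DATABASES) + 1  # databases numbered from 1

for image_id in (1, 2, 12, 14, 15, 16, 20, 22):
    print(f"image {image_id:>2} -> database {database_for(image_id)}")
```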

Hope this helps. Let me know if there's anything else. Otherwise, I'll change the subject of the ticket to "Download all original images for English Wikipedia" and label it as a "future enhancement".

Thanks.

gnosygnu commented 8 years ago

This is a future enhancement, but I'm going to mark the item closed for now. When I'm ready to start on it, I'll reopen it. Thanks.