gnosygnu / xowa

xowa offline wiki application

downloading and converting images #358

Open desb42 opened 5 years ago

desb42 commented 5 years ago

Having successfully completed the first step in generating HTML from a xowa build for dewiki (the 'wiki.mass_parse.exec' step, which took approx. 18h), I decided to complete the process by downloading the thumbnails. The main step is 'file.fsdb_make', preceded by a number of setup steps; the whole group of steps took 9h 21m. Reviewing the console log, I came across the following sequence:

downloading Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg: 05m 41s left (@ 2.595 KBps); 1
downloading Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg: 00s left (@ 636.112 KBps); 554
downloading Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg: 00s left (@ 755.497 KBps); 850
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 0 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 1 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 2 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w
converting: 2 second(s); convert "g:\xowa\file\commons.wikimedia.org\orig\4\f\9\4\Women_working_a_factory_in_David_Stempel_AG_in_1918_during_world_war_one.jpg" -coalesce -resize 0x0 "g:\xowa\file\commons.wikimedia.org\thumb\4\f\9\4\Women_w

Under circumstances I do not understand, an original image is downloaded and an attempt to convert it to 0x0 size (21 times) is made

There was one other similar sequence, for Luca_Carlevarijs_(Italian_-_Regatta_on_the_Grand_Canal_in_Honor_of_Frederick_IV,_King_of_Denmark_-_Google_Art_Project.jpg. In this case it was trying to convert to a size of 9568x5161 (the original is 9775x5273). At its original size this file is 14MB, which is not what I would call a small file.

There were also 517 failed downloads. Is there any way of telling which page(s) these came from?

gnosygnu commented 5 years ago

Under circumstances I do not understand, an original image is downloaded and an attempt to convert it to 0x0 size

Yeah, this looks like a bug.

This should be a simple change. I'll also check the databases on my side.

There were also 517 failed downloads. Is there any way of telling which page(s) these came from?

Yup. Check xowa.file.make.sqlite3 and run the following SQL:

SELECT lnki_page_id FROM lnki_temp WHERE lnki_ttl = 'YOUR_IMAGE.PNG';
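With 517 failed downloads, running that query by hand gets tedious. Below is a minimal Python sketch that batches it using the standard sqlite3 module. The table and column names (lnki_temp, lnki_page_id, lnki_ttl) come from the query above; the function name pages_for_images is made up for illustration, and the sketch assumes titles are stored in lnki_ttl exactly as they appear in the console log, which may not hold.

```python
import sqlite3


def pages_for_images(conn, titles):
    """Map each image title to the sorted, de-duplicated page ids that link to it.

    Assumes a table lnki_temp with columns lnki_page_id and lnki_ttl,
    as used in the query quoted in this thread.
    """
    result = {}
    for title in titles:
        # Parameterized query: titles often contain quotes or parentheses.
        rows = conn.execute(
            "SELECT DISTINCT lnki_page_id FROM lnki_temp WHERE lnki_ttl = ?",
            (title,),
        ).fetchall()
        result[title] = sorted(row[0] for row in rows)
    return result
```

To try it, open the database with sqlite3.connect("xowa.file.make.sqlite3") and pass in the failed titles collected from the console log. A title that maps to an empty list has no matching row under the spelling given.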

gnosygnu commented 5 years ago

So it looks like the above file (Women_working...) is the only instance of a no-op file:

SELECT * FROM lnki_temp WHERE lnki_w = 0 AND lnki_h = 0 AND lnki_upright = -1 AND lnki_time = -1 AND lnki_page = -1;

Furthermore, it exists on this page: https://de.wikipedia.org/w/index.php?title=D._Stempel which uses it as:

[[Datei:Women working a factory in David Stempel AG in 1918 during world war one.jpg|alternativtext=Frauen arbeiten in der David Stempel AG während des 1. Weltkrieges, 1918|mini|0x0px|Frauen arbeiten in der David Stempel AG während des 1. Weltkrieges, 1918]]

Apparently, MediaWiki ignores the 0x0 argument. I'm going to track down this code later. However, as the impact is pretty low, I'm bumping this down in priority.
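One possible guard for the 0x0 case, sketched in Python rather than XOWA's own Java: skip the conversion when both requested dimensions are zero, and otherwise build the same ImageMagick command line that appears in the console log. The function name build_convert_args is hypothetical. This only illustrates the idea and is not the actual fix; a request with just one zero dimension is passed through unchanged.

```python
def build_convert_args(src, dst, width, height):
    """Return the ImageMagick argument list, or None for a no-op 0x0 request.

    The argument order mirrors the 'convert ... -coalesce -resize WxH ...'
    lines in the console log; everything else here is an assumption.
    """
    if width <= 0 and height <= 0:
        # Nothing sensible to resize to, so the caller should skip this file.
        return None
    return ["convert", src, "-coalesce", "-resize", f"{width}x{height}", dst]
```

A caller would check for None before spawning the process, so a request like the 0x0px one on the D._Stempel page would not be retried at all.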

ktry commented 5 years ago

After manually updating a wiki, how does one download new images and locally create new image dumps to add to the last published image dump, e.g. Xowa_enwiki_2018-07file*.zip? I use Xowa (a simply amazing and awesome product) on a stand-alone network, so grabbing images while browsing is not an option.

gnosygnu commented 5 years ago

how does one download new images and locally create new image dumps

There really is only one way, and it is quite complicated. See http://xowa.org/home/wiki/Dev/Command-line/Dumps for the details. I can walk you through it, but it requires quite a bit of work. (I think @desb42 has managed to get through one enwiki cycle on his own.)

Ordinarily, I try to provide updated copies of English Wikipedia. I've fallen behind on my side, though, even as I keep saying that a new update is just around the corner...

I use Xowa (a simply amazing and awesome product) on a stand-alone network

Also, just want to say, thanks for the compliment!

ktry commented 5 years ago

I am simply awed by the work you put into xowa. That you have time to do any data updates is amazing. Thanks for the link. It looks quite helpful, although it will take time for me to digest it. The process is a bit surprising to me; I would have guessed that the image links would simply be in one of the wikimedia dumps.

My plan was to build a 2019-04-01 en.wikipedia.org and en.wiktionary.org from wikimedia dumps, use download central to add in the 2018-07 images, and then figure out how to manually add the image changes between 2018-07 and 2019-04 (which seems more possible now with the link you provided).

Can I expect that most of the 2019-04-01 wiki will work and look good with the 2018-07 image dumps, or would I do better to stick with your 2018-07 wiki articles dump until I can get the manual image dump process working?

gnosygnu commented 5 years ago

The process is a bit surprising to me; I would have guessed that the image links would simply be in one of the wikimedia dumps.

The image links could work, but they would download the original images, whereas most articles use thumbs. Although that's useful in and of itself, it would easily use 400-500 GB. Moreover, you'd need a way to convert the originals into thumbs for the articles.

My plan was to build a 2019-04-01 en.wikipedia.org and en.wiktionary.org from wikimedia dumps, use download central to add in the 2018-07 images, and then figure out how to manually add the image changes between 2018-07 and 2019-04 (which seems more possible now with the link you provided).

That's generally how I build my updates: take the base (2018-07) and add in the incrementals (everything up to 2019-04).

Can I expect that most of the 2019-04-01 wiki will work and look good with the 2018-07 image dumps, or would I do better to stick with your 2018-07 wiki articles dump until I can get the manual image dump process working?

The 2019-04 wiki should look OK with the 2018-07 images. About 5% - 10% of images will be missing, but I don't think it will be that noticeable.