openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

Don't retry to download if the image optimization fails #1330

Closed kelson42 closed 3 years ago

kelson42 commented 3 years ago

It looks like, if the conversion fails, the download is retried:

[log] [2020-12-11T04:21:24.988Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Second_Avenue_Subway_Map_vc.jpg/300px-Second_Avenue_Subway_Map_vc.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/306e87e6-8f4d-44e3-9ae6-be9c3c4ca265
Error! Cannot read input picture file '/dev/shm/306e87e6-8f4d-44e3-9ae6-be9c3c4ca265'

[log] [2020-12-11T04:21:24.989Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/9a1dc743-6783-4410-b3db-5513b8d0b6cf
Error! Cannot read input picture file '/dev/shm/9a1dc743-6783-4410-b3db-5513b8d0b6cf'

[log] [2020-12-11T04:21:31.931Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/876eefa9-4027-4641-a24f-01f10a53caa5
Error! Cannot read input picture file '/dev/shm/876eefa9-4027-4641-a24f-01f10a53caa5'

[log] [2020-12-11T04:21:33.042Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Second_Avenue_Subway_Map_vc.jpg/300px-Second_Avenue_Subway_Map_vc.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/531f50c0-6fbc-48ef-8543-144cb8467f2b
Error! Cannot read input picture file '/dev/shm/531f50c0-6fbc-48ef-8543-144cb8467f2b'

[log] [2020-12-11T04:21:40.101Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/6265d7df-bf88-41d3-a4a8-f181e255326c
Error! Cannot read input picture file '/dev/shm/6265d7df-bf88-41d3-a4a8-f181e255326c'

[log] [2020-12-11T04:21:44.847Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Second_Avenue_Subway_Map_vc.jpg/300px-Second_Avenue_Subway_Map_vc.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/1c43e0fc-72a9-4964-9bff-aba5d4cbb0ef
Error! Cannot read input picture file '/dev/shm/1c43e0fc-72a9-4964-9bff-aba5d4cbb0ef'

[log] [2020-12-11T04:21:50.708Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/0f274932-7924-4064-aa95-154e4b1b5a5f
Error! Cannot read input picture file '/dev/shm/0f274932-7924-4064-aa95-154e4b1b5a5f'

T:254882; A:4943000; RA:1; CA:2275010; UA:2667989; FA:0; IA:2274938; C:34785; CC:25945; UC:8840; WC:0
[log] [2020-12-11T04:22:00.197Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Second_Avenue_Subway_Map_vc.jpg/300px-Second_Avenue_Subway_Map_vc.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/a157bd72-cc3f-43cb-9f62-f2865e985364
Error! Cannot read input picture file '/dev/shm/a157bd72-cc3f-43cb-9f62-f2865e985364'

[log] [2020-12-11T04:22:11.262Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg due to Error: Unsupported color conversion request
Error! Could not process file /dev/shm/bbd3e561-72b4-476b-9d84-45828564cbbd
Error! Cannot read input picture file '/dev/shm/bbd3e561-72b4-476b-9d84-45828564cbbd'

We should avoid that. Instead, we should abandon the image and report the failure properly.
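The behaviour proposed here could be expressed as a small policy function that looks at the phase in which a file failed. This is only a sketch and not mwoffliner's actual code: `FailurePhase`, `FailureReport`, `nextAction`, and the default of three attempts are hypothetical names and values chosen for illustration.

```typescript
// Hypothetical phases in which fetching an image can fail.
type FailurePhase = "download" | "optimization";

// Hypothetical record describing one failed attempt.
interface FailureReport {
  url: string;
  phase: FailurePhase;
  attempt: number; // 1-based number of the attempt that just failed
  message: string;
}

// Decides what to do after a failed attempt.
// Assumption: an optimization failure is deterministic, so downloading
// the same bytes again cannot fix it and the image is abandoned at once.
function nextAction(
  report: FailureReport,
  maxAttempts = 3,
): "retry" | "abandon" {
  if (report.phase === "optimization") {
    return "abandon";
  }
  // Download errors may be transient, so they are retried up to a limit.
  return report.attempt < maxAttempts ? "retry" : "abandon";
}
```

With such a split, an "Unsupported color conversion request" error would be reported once per image instead of once per retry.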

tim-moody commented 3 years ago

@kelson42 I had this same problem with many such errors. It looks like the images are valid, so I'm wondering why the color conversion error occurs and whether it has to do with using --webp. See these links (one yours and one mine) for valid images:

https://upload.wikimedia.org/wikipedia/commons/thumb/5/5a/LOGO_HAEMMERLIN.jpg/550px-LOGO_HAEMMERLIN.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Parlers_de_Corse.jpg/220px-Parlers_de_Corse.jpg

[log] [2020-12-19T08:21:12.603Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Parlers_de_Corse.jpg/220px-Parlers_de_Corse.jpg due to Error: Unsupported color conversion request
Error! Could not process file /tmp/a71caca7-07e0-44e6-8152-3b24b816e877
Error! Cannot read input picture file '/tmp/a71caca7-07e0-44e6-8152-3b24b816e877'

[log] [2020-12-19T08:21:15.668Z] Progress downloading files [509380/833000] [61.2%]
[log] [2020-12-19T08:21:17.597Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Parlers_de_Corse.jpg/220px-Parlers_de_Corse.jpg due to Error: Unsupported color conversion request
Error! Could not process file /tmp/7316ef88-1e64-4493-88d6-3df2871bd846
Error! Cannot read input picture file '/tmp/7316ef88-1e64-4493-88d6-3df2871bd846'

[log] [2020-12-19T08:21:22.165Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/d/df/Parlers_de_Corse.jpg/220px-Parlers_de_Corse.jpg due to Error: Unsupported color conversion request
Error! Could not process file /tmp/82083de5-c058-4975-9c26-bf826710ce27
Error! Cannot read input picture file '/tmp/82083de5-c058-4975-9c26-bf826710ce27'
tim-moody commented 3 years ago

The resulting ZIM looks good, though. I guess the missing images are a small percentage.

tim-moody commented 3 years ago

https://wordpress.org/support/topic/unsupported-color-conversion/ could be relevant. (after 24 hours I have 1600+ of these on the 500K article EN WP)

Mike (@mixar) 1 year, 3 months ago @roselldk I’ve found the source of issue here: https://bugs.chromium.org/p/webp/issues/detail?id=311#c5

And here is a workaround for imagemagick: https://groups.google.com/a/webmproject.org/forum/#!topic/webp-discuss/MH8q_d6M1vM

cwebp uses the Linux library libjpeg for JPEG conversion, and it has a bug with images that have CMYK color profiles. The developers of ImageMagick created a workaround for converting from CMYK to RGB. So for such images we can add an additional step (convert to the RGB colorspace first) if ImageMagick or PHP Imagick is installed on the system.
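The workaround quoted above only needs to run for CMYK images, so it helps to spot them cheaply. A rough heuristic, sketched below, is to read the component count from the JPEG start-of-frame (SOF) segment: four components suggests CMYK or YCCK. The function names are hypothetical, the check does not inspect the Adobe marker or any embedded color profile, and it is an assumption that this is sufficient for the files discussed here.

```typescript
// Returns the number of colour components declared in the first
// start-of-frame (SOF) segment of a JPEG, or undefined if none is found.
function jpegComponentCount(bytes: Uint8Array): number | undefined {
  // Every JPEG starts with the SOI marker 0xFFD8.
  if (bytes.length < 4 || bytes[0] !== 0xff || bytes[1] !== 0xd8) {
    return undefined;
  }
  let i = 2;
  while (i + 3 < bytes.length) {
    if (bytes[i] !== 0xff) {
      return undefined; // lost track of the segment structure
    }
    const marker = bytes[i + 1];
    if (marker === 0xff) {
      i += 1; // fill byte before a marker
      continue;
    }
    if (marker === 0x01 || (marker >= 0xd0 && marker <= 0xd7)) {
      i += 2; // standalone markers carry no length field
      continue;
    }
    if (marker === 0xda) {
      return undefined; // start of scan reached without seeing a SOF
    }
    // SOF markers are 0xC0..0xCF except DHT (0xC4), JPG (0xC8), DAC (0xCC).
    const isSof =
      marker >= 0xc0 && marker <= 0xcf &&
      marker !== 0xc4 && marker !== 0xc8 && marker !== 0xcc;
    if (isSof) {
      // Layout after the marker: length(2) precision(1) height(2) width(2) count(1)
      const countIndex = i + 9;
      return countIndex < bytes.length ? bytes[countIndex] : undefined;
    }
    // Skip this segment: 2 marker bytes plus the declared length.
    const length = (bytes[i + 2] << 8) | bytes[i + 3];
    i += 2 + length;
  }
  return undefined;
}

// Heuristic only: four components means CMYK or YCCK.
function looksLikeCmyk(bytes: Uint8Array): boolean {
  return jpegComponentCount(bytes) === 4;
}
```

Files flagged this way could then be routed through an RGB conversion step before being handed to the WebP encoder.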

kelson42 commented 3 years ago

@tim-moody Thank you for the investigation. We have indeed two problems.

tim-moody commented 3 years ago

for further reference compare

https://en.wikipedia.org/wiki/Corsican_language

and http://iiab.me/kiwix/wikipedia_en_top_100k_maxi_2020-12/A/Corsican_language

tim-moody commented 3 years ago

We have indeed two problems.

Solving the "Unsupported color conversion request" error would also solve the download problem, would it not? If you trap the color conversion error and keep the raw JPEG or some such, can it not be handled?
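To make the idea concrete, here is a minimal sketch of trapping the conversion error and keeping the raw file. It is not how mwoffliner is implemented: `StoredImage` and `optimizeOrKeepOriginal` are hypothetical names, and the optimizer is passed in as a plain synchronous function for brevity, whereas a real one would likely be asynchronous.

```typescript
// Hypothetical result type: the bytes to store plus what happened to them.
interface StoredImage {
  data: Uint8Array;
  optimized: boolean;
  warning?: string;
}

// Tries to optimize an image; if the optimizer throws, keeps the
// original bytes and records why, so the failure is reported once.
function optimizeOrKeepOriginal(
  original: Uint8Array,
  optimize: (input: Uint8Array) => Uint8Array,
): StoredImage {
  try {
    return { data: optimize(original), optimized: true };
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    return {
      data: original,
      optimized: false,
      warning: `kept original: ${reason}`,
    };
  }
}
```

The trade-off is a slightly larger archive, since unconverted images stay in their original format.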

I guess perhaps you are resizing at source and not handling locally.

holta commented 3 years ago

and http://iiab.me/kiwix/wikipedia_en_top_100k_maxi_2020-12/A/Corsican_language

FYI the image missing from the above page (map of Corsica) was included a month earlier, within ZIM file wikipedia_en_top_maxi_2020-11.zim :

http://iiab.me/kiwix/wikipedia_en_top_maxi_2020-11/A/Corsican_language http://iiab.me/kiwix/wikipedia_en_top_maxi_2020-11/I/m/Parlers_de_Corse.jpg

(...just to confirm this is indeed a regression.)

tim-moody commented 3 years ago

At 99.9% complete, the total count for color conversion errors on the 500k en wp is 5768. There are another 500 errors of status 400 and 404. (For some reason the process never ran to 100%, even after 3 hours beyond 99.9%.)

holta commented 3 years ago

total count for color conversion errors on the 500k en wp is 5768. There are another 500 errors of status 400 and 404.

The 404s might be normal and unavoidable if Wikipedia has changed a bit in recent days.

@tim-moody is there a sample 400 error, ideally to help investigate the pattern?

(For some reason the process never ran to 100% even after 3 hours beyond 99.9%)

@kelson42 would you know if there are log files or similar ways to diagnose jobs like this that do not complete?

What details are most important to include if it's best to open a new ticket?

kelson42 commented 3 years ago

@tim-moody The only 404 scenario I'm aware of is https://github.com/openzim/mwoffliner/issues/1199. I'm not aware of anything else. Please open another ticket if you have a reproducible scenario.

@holta You have the --verbose mode. I don't really need details; I need to understand what is wrong from a user perspective, plus a clear step-by-step reproduction case.

holta commented 3 years ago

@tim-moody The only 404 scenario I'm aware of is #1199. I'm not aware of anything else. Please open another ticket if you have a reproducible scenario.

Stats might be useful too: of the more-than-500 errors, how many were 400 and how many were 404, @tim-moody can you tell?

@holta You have the --verbose mode.

Good to know, thanks.

tim-moody commented 3 years ago

On the 500k en wp:
• Code 400 errors: 431
• Code 404 errors: 107
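Counts like these can be tallied from a saved log with a few lines of code. The sketch below assumes the log uses the "Request failed with status code NNN" wording seen in this thread; `countStatusCodes` is a hypothetical helper, not part of mwoffliner.

```typescript
// Counts how often each HTTP status code appears in
// "Request failed with status code NNN" log messages.
function countStatusCodes(logText: string): Record<string, number> {
  const counts: Record<string, number> = {};
  const pattern = /Request failed with status code (\d{3})/g;
  let match: RegExpExecArray | null;
  // exec() with the g flag advances through the text on each call.
  while ((match = pattern.exec(logText)) !== null) {
    const code = match[1];
    counts[code] = (counts[code] || 0) + 1;
  }
  return counts;
}
```

Given the full log as one string, `countStatusCodes(logText)["400"]` would give the number of 400 errors; codes that never occur are simply absent from the result.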

MananJethwani commented 3 years ago

@tim-moody are the 400 and 404 errors image optimization failures? I guess they are axios fetch failures, right?

tim-moody commented 3 years ago

@MananJethwani you may be right. I'm not familiar with the internals of the code other than spot checking. When I encountered them I thought they were image related. Here are a few:

[log] [2020-12-21T11:47:47.956Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,10,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Phoebe+Putney+Memorial+Hospital&groups=_b7299ee4a2c921b7d2e37228fdadee81ef622bac due to Error: Request failed with status code 400
T:157268; A:2557000; RA:0; CA:500047; UA:2056953; FA:0; IA:499962; C:24001; CC:17144; UC:6857; WC:0
T:157327; A:2558000; RA:0; CA:500047; UA:2057953; FA:0; IA:499962; C:24004; CC:17144; UC:6860; WC:0
[log] [2020-12-21T11:51:40.531Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,11,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Floirac%2C+Lot&groups=_ba31b7fa2968a3c704d521ca4b74c8f6827fdf7e due to Error: Request failed with status code 400
T:157386; A:2559000; RA:0; CA:500047; UA:2058953; FA:0; IA:499962; C:24007; CC:17144; UC:6863; WC:0
[log] [2020-12-21T11:52:40.860Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,12,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Cr%C3%A9zan%C3%A7ay-sur-Cher&groups=_d6cc571d9c27738b4e1f0d58b5051eda64d5ceaa due to Error: Request failed with status code 400
[log] [2020-12-21T11:53:50.133Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Bulgaria/0/a2c6f586447aae8a5b4172965acff000f9145eb0.png due to Error: Request failed with status code 400
[log] [2020-12-21T11:53:50.528Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,11,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Baldissero+d%27Alba&groups=_0bbb6a25f07d76057aad0b33b7fcb69f3cf3ac7e due to Error: Request failed with status code 400
[log] [2020-12-21T11:54:01.639Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Singapore/0/5293e39459bd57bca5024bd4503493c13a22e544.png due to Error: Request failed with status code 400
Error! Could not process file /tmp/23deba02-fa4c-4007-9e32-a0909802fb0a
Error! Cannot read input picture file '/tmp/23deba02-fa4c-4007-9e32-a0909802fb0a'
T:157679; A:2564000; RA:0; CA:500047; UA:2063953; FA:0; IA:499962; C:24022; CC:17144; UC:6878; WC:0
[log] [2020-12-21T12:01:19.823Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_the_Faroe_Islands/0/4efd31a17f8683e506adbf407a98a110f226d095.png due to Error: Request failed with status code 400
[log] [2020-12-21T12:04:24.586Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Bulgaria/0/408a3b97119f146fdf584c4e7d67bcc5d106d652.png due to Error: Request failed with status code 400
T:158266; A:2574000; RA:0; CA:500047; UA:2073953; FA:0; IA:499962; C:24055; CC:17144; UC:6911; WC:0
[log] [2020-12-21T12:07:51.792Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_South_Korea/0/314660194699767160f607e9eb5079106c8b2bd0.png due to Error: Request failed with status code 400
[log] [2020-12-21T12:08:40.436Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,11,40.4,-3.65,500x440.png?lang=en&domain=en.wikipedia.org&title=Madrid+Metro&groups=_4428c8eccc3ce06f3fdc9ac9a9d4b908a0d68ac4 due to Error: Request failed with status code 400
T:158856; A:2584000; RA:0; CA:500047; UA:2083953; FA:0; IA:499962; C:24089; CC:17144; UC:6945; WC:0
[log] [2020-12-21T12:22:41.856Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Argentina/0/55b972182fbd008d854ed6b074dd3492c4e3c53c.png due to Error: Request failed with status code 400
Error! Could not process file /tmp/226eb93d-c2a0-400d-ad4b-3f06c25ef549
Error! Cannot read input picture file '/tmp/226eb93d-c2a0-400d-ad4b-3f06c25ef549'
[log] [2020-12-21T12:24:51.693Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,11,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Bougon&groups=_191b326807a22e00c3085ce578d8aa2249c35f3b due to Error: Request failed with status code 400
[log] [2020-12-21T12:26:43.359Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Asturias/0/891f12bea4e1c4f267d8c029863ed189cd744c38.png due to Error: Request failed with status code 400
T:159444; A:2594000; RA:0; CA:500047; UA:2093953; FA:0; IA:499962; C:24120; CC:17144; UC:6976; WC:0
T:160035; A:2604000; RA:0; CA:500047; UA:2103953; FA:0; IA:499962; C:24154; CC:17144; UC:7010; WC:0
[log] [2020-12-21T12:42:27.328Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,6,50,-98.3,290x200.png?lang=en&domain=en.wikipedia.org&title=Manitoba+Highway+1&groups=_64323a0701da390c3992ff42fa80da9ce01291f6 due to Error: Request failed with status code 400
[log] [2020-12-21T12:44:47.909Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,7,23.54,112.16,300x200.png?lang=en&domain=en.wikipedia.org&title=Zhaoqing&groups=_1ab789b754e2ccae778059f2cadfbe2cb316ecae due to Error: Request failed with status code 400
T:160623; A:2614000; RA:0; CA:500049; UA:2113951; FA:0; IA:499962; C:24186; CC:17144; UC:7042; WC:0
[log] [2020-12-21T12:48:21.369Z] Not able to download content for https://en.wikipedia.org/api/rest_v1/page/graph/png/COVID-19_pandemic_in_Croatia/0/109596f09e8a7e41d75da6dc41c7e6ecbcc4cf6f.png due to Error: Request failed with status code 400
[log] [2020-12-21T12:49:12.619Z] Not able to download content for https://maps.wikimedia.org/img/osm-intl,11,a,a,270x200.png?lang=en&domain=en.wikipedia.org&title=Seebach%2C+Bas-Rhin&groups=_2dfabbb69916cd357100553f8c56572d5786206b due to Error: Request failed with status code 400
tim-moody commented 3 years ago

and some 404s

[log] [2020-12-21T08:23:53.254Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/Jodie_Foster_1991.jpg/170px-Jodie_Foster_1991.jpg due to Error: Request failed with status code 404
[log] [2020-12-21T08:28:49.064Z] Not able to download content for https://upload.wikimedia.org/wikipedia/en/thumb/6/6b/WHSmith_logo.svg/220px-WHSmith_logo.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T08:41:16.452Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/0/02/COVID19_in_Turkey_-_Cumulative_positive_cases_per_100k_residents.svg/220px-COVID19_in_Turkey_-_Cumulative_positive_cases_per_100k_residents.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T08:54:06.739Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/a/a3/TRS-Party-Symbol-CAR1.jpg/150px-TRS-Party-Symbol-CAR1.jpg due to Error: Request failed with status code 404
[log] [2020-12-21T09:27:17.383Z] Not able to download content for https://upload.wikimedia.org/wikipedia/en/thumb/6/6e/Super_Paper_Mario_Gameplay.jpeg/200px-Super_Paper_Mario_Gameplay.jpeg due to Error: Request failed with status code 404
[log] [2020-12-21T09:34:22.826Z] Not able to download content for https://upload.wikimedia.org/wikipedia/en/thumb/4/4c/Winona_State_University_wordmark.svg/220px-Winona_State_University_wordmark.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T09:48:20.056Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/M04-UA.svg/42px-M04-UA.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T10:24:24.700Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/c/c1/Marcus_Davenport.jpg/120px-Marcus_Davenport.jpg due to Error: Request failed with status code 404
[log] [2020-12-21T10:34:56.499Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/d/d0/Cataluna_in_Spain_%28plus_Canarias%29.svg/250px-Cataluna_in_Spain_%28plus_Canarias%29.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T10:43:21.030Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/M06-UA.svg/42px-M06-UA.svg.png due to Error: Request failed with status code 404
[log] [2020-12-21T10:51:42.559Z] Not able to download content for https://upload.wikimedia.org/wikipedia/en/a/ae/WKRCLocal12.png due to Error: Request failed with status code 404
[log] [2020-12-21T11:01:15.837Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/5/55/..Kerala_Flag%28INDIA%29.png/30px-..Kerala_Flag%28INDIA%29.png due to Error: Request failed with status code 404
[log] [2020-12-21T11:31:11.255Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/0/0c/SeletarAirport.webp/220px-SeletarAirport.webp.png due to Error: Request failed with status code 404
[log] [2020-12-21T11:51:50.577Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/LTGen_Erickson_Gloria.jpg/110px-LTGen_Erickson_Gloria.jpg due to Error: Request failed with status code 404
[log] [2020-12-21T11:54:46.382Z] Not able to download content for https://upload.wikimedia.org/wikipedia/commons/thumb/f/f5/1992_Ponitac_Firefly.jpg/220px-1992_Ponitac_Firefly.jpg due to Error: Request failed with status code 404