openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

[REGRESSION] Crash by creating WPEN "History" selection (with 1.8.4) #697

Closed kelson42 closed 5 years ago

kelson42 commented 5 years ago
mwoffliner --verbose --mwUrl="https://en.wikipedia.org/" --adminEmail=kelson@kiwix.org --customZimFavicon="https://upload.wikimedia.org/wikipedia/commons/a/af/P_history.png" --customZimTitle="History by Wikipedia" --customZimDescription="Wikipedia articles dedicated to History" --articleList=https://download.kiwix.org/wp1/enwiki/projects/History --format=nopic --format=novid

gives

[info] [2019-04-24T10:52:39.363Z] Getting JSON from [https://en.wikipedia.org/w/api.php?titles=1391_Yellow_River_flood%7CTito_Livio_Frulovisi%7CIelidassen%7CAction_at_Kalmas%7C1870_in_Uruguay%7C1375_Yellow_River_flood%7CSeaman's_chest%7CRepublican_Building_(Jinan)%7CList_of_environmental_history_topics%7CHertfordshire_Association_for_Local_History%7CGerman_tariff_of_1925%7CFGCI%7CAnariacae%7C1441_Yangtze_flood%7CMedieval_Polish_Alliances%7CDongyi_Protectorate%7C1871_in_Uruguay%7CDevonshire_Declaration%7C1390_Yellow_River_flood%7CRochefoucauld_Grail%7CMedieval_Scenarios_and_Recreations%7CCognitive_Tempo%7CBattle_of_Verbia%7C1453_Yellow_River_flood%7C1448_Yellow_River_flood%7CTheodora_Angelina%7CLecheor%7CGamsansa%7CAravelian%7CAn_Universal_Biographical_and_Historical_Dictionary%7C1416_Yellow_River_flood%7C1410_Yellow_River_flood%7CYing_Baoshi%7CMah_farvardin_Ruz_khordad%7CHistory_of_Sonora%7CFrankish_towers_of_Greece%7CExpedition_of_Ghalib_ibn_Abdullah_al-Laithi_(Al-Kadid)%7CZhang_Nanyang%7CWeissman_Preservation_Center%7CVermilion_Pencil%7CHistory_of_A.C.R._Messina%7CFaculty_and_alumni_of_the_University_of_Constantinople%7CCentrosibir%7CArcharuni%7C1384_Yellow_River_flood%7CZhang_Yuming%7CStudi_sul_Settecento_Romano%7CSaint_Peter's_Abbey_on_the_Madron%7COV_Gallery%7CEruandhuni&prop=redirects%7Ccoordinates%7Crevisions%7Cpageimages&action=query&format=json&rdlimit=max&colimit=max]
[log] [2019-04-24T10:52:39.425Z] Worker [4] getting article range [31850-31900] of [31971] [99%]
[info] [2019-04-24T10:52:39.425Z] Getting JSON from [https://en.wikipedia.org/w/api.php?titles=Zhuang_Jia_(Qi)%7CSakurabora_Castle%7CPau_de_Bellviure%7CBarhadbshabba_Arbaya%7CTimeline_of_Rival_Political_Parties%7C1688_Revolution%7CGet_Well_Soon%3A_History's_Worst_Plagues_and_the_Heroes_Who_Fought_Them%7CDrina_(%C5%BEupa)%7CSindanminsa%7CH._Louis_Nichols%7CRepublican_Era%7CList_of_colonial_governors_in_1752%7CList_of_Navajo_Nation_Chapters%7CJacobs'_Inn%7CBackstugusittare%7CMelchior_de_Gualbes%7C1980_in_Namibia%7CSchauenburg_Castle%7CSanto_Stefano%2C_Belluno%7CPrior_of_the_Caporioni%7CDissidents_in_the_1989_Tiananmen_Square_protests%7CVlado_(kaznac)%7CSamuel_Eliot_Morison_bibliography%7CSchleifer%7CTimeline_of_London_Weekend_Television%7COodians%7CList_of_medieval_Gaue%7CList_of_History's_Lost_%26_Found_episodes%7CFamily_of_Verona%7CChrysler_Charger_III%7CAncient_Africa%7C1868_in_Uruguay%7CMithrenes_II%7CLeach_(food)%7CTreaty_of_Andernach_(1059)%7CM%C3%A1el_Brigte_of_Perth%7CMary_Ann_Neeley%7CGuan_Yunchang%7CBerkshire_Conference_of_Women_Historians_Book_Prize%7CMedieval_fashion%7CMichael_Apokapes%7CGothic_writing%7CEel_Pie_Island_Museum%7CSeek_for_Surname_History%7CList_of_Ottoman_domes%7CUppland_Runic_Inscription_15%7CSeparatio_Leprosorum%7CBook_of_Tang_(disambiguation)%7CThomas_Copeland_(headmaster)%7CProfessionalization_and_institutionalization_of_history&prop=redirects%7Ccoordinates%7Crevisions%7Cpageimages&action=query&format=json&rdlimit=max&colimit=max]
[log] [2019-04-24T10:52:39.583Z] Worker [6] getting article range [31900-31950] of [31971] [99%]
[info] [2019-04-24T10:52:39.583Z] Getting JSON from [https://en.wikipedia.org/w/api.php?titles=History_of_Changsha%7CTimeline_of_Tegucigalpa%7CList_of_historic_places_in_New_Jersey%7C1897_in_Uruguay%7CLes_sept_%C3%A2ges_du_monde%7CJianzhuke_Shu%7CC.%7CAdministrative_divisions_of_medieval_Serbia%7CFrithuwold%7CTimeline_of_the_George_H._W._Bush_presidency_(1992)%7CTimeline_of_Uppsala%7CBattle_of_Suchodo%C5%82%7CTyniec_Sacramentarium%7CMorale_scolarium%7CHistory_of_football_in_Cape_Verde%7CJean_de_Nivelle_(1422-1477)%7CMute_Rebellion%7CTazarene%7CDieulacres_Chronicle%7CMankby%7CHenry_II_K%C5%91szegi%7CConvention_of_Mat%7CTrot_(lai)%7CMark_Gjini%7CThe_Classic_of_the_Plough%7CRole_of_Nantes_in_the_slave_trade%7CList_of_defunct_airlines_of_Ivory_Coast%7CLeabhar_Donn%7CRomuleon%7CCounts_of_Woldenberg%7CTimeline_of_Thames_Television%7CElizabeth_Shelford%7CMuhammad_ibn_al-Ash'ath_al-Kindi%7CBritain's_Bourse%7CGuillaume_Chartier_(theologian)%7CVodka_protests_of_1858%E2%80%931859%7CFreemen's_pennies%7CTimeline_of_Scottish_Television%7CMarturina%7CHercules_Magusanus%7CThe_Berkeley_Treatise%7CTimeline_of_ATV%7CPierre_de_l'Argenti%C3%A8re%7CTicao_stone_inscription%7CTimeline_of_Carlton_Television%7CJournal_of_the_History_of_Collections%7CGast%C3%B3n_Antonio_Zapata_Velasco%7CDame_Siri%C3%BE%7CHistory_of_Eastern_Germany%7CImperial_Twilight&prop=redirects%7Ccoordinates%7Crevisions%7Cpageimages&action=query&format=json&rdlimit=max&colimit=max]
[log] [2019-04-24T10:52:39.607Z] Worker [8] getting article range [31950-32000] of [31971] [100%]
[info] [2019-04-24T10:52:39.607Z] Getting JSON from [https://en.wikipedia.org/w/api.php?titles=Timeline_of_Granada_Television%7CManifesto_of_the_Province_of_Flanders%7CMuseum_of_Local_History%7CTimeline_of_Yorkshire_Television%7CJohn_K%C5%91szegi%7CTimeline_of_Central_Independent_Television%7CTimeline_of_Anglia_Television%7CTimeline_of_Southern_Television%7CTimeline_of_HTV_West%7CTimeline_of_Tyne_Tees_Television%7CTimeline_of_ITV_in_Wales%7CTimeline_of_TVS%7CTimeline_of_Border_Television%7CTimeline_of_TSW%7CTimeline_of_Ulster_Television%7CTimeline_of_Channel_Television%7CEklakhi_Mausoleum%7CAlison_Dunhill%7CTimeline_of_Westcountry_Television%7CTimeline_of_Meridian_Broadcasting%7CTimeline_of_Grampian_Television&prop=redirects%7Ccoordinates%7Crevisions%7Cpageimages&action=query&format=json&rdlimit=max&colimit=max]
[log] [2019-04-24T10:52:39.887Z] Doing dump
[log] [2019-04-24T10:52:39.888Z] Writing zim to [/dev/shm/mwoffliner/out/wikipedia_en_history_nopic_2019-04.zim]
[info] [2019-04-24T10:52:39.897Z] Copying Static Resource Files
[info] [2019-04-24T10:52:39.901Z] Finding stylesheets to download
[info] [2019-04-24T10:52:39.901Z] Downloading [https://en.wikipedia.org/wiki/]
[log] [2019-04-24T10:52:40.066Z] Found [4] stylesheets to download
[log] [2019-04-24T10:52:40.066Z] Downloading stylesheets and populating media queue
[info] [2019-04-24T10:52:40.067Z] Downloading CSS from http://en.wikipedia.org/w/load.php?lang=en&modules=ext.3d.styles|ext.uls.interlanguage|ext.visualEditor.desktopArticleTarget.noscript|ext.wikimediaBadges|mediawiki.legacy.commonPrint%2Cshared|mediawiki.skinning.interface|skins.vector.styles&only=styles&skin=vector
[info] [2019-04-24T10:52:40.067Z] Downloading CSS from http://en.wikipedia.org/w/load.php?lang=en&modules=ext.gadget.charinsert-styles&only=styles&skin=vector
[info] [2019-04-24T10:52:40.067Z] Downloading CSS from http://en.wikipedia.org/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector
[info] [2019-04-24T10:52:40.067Z] Downloading CSS from https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw
[info] [2019-04-24T10:52:40.067Z] Downloading [http://en.wikipedia.org/w/load.php?lang=en&modules=ext.3d.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cskins.vector.styles&only=styles&skin=vector]
[info] [2019-04-24T10:52:40.067Z] Downloading [http://en.wikipedia.org/w/load.php?lang=en&modules=ext.gadget.charinsert-styles&only=styles&skin=vector]
[info] [2019-04-24T10:52:40.067Z] Downloading [http://en.wikipedia.org/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector]
[info] [2019-04-24T10:52:40.068Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[info] [2019-04-24T10:52:40.371Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[info] [2019-04-24T10:52:40.766Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[info] [2019-04-24T10:52:41.353Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[info] [2019-04-24T10:52:42.352Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[info] [2019-04-24T10:52:44.161Z] Downloading [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw]
[warn] [2019-04-24T10:52:44.362Z] Failed to get [https://en.wikipedia.org/wiki/Mediawiki:offline.css?action=raw] [5] times
[log] [2019-04-24T10:52:44.363Z] Downloaded stylesheets
[log] [2019-04-24T10:52:44.364Z] Getting Favicon
[log] [2019-04-24T10:52:44.365Z] Saving favicon.png...
Failed to run mwoffliner after [53s]: {
    "stack": "URIError: URI malformed\n    at decodeURIComponent (<anonymous>)\n    at Object.getMediaBase (/usr/local/lib/node_modules/mwoffliner/lib/util/misc.js:345:20)\n    at /usr/local/lib/node_modules/mwoffliner/lib/mwoffliner.lib.js:148:65\n    at step (/usr/local/lib/node_modules/mwoffliner/lib/mwoffliner.lib.js:35:23)\n    at Object.next (/usr/local/lib/node_modules/mwoffliner/lib/mwoffliner.lib.js:16:53)\n    at fulfilled (/usr/local/lib/node_modules/mwoffliner/lib/mwoffliner.lib.js:7:58)\n    at process._tickCallback (internal/process/next_tick.js:68:7)",
    "message": "URI malformed"
}

**********

URI malformed

**********

[log] [2019-04-24T10:52:44.521Z] Exiting with code [2]
[log] [2019-04-24T10:52:44.522Z] Deleting tmp dump dir [/tmp/mwo-dump-1556103111872]
[log] [2019-04-24T10:52:44.524Z] Flushing REDIS DBs

ISNIT0 commented 5 years ago

I can't reproduce this 😢 I've tried the exact command with 1.8.4 on both my local machine and a VPS. MWO continues past the favicon stage and starts scraping articles.

Running with --verbose might show something helpful (such as the temporary file the article should be downloaded to).

kelson42 commented 5 years ago

@ISNIT0 This is verbose mode, obviously, as non-verbose mode should not display log entries. Please make the logging work in a way that makes it clear what happens here.

ISNIT0 commented 5 years ago

I'm now able to reproduce this. It's specific to the thumbnail URL formatting in a few of the articles (including Germanic-Roman_contacts).

I've got a fix and will create a PR with it included tomorrow.
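
For anyone hitting the same error: the stack trace above shows `decodeURIComponent` throwing `URIError: URI malformed` inside `getMediaBase`. The actual fix is not shown in this thread, so the following is only a minimal sketch of the failure mode and one possible guard. The helper name `safeDecodeURIComponent` and the sample string `100%_example.png` are hypothetical and are not taken from mwoffliner or from the affected articles.

```javascript
// Hypothetical helper (not mwoffliner's code): decode a URL component,
// falling back to the raw input when the percent-encoding is malformed.
function safeDecodeURIComponent(value) {
  try {
    return decodeURIComponent(value);
  } catch (err) {
    // decodeURIComponent throws URIError on invalid escape sequences,
    // e.g. a "%" that is not followed by two hexadecimal digits.
    if (err instanceof URIError) {
      return value;
    }
    // Anything else is unexpected, so let it propagate.
    throw err;
  }
}

// Well-formed input (a title taken from the log above) decodes normally.
console.log(safeDecodeURIComponent("Les_sept_%C3%A2ges_du_monde"));

// Made-up malformed input: "%_e" is not a valid escape sequence, so the
// plain decodeURIComponent call would throw; the helper returns it as-is.
console.log(safeDecodeURIComponent("100%_example.png"));
```

Whether falling back to the raw string is the right behaviour depends on how the result is used afterwards; logging the offending URL before falling back would also make this kind of crash much easier to diagnose.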