openzim / mwoffliner

Mediawiki scraper: all your wiki articles in one highly compressed ZIM file
https://www.npmjs.com/package/mwoffliner
GNU General Public License v3.0

wiktionary_en output stops for ~ 15 min. then continues again #1042

danielboven closed this issue 4 years ago

danielboven commented 4 years ago

I have been running mwoffliner for around 5 days now in order to scrape the English Wiktionary. I noticed that, probably for the past day or so, the output has decreased massively. The process scrapes for a minute or so (and logs to the terminal), then stalls for around 15 minutes. After the message Heartbeat - OK it resumes and starts producing output again. The resource manager confirms this: during normal scraping, CPU usage peaks at 30%, but when the output stops the process drops to 0% CPU usage.

This is an example of the log I'm getting:

[log] [2020-03-07T09:39:11.230Z] Worker [7] Processing batch of article ids [["zepto-moles","+9 more +","shinrimei"]]
[log] [2020-03-07T09:39:15.721Z] Worker [7] Processing batch of article ids [["αιφνιδιασμών","+8 more +","sarkanbrūniem"]]
[log] [2020-03-07T09:39:18.982Z] Worker [7] Processing batch of article ids [["أذيل","+8 more +","commisceant"]]
[log] [2020-03-07T09:54:02.678Z] Heartbeat - OK
[log] [2020-03-07T09:54:02.678Z] Worker [5] Processing batch of article ids [["dekokjarajn","+8 more +","assistiram"]]
[log] [2020-03-07T09:54:06.223Z] Worker [5] Processing batch of article ids [["ovillareis","+8 more +","palaeogenomic"]]
[log] [2020-03-07T09:54:06.628Z] Worker [9] Processing batch of article ids [["devitalisation","+10 more +","sandsynlighedstætheders"]]
[log] [2020-03-07T09:54:07.207Z] Worker [4] Processing batch of article ids [["deludenti","+9 more +","seghe_a_nastro"]]
[log] [2020-03-07T09:54:08.216Z] Worker [7] Processing batch of article ids [["ניצל","+8 more +","trinasteis"]]
[log] [2020-03-07T09:54:09.757Z] Worker [5] Processing batch of article ids [["bråare","+11 more +","parachèves"]]
[log] [2020-03-07T09:54:10.843Z] Worker [9] Processing batch of article ids [["carnaciones","+8 more +","conscelerato"]]
[log] [2020-03-07T09:54:11.629Z] Worker [4] Processing batch of article ids [["obseravissent","+8 more +","sobrecogiere"]]
[log] [2020-03-07T09:54:12.230Z] Worker [7] Processing batch of article ids [["community_cats","+8 more +","pascitastis"]]
[log] [2020-03-07T09:54:14.672Z] Worker [9] Processing batch of article ids [["гиревику","+8 more +","unstories"]]
[log] [2020-03-07T09:54:14.985Z] Worker [4] Processing batch of article ids [["физиономиею","+9 more +","Бурманка"]]
[log] [2020-03-07T09:54:15.002Z] Worker [5] Processing batch of article ids [["sublego","+8 more +","コンテ"]]
[log] [2020-03-07T09:54:16.965Z] Worker [7] Processing batch of article ids [["keratotomes","+8 more +","pintautua"]]
[log] [2020-03-07T09:54:18.849Z] Worker [4] Processing batch of article ids [["մոտակայություն","+9 more +","minitransposons"]]
[log] [2020-03-07T09:54:19.178Z] Worker [9] Processing batch of article ids [["abavunculus","+8 more +","rajtigontan"]]
[log] [2020-03-07T09:54:19.300Z] Worker [5] Processing batch of article ids [["alegrarás","+8 more +","superordinations"]]
[log] [2020-03-07T09:54:20.806Z] Worker [7] Processing batch of article ids [["hašemiitti","+8 more +","orðuna"]]
[log] [2020-03-07T09:54:22.874Z] Worker [4] Processing batch of article ids [["Τανάγρας","+9 more +","back-water"]]
[log] [2020-03-07T09:54:23.131Z] Worker [5] Processing batch of article ids [["unitoj","+9 more +","молитву"]]
[log] [2020-03-07T09:54:23.557Z] Worker [9] Processing batch of article ids [["confédérâtes","+8 more +","yôons"]]
[log] [2020-03-07T09:54:24.194Z] Worker [7] Processing batch of article ids [["viscaurredzamāko","+8 more +","Tia_Maria"]]
[log] [2020-03-07T09:54:26.872Z] Worker [4] Processing batch of article ids [["prevenirci","+9 more +","denied"]]
[log] [2020-03-07T09:54:27.154Z] Worker [5] Processing batch of article ids [["ausbautet","+8 more +","canguro_robusto"]]
[log] [2020-03-07T09:54:27.288Z] Worker [9] Processing batch of article ids [["word-salads","+8 more +","appaiserons"]]
[log] [2020-03-07T09:54:29.238Z] Worker [7] Processing batch of article ids [["acogotarían","+8 more +","reconstruías"]]
[log] [2020-03-07T09:54:30.818Z] Worker [5] Processing batch of article ids [["pörino","+8 more +","cheeser"]]
[log] [2020-03-07T09:54:30.890Z] Worker [4] Processing batch of article ids [["jointoierions","+8 more +","fǒuzé"]]
[log] [2020-03-07T09:54:32.196Z] Worker [9] Processing batch of article ids [["барих","+8 more +","kanarazu"]]
[log] [2020-03-07T09:54:32.798Z] Worker [7] Processing batch of article ids [["Pan-African","+9 more +","приспичит"]]
[log] [2020-03-07T09:54:36.237Z] Worker [9] Processing batch of article ids [["actinothérapie","+8 more +","rimpaginarono"]]
[log] [2020-03-07T09:54:36.302Z] Worker [4] Processing batch of article ids [["ytterbín","+8 more +","transatlantiche"]]
[log] [2020-03-07T09:54:36.868Z] Worker [5] Processing batch of article ids [["bifahan","+11 more +","opdagelse"]]
[log] [2020-03-07T09:54:39.825Z] Worker [7] Processing batch of article ids [["lastimaram","+10 more +","libatae"]]
[log] [2020-03-07T09:54:40.521Z] Worker [9] Processing batch of article ids [["körötök","+10 more +","uglifier"]]
[log] [2020-03-07T09:54:40.779Z] Worker [4] Processing batch of article ids [["attardarti","+9 more +","billete_verde"]]
[log] [2020-03-07T09:54:44.120Z] Worker [7] Processing batch of article ids [["Saugwurms","+9 more +","gorgonopsian"]]
[log] [2020-03-07T09:54:44.602Z] Worker [9] Processing batch of article ids [["adludebas","+9 more +","ceoldráma"]]
[log] [2020-03-07T09:54:48.404Z] Worker [7] Processing batch of article ids [["chapardas","+8 more +","automejoraron"]]
[log] [2020-03-07T09:54:48.604Z] Worker [9] Processing batch of article ids [["patrouillaient","+8 more +","феминисткою"]]
[log] [2020-03-07T10:10:07.156Z] Heartbeat - OK
[log] [2020-03-07T10:10:07.157Z] Worker [5] Processing batch of article ids [["defendunt","+8 more +","охранкою"]]
[log] [2020-03-07T10:10:08.619Z] Worker [4] Processing batch of article ids [["higenamine","+8 more +","ajudan"]]
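
The pauses in a log like this can be located automatically rather than by eye. The following is a minimal sketch, not part of mwoffliner: it assumes every relevant line carries a timestamp in the bracketed ISO format shown above, and the function name find_gaps and the 5-minute threshold are hypothetical choices made for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp in lines such as
# "[log] [2020-03-07T09:54:02.678Z] Heartbeat - OK"
TIMESTAMP_RE = re.compile(r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3})Z\]")


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (previous_line, next_line, gap) for each pause longer than threshold.

    find_gaps and the default threshold are illustrative, not an mwoffliner API.
    """
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        match = TIMESTAMP_RE.search(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%f")
        if prev_time is not None and current - prev_time > threshold:
            gaps.append((prev_line, line, current - prev_time))
        prev_time, prev_line = current, line
    return gaps


# Three lines taken from the log excerpt above.
sample = [
    '[log] [2020-03-07T09:39:18.982Z] Worker [7] Processing batch of article ids [["أذيل","+8 more +","commisceant"]]',
    '[log] [2020-03-07T09:54:02.678Z] Heartbeat - OK',
    '[log] [2020-03-07T09:54:02.678Z] Worker [5] Processing batch of article ids [["dekokjarajn","+8 more +","assistiram"]]',
]

for before, after, gap in find_gaps(sample):
    print(f"{gap} pause, ending at: {after}")
```

On a full run, the terminal output would first need to be saved to a file and passed in line by line; listing the line that ends each pause shows whether it is always the Heartbeat message, as in the excerpt.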

To start, I used the command mwoffliner --mwUrl=https://en.wiktionary.org/ --adminEmail=x@outlook.com --outputDirectory=/mnt/sda1/dump/zim/02-03-2020/en.wiktionary.org on an Ubuntu Server 18.04 LTS machine. I have 16 GB of memory, by the way.

This is probably the third time this has happened to me when trying to scrape Wiktionary, so restarting doesn't help. What could be the cause of this?

LakmaNeha commented 4 years ago

It isn't necessarily lag; it might be downloading the dependencies and the content. Try running with the verbose flag to get some extra insight into what's going on.

danielboven commented 4 years ago

@LakmaNeha I think you're right, it seems to be doing something else in the background. I haven't started it with verbose (yet), but it has finished successfully now. Thanks for your explanation ;)