Closed: kelson42 closed this issue 4 years ago
I tried running mwoffliner --mwUrl=https://gbf.wiki/ --mwWikiPath="" --localParsoid --resume --mwApiPath=api.php --keepHtml --adminEmail=gbf@ahxxm.com --speed=2 --verbose 2>&1 | tee gbf-wiki.log
and see the same error log.
But the data seems to be downloaded and processed: I can see HTML files under tmp/wiki_en/, and I can generate a ZIM file:
root@scw-435f21:~/gbf-wiki/tmp/wiki_en_all_2018-09# zimwriterfs --welcome=index.htm --favicon=favicon.png --language=eng --verbose --tags="gbf" --name="kiwix.wiki_en_all" --redirects="/root/gbf-wiki/cac/wiki_en/wiki_en_all.redirects" --title="Granblue Fantasy Wiki" --description="Granblue Fantasy Wiki" --creator="Wiki" --publisher="Kiwix" "/root/gbf-wiki/tmp/wiki_en_all_2018-09/" "/root/gbf-wiki/out/wiki_en_all_2018-09.zim"
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09/j
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09/j/js_modules
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09/m
A:1000; CA:321; UA:679; FA:687; IA:304; C:13; CC:5; UC:8
A:2000; CA:321; UA:1679; FA:1687; IA:304; C:27; CC:5; UC:22
A:3000; CA:321; UA:2679; FA:2687; IA:304; C:40; CC:5; UC:35
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09/s
Visiting directory /root/gbf-wiki/tmp/wiki_en_all_2018-09/s/css_modules
Reading redirects TSV file /root/gbf-wiki/cac/wiki_en/wiki_en_all.redirects...
A:4000; CA:1301; UA:2699; FA:2707; IA:472; C:42; CC:7; UC:35
A:4381; CA:1682; UA:2699; FA:2707; IA:472; C:42; CC:7; UC:35
sort 4381 directory entries (aid)
remove invalid redirects from 4381 directory entries
0/4381 directory entries checked for invalid redirects
...
root@scw-435f21:~/gbf-wiki# du -sh
902M .
root@scw-435f21:~/gbf-wiki/out# ls
wiki_en_all_2018-09.zim
root@scw-435f21:~/gbf-wiki/out# du -sh
71M
I'm still resuming the current crawling session (something took 90000+ ms to process and triggered an exit), so correctness cannot be checked yet.
The resulting ZIM has basic functionality, but the CSS seems to be broken or stripped, for example on https://gbf.wiki/Rosetta_(Grand):
Other issues:
{input}.* ? I'd like to see Rosetta_(Grand) appear in search results when using the keyword grand
... These same errors seem to appear when using Parsoid directly. I don't see an obvious way to fix them from MWOffliner (I've yet to look at the broken TOC links)
@kelson42 I'd appreciate some guidance here.
MWOffliner makes a request to https://gbf.wiki/api.php?action=parse&format=json&page=Rosetta_(Grand)&prop=modules%7Cjsconfigvars%7Cheadhtml, which returns a modulestyles array. I'd expect this array to contain:
ext.cite.styles
mediawiki.legacy.commonPrint
It only contains ext.cite.styles, though.
Also, what happens to site.styles?
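For anyone wanting to repeat this check, here is a minimal sketch that builds the same action=parse request and compares the returned modulestyles list against the modules one expects. The function names are hypothetical and not part of MWOffliner; the response shape assumed here is the usual `{"parse": {"modulestyles": [...]}}` layout of the MediaWiki parse API, and no request is actually sent.

```python
from urllib.parse import urlencode


def build_parse_url(api_url, page):
    """Build the action=parse URL described above (hypothetical helper)."""
    params = {
        "action": "parse",
        "format": "json",
        "page": page,
        "prop": "modules|jsconfigvars|headhtml",
    }
    # urlencode percent-encodes "|" as %7C, matching the URL quoted in the thread.
    return f"{api_url}?{urlencode(params)}"


def missing_style_modules(response, expected):
    """Return the expected style modules absent from a parsed JSON response."""
    found = set(response.get("parse", {}).get("modulestyles", []))
    return [name for name in expected if name not in found]


url = build_parse_url("https://gbf.wiki/api.php", "Rosetta_(Grand)")

# Response trimmed to the part discussed in the thread.
sample = {"parse": {"modulestyles": ["ext.cite.styles"]}}
missing = missing_style_modules(
    sample, ["ext.cite.styles", "mediawiki.legacy.commonPrint"]
)
```

With the response reported above, `missing` would contain only mediawiki.legacy.commonPrint.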
@ISNIT0 Definitely more than I can answer. @subbuss @cscott any help welcome.
@ISNIT0 Regarding this ticket, I'm not sure what the status is, as many different problems have been reported across the comments. I'm not even sure how your last question relates to any of the problems reported earlier (the crash). My question would be: what is the status in terms of concrete problems and the related investigation?
MWOffliner completes its execution, Parsoid throws lots of errors. I've opened an issue with them: https://phabricator.wikimedia.org/T209151
The errors happen even when calling Parsoid directly.
I'm postponing this to 1.7, as the work seems to have to be done on the Parsoid side and I do not want this ticket to block the 1.6 release.
MWOffliner completes its execution, Parsoid throws lots of errors. I've opened an issue with them:
The level for these particular logs is warn/api/main, and they should be safe to ignore. See the details at https://phabricator.wikimedia.org/T209151#4763422
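If the noise makes the log hard to read, something like the following could separate the ignorable lines from the rest. This is only a sketch: it assumes the log lines are one JSON object per line with a levelPath field, as in the Parsoid output quoted in this thread, and the function name and the set of ignorable paths are assumptions for illustration.

```python
import json

# Only the level path named in the comment above is treated as ignorable.
IGNORABLE_LEVEL_PATHS = {"warn/api/main"}


def split_log_lines(lines, ignorable=IGNORABLE_LEVEL_PATHS):
    """Split log lines into (kept, ignored) based on their levelPath field."""
    kept, ignored = [], []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Plain-text lines such as "[log] ..." are always kept.
            kept.append(line)
            continue
        if isinstance(record, dict) and record.get("levelPath") in ignorable:
            ignored.append(line)
        else:
            kept.append(line)
    return kept, ignored


sample_lines = [
    '{"name":"mwoffliner","levelPath":"warn/api/main","msg":"a"}',
    '{"name":"mwoffliner","levelPath":"fatal/request","msg":"b"}',
    "[log] All articles were retrieved and saved.",
]
kept, ignored = split_log_lines(sample_lines)
```

Anything that is not valid JSON, or that has a different levelPath, stays in `kept`, so fatal/request entries are never hidden.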
I'm reopening the ticket as I fail to get a ZIM file of gbf.wiki with mwoffliner 1.7:
mwoffliner --mwUrl="https://gbf.wiki/" --mwWikiPath="" --mwApiPath="/api.php" --adminEmail="kelson@kiwix.com" --localMcs --verbose
and at the end I get:
{"name":"mwoffliner","hostname":"camber","pid":19585,"level":60,"logType":"fatal/request","wiki":"wiki$0","title":"Yaia_(Holiday)/Lore","oldId":null,"reqId":null,"userAgent":"service-mobileapp-node","msg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","stack":"","httpStatus":406,"longMsg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","levelPath":"fatal/request","time":"2019-01-14T19:03:18.248Z","v":0}
{"name":"mwoffliner","hostname":"camber","pid":19585,"level":60,"logType":"fatal/request","wiki":"wiki$0","title":"Yaia_(Holiday)/Voice","oldId":null,"reqId":null,"userAgent":"service-mobileapp-node","msg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","stack":"","httpStatus":406,"longMsg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","levelPath":"fatal/request","time":"2019-01-14T19:03:18.249Z","v":0}
{"name":"mwoffliner","hostname":"camber","pid":19585,"level":60,"logType":"fatal/request","wiki":"wiki$0","title":"Yaia_(Holiday)/Strategy","oldId":null,"reqId":null,"userAgent":"service-mobileapp-node","msg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","stack":"","httpStatus":406,"longMsg":"Not acceptable.\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\ntext/html; charset=utf-8; profile=\"https://www.mediawiki.org/wiki/Specs/HTML/[object Object]\"\n","levelPath":"fatal/request","time":"2019-01-14T19:03:18.251Z","v":0}
{"name":"mcs","hostname":"camber","pid":19585,"level":30,"message":"406","status":406,"type":"internal_error","detail":"Not acceptable.\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\n","request_id":"0d9cbb20-182f-11e9-8281-2b1eb777c7f0","request":{"url":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)%2FLore","headers":{"user-agent":"node-fetch/1.0 (+https://github.com/bitinn/node-fetch)","x-request-id":"0d9cbb20-182f-11e9-8281-2b1eb777c7f0"},"method":"GET","params":{"0":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)/Lore"},"query":{},"remoteAddress":"127.0.0.1","remotePort":59498},"levelPath":"info/406","msg":"406","time":"2019-01-14T19:03:18.252Z","v":0}
{"name":"mcs","hostname":"camber","pid":19585,"level":30,"message":"406","status":406,"type":"internal_error","detail":"Not acceptable.\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\n","request_id":"0d9cbb21-182f-11e9-bdb3-751e579ed7c8","request":{"url":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)%2FVoice","headers":{"user-agent":"node-fetch/1.0 (+https://github.com/bitinn/node-fetch)","x-request-id":"0d9cbb21-182f-11e9-bdb3-751e579ed7c8"},"method":"GET","params":{"0":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)/Voice"},"query":{},"remoteAddress":"127.0.0.1","remotePort":59500},"levelPath":"info/406","msg":"406","time":"2019-01-14T19:03:18.252Z","v":0}
{"name":"mcs","hostname":"camber","pid":19585,"level":30,"message":"406","status":406,"type":"internal_error","detail":"Not acceptable.\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\ntext/html; charset=utf-8; profile="https://www.mediawiki.org/wiki/Specs/HTML/[object Object]"\n","request_id":"0d9cbb22-182f-11e9-b12e-a193bbc6fe01","request":{"url":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)%2FStrategy","headers":{"user-agent":"node-fetch/1.0 (+https://github.com/bitinn/node-fetch)","x-request-id":"0d9cbb22-182f-11e9-b12e-a193bbc6fe01"},"method":"GET","params":{"0":"/gbf.wiki/v1/page/mobile-sections/Yaia_(Holiday)/Strategy"},"query":{},"remoteAddress":"127.0.0.1","remotePort":59502},"levelPath":"info/406","msg":"406","time":"2019-01-14T19:03:18.252Z","v":0}
[error] [2019-01-14T19:03:18.253Z] Error handling json response from api. TypeError: Cannot read property 'displaytitle' of undefined
[error] [2019-01-14T19:03:18.253Z] Error handling json response from api. TypeError: Cannot read property 'displaytitle' of undefined
[error] [2019-01-14T19:03:18.253Z] Error handling json response from api. TypeError: Cannot read property 'displaytitle' of undefined
[log] [2019-01-14T19:03:18.253Z] All articles were retrieved and saved.
[log] [2019-01-14T19:03:18.253Z] 0 files still to be downloaded.
[log] [2019-01-14T19:03:18.253Z] Process still downloading images...
[log] [2019-01-14T19:03:19.254Z] All images successfuly downloaded
[log] [2019-01-14T19:03:19.254Z] 0 images still to be optimized.
[log] [2019-01-14T19:03:19.254Z] Process still optimizing images...
[log] [2019-01-14T19:03:20.255Z] All images successfuly optimized
[log] [2019-01-14T19:03:20.286Z] Building ZIM file /tmp/out/wiki_en_all_2019-01.zim (zimwriterfs --welcome=index.htm --favicon=favicon.png --language=eng --welcome=Main_Page.html --verbose --tags="gbf" --name="kiwix.wiki_en_all" --redirects="tmp/mwo-dump-1547492566969/wiki_en_all.redirects" --title="Granblue Fantasy Wiki" --description="From Granblue Fantasy Wiki" --creator="Wiki" --publisher="Kiwix" "/tmp/tmp/wiki_en_all_2019-01/" "/tmp/out/wiki_en_all_2019-01.zim")...
[log] [2019-01-14T19:03:20.286Z] RAID: kiwix.wiki_en_all
[error] [2019-01-14T19:03:20.298Z] zimwriterfs: unable to find welcome page at '/tmp/tmp/wiki_en_all_2019-01/Main_Page.html'. --welcome path/value must be relative to HTML_DIRECTORY.
Failed to run mwoffliner after [33s]: "Failed to build successfuly the ZIM file /tmp/out/wiki_en_all_2019-01.zim (Error when executing zimwriterfs)"
[log] [2019-01-14T19:03:20.299Z] Deleting tmp dump dir [tmp/mwo-dump-1547492566969]
@kelson42 What version of Parsoid do you have installed locally? From the looks of the error, it isn't what's requested in your package.json.
@arlolra We are now using MCS as well as Parsoid; this is probably related to the problem.
As an aside, it looks like GBF.wiki is struggling with large redirect request URLs... Even though the API max says 500, I think the actual string length is breaking things.
I'm noticing that exactly 330 redirect page IDs is the limit at which requests start breaking (character length 5170).
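One possible workaround is to batch the page IDs by string length as well as by count. The sketch below is an assumption-laden illustration, not MWOffliner code: the function name is hypothetical, the 5000-character budget is a guess chosen to sit below the 5170 observed above, and it is not clear from the observation whether that figure measured the ID list alone or the whole URL.

```python
def chunk_ids(page_ids, max_count=500, max_chars=5000, separator="|"):
    """Group IDs so each joined string respects both a count and a length cap."""
    chunks, current, current_len = [], [], 0
    for page_id in page_ids:
        token = str(page_id)
        # A separator is only needed when the chunk already has an entry.
        added = len(token) + (len(separator) if current else 0)
        if current and (len(current) >= max_count or current_len + added > max_chars):
            chunks.append(separator.join(current))
            current, current_len = [token], len(token)
        else:
            current.append(token)
            current_len += added
    if current:
        chunks.append(separator.join(current))
    return chunks


# 400 five-digit IDs with a deliberately small budget to show the splitting.
batches = chunk_ids(range(10000, 10400), max_chars=100)
```

Each element of `batches` can then be passed as the pageids value of a single request; a single ID longer than the budget still gets a chunk of its own rather than being dropped.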
Looks like when we use MCS on the GBF.wiki mainpage we get this error:
{
  "status": 504,
  "type": "api_error",
  "title": "no mobileview in response",
  "detail": {
    "error": {
      "code": "unknown_action",
      "info": "Unrecognized value for parameter 'action': mobileview",
      "docref": "See https://gbf.wiki/api.php for API usage"
    }
  },
  "method": "GET",
  "uri": "/gbf.wiki/v1/page/mobile-sections/Main_Page"
}
I'm not sure how to proceed here... It seems MCS/Parsoid can't help us in this case.
@kelson42 I seem to remember there used to be some logic which would get the non-mobile main-page if necessary, do you think this is sensible?
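To make the idea concrete, a fallback of this kind could look roughly like the sketch below. It is not the logic MWOffliner used to have; the function name and the fetcher callables are hypothetical stand-ins for "ask MCS for the mobile page" and "fetch the desktop HTML some other way".

```python
def fetch_with_fallback(title, fetchers):
    """Try each (name, fetch) pair in order and return the first non-empty HTML."""
    errors = []
    for name, fetch in fetchers:
        try:
            html = fetch(title)
        except Exception as exc:  # a failing source should not abort the scrape
            errors.append(f"{name}: {exc}")
            continue
        if html:
            return name, html
        errors.append(f"{name}: empty response")
    raise RuntimeError(f"No HTML for {title!r} ({'; '.join(errors)})")


def fake_mobile(title):
    # Stand-in for the MCS call that fails on this wiki.
    raise ValueError("no mobileview in response")


def fake_desktop(title):
    # Stand-in for a non-mobile rendering of the page.
    return f"<h1>{title}</h1>"


source, html = fetch_with_fallback(
    "Main_Page", [("mobile", fake_mobile), ("desktop", fake_desktop)]
)
```

Keeping the sources in an ordered list means the main page and ordinary articles could share the same code path, with the desktop source only used when the mobile one fails.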
@ISNIT0 Which ticket was opened on the Phabricator/MCS side regarding its problem scraping the GBF welcome page in mobile mode?
@ISNIT0 Still failing with the same command as in my previous comment:
Error by retrieving article: undefined
[error] [2019-01-20T10:23:40.466Z] Error handling json response from api. Error: No HTML was found
Error by retrieving article: undefined
[error] [2019-01-20T10:23:40.467Z] Error handling json response from api. Error: No HTML was found
Error by retrieving article: undefined
[error] [2019-01-20T10:23:40.467Z] Error handling json response from api. Error: No HTML was found
Error by retrieving article: undefined
[error] [2019-01-20T10:23:40.468Z] Error handling json response from api. Error: No HTML was found
[log] [2019-01-20T10:23:40.468Z] All articles were retrieved and saved.
[log] [2019-01-20T10:23:40.468Z] 0 files still to be downloaded.
[log] [2019-01-20T10:23:40.468Z] Process still downloading images...
[log] [2019-01-20T10:23:41.469Z] All images successfuly downloaded
[log] [2019-01-20T10:23:41.469Z] 0 images still to be optimized.
[log] [2019-01-20T10:23:41.469Z] Process still optimizing images...
[log] [2019-01-20T10:23:42.470Z] All images successfuly optimized
[log] [2019-01-20T10:23:42.506Z] Building ZIM file /tmp/out/wiki_en_all_2019-01.zim (zimwriterfs --welcome=index.htm --favicon=favicon.png --language=eng --welcome=Main_Page.html --verbose --tags="gbf" --name="kiwix.wiki_en_all" --redirects="/tmp/tmp/mwo-dump-1547979772523/wiki_en_all_2019-01.redirects" --title="Granblue Fantasy Wiki" --description="From Granblue Fantasy Wiki" --creator="Wiki" --publisher="Kiwix" "/tmp/tmp/wiki_en_all_2019-01/" "/tmp/out/wiki_en_all_2019-01.zim")...
[log] [2019-01-20T10:23:42.506Z] RAID: kiwix.wiki_en_all
[error] [2019-01-20T10:23:42.515Z] zimwriterfs: unable to find welcome page at '/tmp/tmp/wiki_en_all_2019-01/Main_Page.html'. --welcome path/value must be relative to HTML_DIRECTORY.
Failed to run mwoffliner after [50s]: "Failed to build successfuly the ZIM file /tmp/out/wiki_en_all_2019-01.zim (Error when executing zimwriterfs)"
[log] [2019-01-20T10:23:42.517Z] Deleting tmp dump dir [/tmp/tmp/mwo-dump-1547979772523]
I can't reproduce this. Are you definitely using the latest from master?
See this MCS issue: https://phabricator.wikimedia.org/T214420
@kelson42 This is now fixed
This is still not working properly. The ZIM file can be created, but some articles have empty paragraphs, for example "Olea Plant".
I can't do a complete scrape of GBF.wiki any more; it seems Parsoid is calling process.exit and killing the whole scrape.
I might look into launching Parsoid in a different way so it can't kill MWO, and when it dies, just skip the article and restart Parsoid. This would fix a few different issues, but isn't really doable for 1.9.
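As a rough illustration of that approach, the sketch below runs the renderer as a separate child process, restarts it when it has exited, and records the article as skipped instead of stopping. Everything here is an assumption for illustration: the class and function names are hypothetical, DEMO_COMMAND is a placeholder rather than the real Parsoid start command, and a ConnectionError from the render callable is assumed to be how a dead worker shows up.

```python
import subprocess
import sys

# Placeholder child process; the real Parsoid start command would go here.
DEMO_COMMAND = [sys.executable, "-c", "import time; time.sleep(60)"]


class RestartingWorker:
    """Keep one child process alive, restarting it whenever it has exited."""

    def __init__(self, command):
        self.command = command
        self.process = None
        self.restarts = 0

    def ensure_running(self):
        if self.process is None or self.process.poll() is not None:
            if self.process is not None:
                self.restarts += 1  # the previous child died on its own
            self.process = subprocess.Popen(self.command)
        return self.process

    def stop(self):
        if self.process is not None:
            if self.process.poll() is None:
                self.process.terminate()
            self.process.wait()
            self.process = None


def scrape(titles, worker, render):
    """Render each title; skip the ones during which the worker fails."""
    saved, skipped = {}, []
    for title in titles:
        worker.ensure_running()
        try:
            saved[title] = render(title)
        except ConnectionError:
            # Assumed failure mode when the child exits mid-request.
            skipped.append(title)
    return saved, skipped
```

Because the child runs in its own process, an exit call inside it ends only the child; the loop notices on the next ensure_running call and starts a fresh one.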
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/main","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized parameter: 'iibadfilecontexttitle'","longMsg":"Image Info Request\nUnrecognized parameter: 'iibadfilecontexttitle'","levelPath":"warn/api/main","time":"2019-05-10T18:31:25.083Z","v":0}
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/imageinfo","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized value for parameter 'iiprop': badfile","longMsg":"Image Info Request\nUnrecognized value for parameter 'iiprop': badfile","levelPath":"warn/api/imageinfo","time":"2019-05-10T18:31:25.084Z","v":0}
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/main","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized parameter: 'iibadfilecontexttitle'","longMsg":"Image Info Request\nUnrecognized parameter: 'iibadfilecontexttitle'","levelPath":"warn/api/main","time":"2019-05-10T18:31:25.123Z","v":0}
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/imageinfo","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized value for parameter 'iiprop': badfile","longMsg":"Image Info Request\nUnrecognized value for parameter 'iiprop': badfile","levelPath":"warn/api/imageinfo","time":"2019-05-10T18:31:25.123Z","v":0}
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/main","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized parameter: 'iibadfilecontexttitle'","longMsg":"Image Info Request\nUnrecognized parameter: 'iibadfilecontexttitle'","levelPath":"warn/api/main","time":"2019-05-10T18:31:25.146Z","v":0}
{"name":"mwoffliner","hostname":"ReeveLaptop.local","pid":64006,"level":40,"logType":"warn/api/imageinfo","wiki":"wiki$0","title":"SSR_Character_Tier_List","oldId":85412,"reqId":null,"userAgent":"axios/0.18.0","msg":"Image Info Request Unrecognized value for parameter 'iiprop': badfile","longMsg":"Image Info Request\nUnrecognized value for parameter 'iiprop': badfile","levelPath":"warn/api/imageinfo","time":"2019-05-10T18:31:25.146Z","v":0}
[log] [2019-05-10T18:31:25.180Z] Exiting with code [1]
@kelson42 I'm removing this from 1.9
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
leads to