tatuylonen / wiktextract

Wiktionary dump file parser and multilingual data extractor

Error: URL not allowed to contain a space #134

Closed by wenlinyao 2 years ago

wenlinyao commented 2 years ago

Hi, when I ran the command:

wiktwords --examples --language English --out data.json enwiktionary-20220420-pages-articles.xml.bz2

I got never-ending error messages like:

[string "Module:quote"]:177: URL not allowed to contain a space, but saw pageurl=https://archive.org/details/mrvvilliamshakes00shak/page/ n258 /mode/1up

I also tried to download your processed files from https://kaikki.org/dictionary/English/index.html, but the website is not responding (it is extremely slow). Would it be possible to provide a mirror download site? Thank you!

tatuylonen commented 2 years ago

Could you try the website again? It has a 10Gbit connection, and I have used it myself from Finland, the United States and France without any problems. Perhaps there was a temporary network problem somewhere on the Internet? If the problem continues, where are you located, and could you please provide the output of "traceroute kaikki.org"? (You may need to install traceroute first; on Ubuntu Linux this can be done with "apt install traceroute".)

I can find a total of 114 "URL not allowed to contain a space" errors (i.e., on 0.001% of pages). Most of them seem to be on Vietnamese pages. I'll leave the issue open and try to look into them later; currently I don't yet know whether they are due to a bug or to errors in Wiktionary. The error itself is reported by Lua code that comes from Wiktionary, but there could be an underlying bug, e.g., in the inputs provided to Lua.
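
For anyone who wants to check which URLs trigger this message in their own run, the following is a minimal sketch in Python. It assumes the error output has been captured as text lines and that every message follows the format quoted earlier in this thread ("... but saw pageurl=..."). The names find_space_url_errors and strip_url_whitespace are hypothetical helpers written for this illustration and are not part of wiktextract. Removing the whitespace is only a guess at the intended URL, not a confirmed fix.

```python
import re
from collections import Counter

# Pattern based on the single message quoted in this issue; other
# variants of the message may exist and would not be matched.
ERROR_RE = re.compile(r"URL not allowed to contain a space, but saw (\w+)=(.*)$")


def find_space_url_errors(lines):
    """Return (parameter, url) pairs for each line that matches the error."""
    hits = []
    for line in lines:
        match = ERROR_RE.search(line.rstrip("\n"))
        if match:
            hits.append((match.group(1), match.group(2)))
    return hits


def strip_url_whitespace(url):
    """Remove all whitespace from a URL (an assumed, unverified repair)."""
    return "".join(url.split())


# Sample input: the message from this issue plus one unrelated line.
sample = [
    '[string "Module:quote"]:177: URL not allowed to contain a space, '
    "but saw pageurl=https://archive.org/details/mrvvilliamshakes00shak/page/ n258 /mode/1up",
    "some unrelated log line",
]

hits = find_space_url_errors(sample)

# Count how often each template parameter is involved.
print(Counter(param for param, _ in hits))

# Show each offending URL with its whitespace removed.
for param, url in hits:
    print(param, strip_url_whitespace(url))
```

Grouping the hits by parameter name may help show whether the spaces come from a few templates in the Wiktionary source or appear across many inputs, which would point more toward a bug in how the arguments are passed to Lua.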