I'm going to download and reconstruct the database for fr.wikipedia and then take a look at this.
Reproduction with this wikitext:
wikitext= """
* {{Article|auteur1=Daniel Aranda|titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française|périodique=[[Revue d'Histoire littéraire de la France]]|volume=103|éditeur=[[Presses universitaires de France]]|date=janvier-février 2003|isbn=9782130534655|doi=10.3917/rhlf.031.0111|lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm|pages=111-121|id=aranda2003|plume=oui |issn = 0035-2411 }}
"""
wxr.wtp.start_page("Test page")
text = wxr.wtp.expand(wikitext)
print(text)
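For context, wxr here is a WiktextractContext over the French database, set up roughly like this (a sketch; the db path and import paths are assumptions):

# Assumed setup for the snippet above (a sketch; adjust paths as needed).
from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.wxr_context import WiktextractContext

wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wxr = WiktextractContext(
    Wtp(db_path="fr-wiki-latest.db", lang_code="fr", project="wikipedia"),
    wiki_config,
)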
Note: I built the database from the latest FR dump.
$ python testarsene.py Arsène Lupin: DEBUG: ITALIC not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Héritage/Lupinologie
Arsène Lupin: DEBUG: BOLD not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Héritage/Lupinologie
Arsène Lupin: DEBUG: ITALIC not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Aventures d'Arsène Lupin/Adaptation des aventures d'Arsène Lupin/Pièces de théâtre
Unfortunately I couldn't reproduce with the given script. I redownloaded the dump and rebuilt the database file (remember you have to delete the .db file, even with --skip-extraction).
Testing with the snippet:
$ python testarsene2.py
* <span class="ouvrage" id="aranda2003">Daniel Aranda, « <cite style="font-style:noruniversitaires de France]], <abbr class="abbr" title="volume">vol.</abbr> 103,‎ <time class="nowrap" data-sort-value="2003" datetime="2003">janvier-février 2003</time>, <abbr class="abbr" title="pages">p.</abbr> 111-121 <small style="line-height:1em;">([[International Standard Book Number|ISBN]] [[Spécial:Ouvrages de référence/9782130534655|<span class="nowrap">9782130534655</span>]], [[International Standard Serial Number|ISSN]] <span class="plainlinks noarchive">[https://portal.issn.org/resource/issn/0035-2411 0035-2411]</span>, [[Digital Object Identifier|DOI]] <span class="plainlinks noarchive nowrap">[https://dx.doi.org/10.3917/rhlf.031.0111 10.3917/rhlf.031.0111]</span>, [http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm lire en ligne])</small><span class="Z3988" title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Maurice+Leblanc+et+la+r%C3%A9surgence+de+la+%C2%AB+s%C3%A9rie+%C2%BB+dans+la+litt%C3%A9rature+romanesque+fran%C3%A7aise&rft.jtitle=Revue+d%27Histoire+litt%C3%A9raire+de+la+France&rft.au=Daniel+Aranda&rft.date=2003&rft.volume=103&rft.pages=111-121&rft.isbn=9782130534655&rft.issn=0035-2411&rft_id=info%3Adoi%2F10.3917%2Frhlf.031.0111&rfr_id=info%3Asid%2Ffr.wikipedia.org%3ATest+page"></span></span>.<span class="nowrap" title="Ouvrage utilisé pour la rédaction de l'article"> [[Fichier:Icon_flatdesign_plume.svg|20px|link=|alt=Ouvrage utilisé pour la rédaction de l'article]]</span>
Please try pulling the newest wikitextprocessor. EDIT: Check if the .db file is actually new, in case the .db construction was skipped.
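One quick way to verify that (a sketch):

import os
import time

# If this timestamp predates your rebuild, the old .db was reused.
print(time.ctime(os.path.getmtime("fr-wiki-latest.db")))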
Hmmm... Strange ...
I built the database this way:
cd wiktextract
wiktwords --db-path="../fr-wiki-latest.db" --dump-file-language-code "fr" --skip-extraction ../frwiki-latest-pages-articles.xml.bz2
cd ..
And ran:
$ git pull
Updating c9bbad3..f99c758
Fast-forward
.github/workflows/lint.yml | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
and I still get the error.
And in fr-wiki-latest.db
for Modèle:Article,
SELECT title, namespace_id, redirect_to, need_pre_expand, body, model
FROM pages
WHERE title = 'Modèle:Article'
there is this:
title: Modèle:Article
namespace_id: 10
redirect_to: (empty)
need_pre_expand: 0
body: {{#invoke:Biblio|article}}
model: wikitext
The dump file I got is 6300287989 bytes, and the .db file generated from it is 23015948288 bytes. Your dump file should be the same size (checking the hash isn't worth it; the contents vary between releases, so the size should always change from one dump to the next). The .db file is probably not going to be the same size, because the process doesn't write pages in a deterministic order, but I'm including that number in case it matches, since I had already copy-pasted it.
I have the same size for the dump file (frwiki-latest-pages-articles.xml.bz2). fr-wiki-latest.db size is 23 020 396 544.
Doesn't seem to be due to the DB, but rather to different code? Yet wikitextprocessor & wiktextract are up to date.
-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor
-e git+https://github.com/tatuylonen/wiktextract.git@122811ac909336d2c0fd693175e1b31f53fc6120#egg=wiktextract
If you have installed wiktextract or wikitextprocessor through pip, you might be running those instead of the repo versions.
@xxyzz do you think it could be feasible to add automatic version strings (based on git hashes) into the code that are automatically updated for each commit and which could be printed out ("version xyz of wiktextract, zyx of wikitextprocessor") when running wiktwords?
It's a good idea to add automatic version strings.
For the installation, I did:
git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m pip install -e .
cd ..
# Update: cd wiktextract; git pull; cd ..
git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m pip install -e .
cd ..
# Update: cd wikitextprocessor; git pull; cd ..
I will uninstall both packages and reinstall them.
Successfully installed wikitextprocessor-0.4.96 wiktextract-1.99.7
pip freeze
...
-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor
-e git+https://github.com/tatuylonen/wiktextract.git@7411e9f4a4fa515c0028016f7b5732b0db6ed043#egg=wiktextract
...
Alas, still the same error. Really weird.
I got this error in wikitextprocessor\src\wikitextprocessor\core.py, in the expand function at line 1349:
# Use the Lua sandbox to execute a Lua macro. This will initialize
# the Lua environment and store it in self.lua if it does not
# already exist (it needs to be re-created for each new page).
ret = call_lua_sandbox(self, invoke_args, expander, parent, timeout)
with
invoke_args =('Biblio', 'article')
parent =('Modèle:Article', {'auteur1': 'Daniel Aranda', 'titre': 'Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française', 'périodique': "[[Revue d'Histoire littéraire de la France]]", 'volume': '103', 'éditeur': '[[Presses universitaires de France]]', 'date': 'janvier-février 2003', 'isbn': '9782130534655', 'doi': '10.3917/rhlf.031.0111', 'lire en ligne': 'http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm', 'pages': '111-121', 'id': 'aranda2003', 'plume': 'oui', 'issn': '0035-2411'})
timeout = None
I also don't see any error on the "Arsène Lupin" page...
Some suggestions:
- Install wikitextprocessor in editable mode: python -m pip show wikitextprocessor should display the local git repo path. wikitextprocessor must be installed with --force-reinstall if it's already installed as a dependency of wiktextract; this is documented here: https://github.com/tatuylonen/wiktextract?tab=readme-ov-file#install-from-source (wiktextract always installs wikitextprocessor from the latest commit, but not in editable mode).
- Use process_dump(): process_dump(wtp, "xml.bz2_path", {0, 10, 828}, skip_analyze_templates=True). Don't use the wiktwords command to process a non-Wiktionary dump file; it could overwrite some templates.
- The ITALIC/BOLD debug messages come from a <ref> tag, and this tag is removed in clean_node(). If you still have this error, you could ignore it; the final expanded text should be the same.
- pip freeze and git log could show the commit hash. I think we don't need to show the commit in the output; it'd be awkward to implement and unnecessary.
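For reference, a quick way to check the installed versions from Python (a sketch; importlib.metadata ships with Python 3.8+):

from importlib.metadata import version

# Editable installs still resolve to a package version here; for the exact
# commit hash, git log -1 in the repo remains the authoritative source.
for pkg in ("wiktextract", "wikitextprocessor"):
    print(pkg, version(pkg))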
For the community, some clarifications.
Install wiktextract from the local git repo in editable mode:
git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m pip install --force-reinstall -e .
cd ..
Check that wiktextract is installed in editable mode:
python -m pip show wiktextract
For example:
Name: wiktextract
Version: 1.99.7
Summary: Wiktionary dump file parser and multilingual data extractor
Home-page: https://github.com/tatuylonen/wiktextract
Author:
Author-email: Tatu Ylonen <ylo@clausal.com>
License: MIT License
Location: c:\users\appdata\local\programs\python\python310\lib\site-packages
Editable project location: C:\Users\Dev\Python\WikiExtractor\wiktextract
Requires: levenshtein, nltk, pydantic, wikitextprocessor
Required-by:
Install wikitextprocessor from the local git repo in editable mode:
git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m pip install --force-reinstall -e .
cd ..
Check that wikitextprocessor is installed in editable mode:
python -m pip show wikitextprocessor
For example:
Name: wikitextprocessor
Version: 0.4.96
Summary: Parser and expander for Wikipedia, Wiktionary etc. dump files, with Lua execution support
Home-page: https://github.com/tatuylonen/wikitextprocessor
Author:
Author-email: Tatu Ylonen <ylo@clausal.com>
License: MIT License
Location: c:\users\appdata\local\programs\python\python310\lib\site-packages
Editable project location: C:\Users\Dev\Python\WikiExtractor\wikitextprocessor
Requires: dateparser, lupa, lxml, mediawiki-langcodes, psutil, requests
Required-by: wiktextract
Show the commit hash to verify everything is up to date:
cd wiktextract
git log -1
commit b78692a725ddc06e5ce7e2cf1ab699aba54218e8 (HEAD -> master, origin/master, origin/HEAD)
Merge: 7411e9f4 0c6b7cc9
Author: xxyzz <gitpull@protonmail.com>
Date: Wed Aug 28 13:31:26 2024 +0800
Merge pull request #792 from xxyzz/fr
[fr] call `parse_section()` recursively and remove "réf" template as tag data
=> commit b78692a725ddc06e5ce7e2cf1ab699aba54218e8
python -m pip freeze | grep wiktextract
-e git+https://github.com/tatuylonen/wiktextract.git@b78692a725ddc06e5ce7e2cf1ab699aba54218e8#egg=wiktextract
=> @b78692a725ddc06e5ce7e2cf1ab699aba54218e8
It's OK
cd wikitextprocessor
git log -1
commit f99c7585a16d8039f84080375f4fcc9f3244f6a5 (HEAD -> main, origin/main, origin/HEAD)
Merge: c9bbad3 3944f36
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date: Tue Aug 27 00:30:47 2024 +0000
Merge pull request #300 from tatuylonen/dependabot/github_actions/crate-ci/typos-1.24.1
=> f99c7585a16d8039f84080375f4fcc9f3244f6a5
python -m pip freeze | grep wikitextprocessor
-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor
=> @f99c7585a16d8039f84080375f4fcc9f3244f6a5
It's OK
About process_dump and Wikipedia namespace ids:
process_dump(
    wtp,
    "frwiki-latest-pages-articles.xml.bz2",
    namespace_ids,  # namespace ids; can be found at the start of the dump file
)
As noted, the namespace ids can be found at the beginning of the frwiki-latest-pages-articles.xml file.
<namespaces>
<namespace key="-2" case="first-letter">Média</namespace>
<namespace key="-1" case="first-letter">Spécial</namespace>
<namespace key="0" case="first-letter" />
<namespace key="1" case="first-letter">Discussion</namespace>
<namespace key="2" case="first-letter">Utilisateur</namespace>
<namespace key="3" case="first-letter">Discussion utilisateur</namespace>
<namespace key="4" case="first-letter">Wikipédia</namespace>
<namespace key="5" case="first-letter">Discussion Wikipédia</namespace>
<namespace key="6" case="first-letter">Fichier</namespace>
<namespace key="7" case="first-letter">Discussion fichier</namespace>
<namespace key="8" case="first-letter">MediaWiki</namespace>
<namespace key="9" case="first-letter">Discussion MediaWiki</namespace>
<namespace key="10" case="first-letter">Modèle</namespace>
<namespace key="11" case="first-letter">Discussion modèle</namespace>
<namespace key="12" case="first-letter">Aide</namespace>
<namespace key="13" case="first-letter">Discussion aide</namespace>
<namespace key="14" case="first-letter">Catégorie</namespace>
<namespace key="15" case="first-letter">Discussion catégorie</namespace>
<namespace key="100" case="first-letter">Portail</namespace>
<namespace key="101" case="first-letter">Discussion Portail</namespace>
<namespace key="102" case="first-letter">Projet</namespace>
<namespace key="103" case="first-letter">Discussion Projet</namespace>
<namespace key="104" case="first-letter">Référence</namespace>
<namespace key="105" case="first-letter">Discussion Référence</namespace>
<namespace key="710" case="first-letter">TimedText</namespace>
<namespace key="711" case="first-letter">TimedText talk</namespace>
<namespace key="828" case="first-letter">Module</namespace>
<namespace key="829" case="first-letter">Discussion module</namespace>
<namespace key="2600" case="first-letter">Sujet</namespace>
</namespaces>
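If you'd rather extract them programmatically than eyeball the header, a sketch (the export schema version string is an assumption; check your dump's root element):

import bz2
from lxml import etree

NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # assumed schema version
with bz2.open("frwiki-latest-pages-articles.xml.bz2", "rb") as f:
    for _, elem in etree.iterparse(f, tag=NS + "namespaces"):
        for ns in elem.iterfind(NS + "namespace"):
            print(ns.get("key"), ns.text or "(main)")
        break  # the <namespaces> block sits at the top of the dump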
I found another method in the code:
wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wiki_config.capture_language_codes = ["fr", "mul"]
wxr = WiktextractContext(wtp, wiki_config)
namespace_ids = {
    wtp.NAMESPACE_DATA.get(name, {}).get("id", 0)
    for name in wxr.config.save_ns_names
}
which gives the following set: {0, 100, 4, 106, 10, 14, 110, 828}
Which namespace id list should I use, given that the values 110 and 106 do not exist in the dump file?
Use {0, 10, 828}; you could add other ids if you want to process them.
AFAIK, those are the pages you want to keep in the database file; so if you don't want to collect "Discussion module" pages, that namespace is left out. Modules, Modèles, main pages. EDIT: :ninja:
from wikitextprocessor import Wtp
from wikitextprocessor.dumpparser import process_dump

if __name__ == "__main__":
    wtp = Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    )
    namespace_ids = {0, 10, 828}
    process_dump(
        wtp,
        "frwiki-latest-pages-articles.xml.bz2",
        namespace_ids,
    )
    print(f"# Wikipedia pages collected: {wtp.saved_page_nums()}")
....
2024-08-28 12:37:30,167 INFO: ... 4680000 raw pages collected
2024-08-28 12:51:30,481 INFO: Analyzing which templates should be expanded before parsing
# Wikipedia pages collected: 4684570
fr-wiki-latest.db size is 23 015 948 288. Same as your database, @kristian-clausal.
Alas! After generating a new sqlite database file by calling process_dump(), I have the same error.
wikitext= """
{{Article|auteur1=Daniel Aranda|titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française|périodique=[[Revue d'Histoire littéraire de la France]]|volume=103|éditeur=[[Presses universitaires de France]]|date=janvier-février 2003|isbn=9782130534655|doi=10.3917/rhlf.031.0111|lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm|pages=111-121|id=aranda2003|plume=oui |issn = 0035-2411 }} """
wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wiki_config.capture_language_codes = ["fr", "mul"]
wxr = WiktextractContext(
    wtp=Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    ),
    config=wiki_config,
)
wxr.wtp.start_page("Test page")
wiki_nodes = wxr.wtp.parse(text=wikitext)
text = clean_node(
    wxr=wxr,
    sense_data={},
    wikinode=wiki_nodes,
)
print(text)
Test page: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Daniel Aranda', 'titre': 'Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française', 'périodique': "[[Revue d'Histoire littéraire de la France]]", 'volume': '103', 'éditeur': '[[Presses universitaires de France]]', 'date': 'janvier-février 2003', 'isbn': '9782130534655', 'doi': '10.3917/rhlf.031.0111', 'lire en ligne': 'http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm', 'pages': '111-121', 'id': 'aranda2003', 'plume': 'oui', 'issn': '0035-2411'}) at ['Test page', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Article
I'm sorry, besides double-checking everything, I don't know what could be the cause. I understand your frustration (I myself end up in situations like this a lot, even with wiktextract and wikitextprocessor).
I don't think these have been mentioned in the thread yet:
- venv?
- fr-wiki-latest.db?
For me, it usually turns out to be something like this. It's the Anna Karenina principle: all working cloned repos work the same way, but every broken cloned repo is broken in its own way...
It's almost impossible to know the cause of the error without a traceback. I can only guess that this might be a Windows problem (the default encoding is not utf8); try Linux...
I suspect this is not an encoding error, but, as you say, it could be a Windows problem. I will try to investigate a little more. Is it possible to enable tracebacks?
@kristian-clausal
The code can't show a Lua traceback when the error happens inside a MediaWiki Lua module, due to a Lua 5.1 API limitation.
I think I found the reason for this error.
Test Code:
wikitext= """
{{Article
|titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française
|périodique=[[Revue d'Histoire littéraire de la France]]
|date=janvier-février 2003
|auteur1=Daniel Aranda
|volume=103
|éditeur=[[Presses universitaires de France]]
|isbn=9782130534655
|doi=10.3917/rhlf.031.0111
|lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm
|pages=111-121
|id=aranda2003
|plume=oui
|issn = 0035-2411
}}
"""
wxr.wtp.start_page("Test page")
wiki_nodes = wxr.wtp.parse(text=wikitext)
text = clean_node(
    wxr=wxr,
    sense_data={},
    wikinode=wiki_nodes,
)
print(text)
produces the error.
If I replace |date=janvier-février 2003 with |date=janvier-fevrier 2003, there are no more errors, and I get the following text:
Daniel Aranda, « Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française », Revue d'Histoire littéraire de la France, Presses universitaires de France, vol. 103, janvier-février 2003, p. 111-121 (ISBN 9782130534655, ISSN 0035-2411, DOI 10.3917/rhlf.031.0111, lire en ligne). [Alt: Ouvrage utilisé pour la rédaction de l'article]
Note: in that output the date is correctly formatted, with the accent: vol. 103, janvier-février 2003, p. 111-121
Could accents be getting misinterpreted under Windows?
In this date field, would it be possible to replace accented characters with unaccented ones, e.g. é -> e, û -> u?
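That kind of replacement is straightforward on the Python side, for what it's worth (a sketch using NFD decomposition):

import unicodedata

def strip_accents(s: str) -> str:
    # Decompose é into e + a combining accent, then drop the combining marks.
    return "".join(
        c for c in unicodedata.normalize("NFD", s)
        if not unicodedata.combining(c)
    )

print(strip_accents("janvier-février 2003"))  # janvier-fevrier 2003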
No idea how the hell Windows could screw up the encoding.
I have checked the dumpparser.py code and the python, lxml, and sqlite docs, and I still don't have a clue. Damn Windows.
Could you check that the text data in the sqlite db is in utf8 encoding, and that your python code files are also in utf8 encoding?
Do you have the "lbzcat" or "bzcat" command installed?
Also check your terminal's encoding.
Maybe you could do us both a favor and try Linux...
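For the terminal check, a sketch of printing the relevant encodings from Python:

import locale
import sys

print(sys.stdout.encoding)            # what the terminal/console uses
print(locale.getpreferredencoding())  # default encoding used by open()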
Try this and see if it helps (create a new db file):
diff --git a/src/wikitextprocessor/dumpparser.py b/src/wikitextprocessor/dumpparser.py
index 232abd9..16bd340 100644
--- a/src/wikitextprocessor/dumpparser.py
+++ b/src/wikitextprocessor/dumpparser.py
@@ -25,13 +25,16 @@ def decompress_dump_file(
) -> Union[subprocess.Popen, bz2.BZ2File]:
if dump_path.endswith(".bz2"):
if shutil.which("lbzcat") is None and shutil.which("bzcat") is None:
- return bz2.open(dump_path, "rb")
+ return bz2.open(dump_path, "rt", encoding="utf-8")
decompress_command = (
"lbzcat" if shutil.which("lbzcat") is not None else "bzcat"
)
p = subprocess.Popen(
- [decompress_command, dump_path], stdout=subprocess.PIPE
+ [decompress_command, dump_path],
+ stdout=subprocess.PIPE,
+ text=True,
+ encoding="utf-8",
)
if p.stdout is not None:
return p
If this turns out to be a Windows-specific encoding issue, thank you for bringing it to our attention. Hopefully xxyzz's fix will be applicable!
Python code files are also in utf8 encoding: yes.
The text data in the sqlite db is in utf8 encoding: yes.
I tested the encoding with this pragma: PRAGMA encoding;
This pragma returns the text encoding: UTF-8
The "bzcat" command is installed on my system; I will test xxyzz's fix.
With python this uses bz2.open. With bz2.open(dump_path, "rt", encoding="utf-8") I have the error:
File "GenerateDB.py", line 14, in <module>
process_dump(
File "D:\Developpement\Python\WikiExtractor\wikitextprocessor\src\wikitextprocessor\dumpparser.py", line 122, in process_dump
parse_dump_xml(wtp, path, namespace_ids)
File "D:\Developpement\Python\WikiExtractor\wikitextprocessor\src\wikitextprocessor\dumpparser.py", line 54, in parse_dump_xml
for _, page_element in etree.iterparse(
File "src\\lxml\\iterparse.pxi", line 208, in lxml.etree.iterparse.__next__
File "src\\lxml\\iterparse.pxi", line 193, in lxml.etree.iterparse.__next__
File "src\\lxml\\iterparse.pxi", line 221, in lxml.etree.iterparse._read_more_events
TypeError: reading file objects must return bytes objects
Which I don't have with return bz2.open(dump_path, "rb"). (lxml's etree.iterparse expects a binary file object, so the text-mode stream fails here.)
Don't check PRAGMA's result; check the text data's encoding. Python's sqlite3 docs say data in a non-UTF-8 encoding can still be inserted.
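Something along these lines could verify it (a sketch; it assumes the pages table layout shown earlier in the thread):

import sqlite3

conn = sqlite3.connect("fr-wiki-latest.db")
conn.text_factory = bytes  # return raw bytes instead of decoded strings
bad = 0
for title, body in conn.execute(
    "SELECT title, body FROM pages WHERE body IS NOT NULL"
):
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        bad += 1
        print("Not UTF-8:", title.decode("utf-8", "replace"))
print(bad, "pages with non-UTF-8 text")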
Ahh... I'm out of ideas.
I will manually extract frwiki-latest-pages-articles.xml.bz2 & call process_dump with skip_extract_dump set to True.
Err... If you set skip_extract_dump to True, then how the .bz2 file is decompressed makes no difference; it's not used.
OK. frwiki-latest-pages-articles.xml is UTF-8 encoded. I checked with the Python chardet module:
$ chardetect frwiki-latest-pages-articles.xml
frwiki-latest-pages-articles.xml: utf-8 with confidence 0.99
Same... I'm out of ideas. Maybe a mistake in the Lua code?
I know a little bit about Lua, and I see in some Lua modules the code ustring = "ustring:ustring" & local ustring = require("ustring:ustring"). I don't know this syntax with the colon (:) in require. Can you explain it to me?
I think you need to confirm the encoding of the text data inserted into the sqlite db file first.
The encoding of the text data inserted into sqlite is UTF-8. I suspect there is a problem between Python/lupa and the Lua code. I did this in wikitextprocessor\src\wikitextprocessor\lua\mw_text.lua:
function mw_text.trim(s, charset)
    print(s)
    charset = charset or "\r\n\t\f "
    local ret = mw.ustring.gsub(s, "^[" .. charset .. "]*(.-)[" ..
        charset .. "]*$", "%1")
    return ret
end
Test-utf8.py:
wikitext = """
{{Lien web
|titre=Germanophobie : le retour des revanchards
|url=http://www.slate.fr/story/47141/boches
|date=14 decembre 2011
|consulté le= 6 octobre 2019
}}
"""
wxr.wtp.start_page("Test page")
wiki_nodes = wxr.wtp.parse(text=wikitext)
text = clean_node(
    wxr=wxr,
    sense_data={},
    wikinode=wiki_nodes,
)
print(text)
Output:
14 decembre 2011
14 decembre 2011
2011
decembre
decembre
14
2011
d��cembre
Test page: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Germanophobie : le retour des revanchards', 'url': 'http://www.slate.fr/story/47141/boches', 'date': '14 decembre 2011', 'consulté le': '6 octobre 2019'}) at ['Test page', 'Lien web', '#invoke', '#invoke']
[string "mw_text"]:82: bad argument #1 for 'gsub' (string is not UTF-8)
Rhhhooo why d��cembre?
So this is not a problem between Python/lupa -> Lua! It's in the Lua code itself. But without a Lua traceback, it's not easy to debug.
Test code:
wxr.wtp.add_page("Modèle:test-template", 10, "{{#invoke:test|functest}}")
wxr.wtp.add_page(
"Module:test",
828,
"""
local export = {}
-- Print contents of `tbl`, with indentation.
-- `indent` sets the initial level of indentation.
function tprint (tbl, indent)
if not indent then indent = 0 end
for k, v in pairs(tbl) do
formatting = string.rep(" ", indent) .. k .. ": "
if type(v) == "table" then
print(formatting)
tprint(v, indent+1)
elseif type(v) == 'boolean' then
print(formatting .. tostring(v))
else
print(formatting .. v)
end
end
end
function export.functest(frame)
local args = frame:getParent().args
tprint(args)
return tostring(frame.args[0])
end
return export
""",
)
wikitext = """
{{test-template
|titre=Germanophobie : le retour des revanchards
|url=http://www.slate.fr/story/47141/boches
|date=14 decembre 2011
|consulté le= 6 octobre 2019
}}
"""
wxr.wtp.start_page("")
expanded = wxr.wtp.expand(wikitext)
print(expanded)
Output:
date: 14 decembre 2011
url: http://www.slate.fr/story/47141/boches
titre: Germanophobie : le retour des revanchards
consulté le: 6 octobre 2019
You could edit the called Lua module pages in the sqlite db and add some print calls to find where the error happens. Maybe somewhere the code calls one of our Lua functions that doesn't handle the encoding properly.
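For example, a sketch of patching a stored module body directly in the db (the module name is just an example; the table/column names are taken from the query earlier in the thread):

import sqlite3

conn = sqlite3.connect("fr-wiki-latest.db")
title = "Module:Biblio"  # example: the module being #invoke'd
(body,) = conn.execute(
    "SELECT body FROM pages WHERE title = ?", (title,)
).fetchone()
# Prepend a print() so the module reports when it is loaded.
conn.execute(
    "UPDATE pages SET body = ? WHERE title = ?",
    ("print('loading module')\n" + body, title),
)
conn.commit()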
I think I found the reason for this error.
Module:Biblio/Lien web, line 268:
local dateFormatee = Commun.inscriptionDate( args )
Module:Biblio/Commun, line 488:
if date then
    date = date:lower()
14 décembre 2011 to lower -> 14 d��cembre 2011
Paf! mw.ustring.match (Module:Biblio/Commun, line 498) doesn't handle the encoding properly, which then causes the error bad argument #1 for 'gsub' (string is not UTF-8). On the contrary, string.match handles the encoding properly.
Wikipedia should also have this error and return the date value as is.
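For intuition, a toy Python illustration (not the actual Windows/Lua code path) of how a byte-wise, locale-dependent lower() corrupts UTF-8:

# é is the UTF-8 byte pair 0xC3 0xA9; an 8-bit tolower() sees 0xC3 as
# Latin-1 Ã and remaps it, which breaks the multi-byte sequence.
raw = "14 décembre 2011".encode("utf-8")
broken = bytes(
    b + 0x20 if (0x41 <= b <= 0x5A) or (0xC0 <= b <= 0xDE and b != 0xD7) else b
    for b in raw
)
print(broken.decode("utf-8", errors="replace"))  # 14 d�cembre 2011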
Would it be possible to do the same with wikitextprocessor? How can Lua errors be caught?
POC
wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wiki_config.capture_language_codes = ["fr", "mul"]
wxr = WiktextractContext(
    wtp=Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    ),
    config=wiki_config,
)
wxr.wtp.add_page("Modèle:test-template", 10, "{{#invoke:test|functest}}")
wxr.wtp.add_page(
    "Module:test",
    828,
    """
local export = {}
local Lien_web = require( 'Module:Biblio/Lien web' )
local Commun = require( 'Module:Biblio/Commun' )

-- Print contents of `tbl`, with indentation.
-- `indent` sets the initial level of indentation.
function tprint (tbl, indent)
    if not indent then indent = 0 end
    for k, v in pairs(tbl) do
        formatting = string.rep(" ", indent) .. k .. ": "
        if type(v) == "table" then
            print(formatting)
            tprint(v, indent+1)
        elseif type(v) == 'boolean' then
            print(formatting .. tostring(v))
        else
            print(formatting .. v)
        end
    end
end

function export.functest(frame)
    local args = frame:getParent().args
    tprint(args)
    local date = Commun.validTextArg( args, 'date' )
    date = string.lower(date)
    --local mois, jour, annee = mw.ustring.match( date, '^([%a]+)%s*(%d%d?)[,%s]+(%d+)$' ) -- ERROR
    local mois, jour, annee = string.match( date, '^([%a]+)%s*(%d%d?)[,%s]+(%d+)$' ) -- NO ERROR
end

return export
""",
)
wikitext = """
{{test-template
|titre=Germanophobie : le retour des revanchards
|url=http://www.slate.fr/story/47141/boches
|date=14 décembre 2011
|consulté le= 6 octobre 2019
}}
"""
wxr.wtp.start_page("")
expanded = wxr.wtp.expand(wikitext)
We use the same mw.ustring code from Scribunto; the problem is that Lua's string.lower() can't process unicode strings. I think you have to use Linux...
You could try manually changing all date:lower() to date:ulower() in the Lua code; maybe this could fix the error. But you will have more similar errors elsewhere.
If Lua unicode manipulation is broken on Windows, that's a problem.
We do a lot of code-manipulation (or did, at least) on Lua code to make it more compatible. Some of it is string manipulation; sometimes we replace Lua functions with our own. The problem is always that it's hard to get it all perfectly correct so that nothing breaks; if we replace string.lower() with our own in Python, can we guarantee that it returns a correct value?
That's not a problem. Lua's string library can't handle unicode; that's why they use the ustring library. If we replace it, we'll be incompatible with MediaWiki.
Then why does string.lower() return correct unicode on Linux, on our machines and on fr.wiktionary.org? date seems to be a normal string.
IDK why on Linux string.lower() behaves like ustring.lower(); all Wikimedia servers run on Linux, so this problem is not noticeable anywhere.
Do you think it would be possible to replace string.lower() (and other string methods) with others? We do replacements for Scribunto-specific libraries, but I don't remember and can't quickly find any replacements for basic Lua standard library stuff. We could put it behind a toggle, like --use-unicode-strings.
EDIT: Nevermind, it was in _sandbox_phase1.lua, we replace .gsub with our own.
EDIT: Double nevermind, string.gsub is saved into _orig_gsub for some reason, and never used?
I don't recommend wasting more time on this... The whole string library is not meant to handle unicode characters; returning unicode characters would cause more problems.
Tatu said that getting wikitextprocessor/wiktextract working on Windows is a low priority (also considering that multiprocessing doesn't work on Windows), so I guess I'll be closing this issue, unfortunately.
Multiprocessing works on Windows now. The problem in this issue is that a French Wikipedia Lua module uses the wrong API.
@LeMoussel if you want to try to figure something out regarding this error specifically, take a look at the code in src/wikitextprocessor/luaexec.py and src/wikitextprocessor/lua/_sandbox_phase1.lua and _sandbox_phase2.lua. It might be possible to make a function wrapper around gsub (__orig_gsub being called inside a wrapper function) so that the string is converted back to utf-8 before being fed to the original gsub. There's a bunch of these kinds of functions and wrappers (I think, might have been removed at some point) that you can make in python and pass into the lua code. This is just a stop-gap measure, however, and it would be pretty messy.
That's not good advice... I don't think you can convert them back to utf8, because the string bytes are somehow changed by string.lower. The correct action is fixing the wrong Lua code on Wikipedia.
I switched to Linux. No errors.
Page: Arsène Lupin. Error:
Test code: