tatuylonen / wikitextprocessor

Python package for Wikimedia dump processing (Wiktionary, Wikipedia, etc.): wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Error: bad argument #1 for 'gsub' (string is not UTF-8) #301

Closed. LeMoussel closed this issue 1 week ago.

LeMoussel commented 2 weeks ago

Page: Arsène Lupin

Error:

Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'auteur1': '[[Jacques Derouard]]', 'auteur2': 'François Busnel', 'auteur3': 'Philippe Delaroche', 'titre': "Comment est né le vrai, l'unique Arsène Lupin", 'périodique': "[[L'Express]]", 'date': '1 septembre 2004', 'lire en ligne': 'https://www.lexpress.fr/culture/livre/comment-est-ne-le-vrai-l-unique-arsene-lupin_809439.html', 'consulté le': '9 décembre 2017'}) at ['Arsène Lupin', 'lien web', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Germanophobie : le retour des revanchards', 'url': 'http://www.slate.fr/story/47141/boches', 'date': '14 décembre 2011', 'site': '[[Slate (magazine)|Slate]]', 'consulté le': '6 octobre 2019'}) at ['Arsène Lupin', 'Lien web', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Maurice Leblanc', 'titre': "Maurice Leblanc nous parle d'Arsène Lupin", 'périodique': 'Gazette de Bayonne, de Biarritz et du Pays basque', 'date': '13 octobre 1932', 'pages': '2', 'lire en ligne': 'https://www.retronews.fr/journal/gazette-de-bayonne-de-biarritz-et-du-pays-basque/13-octobre-1932/343/1242915/2', 'consulté le': '5 décembre 2017'}) at ['Arsène Lupin', 'Citation bloc', 'ARGVAL-2', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': '[[François Forestier]]', 'titre': 'Arsène Lupin est de retour', 'sous-titre': 'qui était vraiment Maurice Leblanc ?', 'périodique': 'BibliObs', 'jour': '11', 'mois': '12', 'année': '2017', 'lire en ligne': 'https://bibliobs.nouvelobs.com/polar/20111215.OBS6821/arsene-lupin-est-de-retour-qui-etait-vraiment-maurice-leblanc.html', 'consulté le': '12 février 2018'}) at ['Arsène Lupin', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Gaston de Pawlowski', 'titre': "L'honnêteté c'est le vol", 'périodique': '[[Comœdia (journal)|Comœdia]]', 'lieu': 'Paris', 'numéro': '394', 'jour': '28', 'mois': '10', 'année': '1908', 'pages': '1', 'lire en ligne': 'https://gallica.bnf.fr/ark:/12148/bpt6k76460477.item', 'consulté le': '13 février 2018'}) at ['Arsène Lupin', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'nom1': 'Robert Belleret', 'titre': 'Au pays de monsieur Lupin', 'périodique': 'Le Monde', 'lieu': 'Paris', 'jour': '22', 'mois': '08', 'année': '2005', 'lire en ligne': 'https://www.lemonde.fr/culture/article/2005/08/22/au-pays-de-monsieur-lupin_681838_3246.html', 'consulté le': '17 février 2018'}) at ['Arsène Lupin', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'périodique': 'Je sais tout', 'lire en ligne': 'https://gallica.bnf.fr/ark:/12148/bpt6k1029808/f720.item', 'jour': '15', 'mois': 'février', 'année': '1907', 'titre': 'Arsène Lupin, Gentleman-Cambrioleur', 'passage': '717'}) at ['Arsène Lupin', '[[link]]', 'article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Pierre Gervasoni', 'titre': 'Avec « Arsène Lupin banquier », les Brigands se refont une santé', 'périodique': '[[lemonde.fr]]', 'date': '27 décembre 2007', 'lire en ligne': 'https://www.lemonde.fr/culture/article/2007/12/27/avec-arsene-lupin-banquier-les-brigands-se-refont-une-sante_993905_3246.html', 'consulté le': '25 mars 2021'}) at ['Arsène Lupin', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'ouvrage') parent ('Modèle:Ouvrage', {'langue': 'en', 'prénom1': 'Jess', 'nom1': 'Nevins', 'préface': 'Alan Moore', 'titre': 'Impossible Territories', 'sous-titre': 'The Unofficial Companion to the League of Extraordinary Gentlemen, The Black Dossier', 'lieu': '[[Austin (Texas)]]', 'éditeur': 'MonkeyBrain', 'mois': 'août', 'année': '2008', 'pages totales': '208', 'isbn': '978-1-932265-24-8'}) at ['Arsène Lupin', 'ouvrage', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Arsène Lupin: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Daniel Aranda', 'titre': 'Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française', 'périodique': "[[Revue d'Histoire littéraire de la France]]", 'volume': '103', 'éditeur': '[[Presses universitaires de France]]', 'date': 'janvier-février 2003', 'isbn': '9782130534655', 'doi': '10.3917/rhlf.031.0111', 'lire en ligne': 'http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm', 'pages': '111-121', 'id': 'aranda2003', 'plume': 'oui', 'issn': '0035-2411'}) at ['Arsène Lupin', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)

Test code:


# https://github.com/tatuylonen/wikitextprocessor/
from wikitextprocessor import Wtp

# https://github.com/tatuylonen/wiktextract
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node

wiki_config = WiktionaryConfig()
wiki_config.dump_file_lang_code = "fr"
wiki_config.capture_language_codes = ["fr", "mul"]
wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
        ),
        config=wiki_config,
)

for wiki_page in wxr.wtp.db_conn.execute(
        """
        SELECT title, body
        FROM pages
        WHERE title = 'Arsène Lupin'
        AND redirect_to IS NULL
        """
    ):
        wikipedia_page_title = wiki_page[0]
        wikipedia_page_wikitext = wiki_page[1]
        wxr.wtp.start_page(wikipedia_page_title)

        wiki_nodes = wxr.wtp.parse(text=wikipedia_page_wikitext)
        page_text_content = clean_node(
            wxr=wxr,
            sense_data={},
            wikinode=wiki_nodes,
            collect_links=False,
        )
kristian-clausal commented 2 weeks ago

I'm going to download and reconstruct the database for fr.wikipedia and then take a look at this.

LeMoussel commented 2 weeks ago

Reproduction with this wikitext:

    wikitext= """
* {{Article|auteur1=Daniel Aranda|titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française|périodique=[[Revue d'Histoire littéraire de la France]]|volume=103|éditeur=[[Presses universitaires de France]]|date=janvier-février 2003|isbn=9782130534655|doi=10.3917/rhlf.031.0111|lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm|pages=111-121|id=aranda2003|plume=oui |issn = 0035-2411 }}
    """
    wxr.wtp.start_page("Test page")
    text = wxr.wtp.expand(wikitext)
    print(text)

Note: I built the database from the latest FR dump.

kristian-clausal commented 2 weeks ago
$ python testarsene.py
Arsène Lupin: DEBUG: ITALIC not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Héritage/Lupinologie
Arsène Lupin: DEBUG: BOLD not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Héritage/Lupinologie
Arsène Lupin: DEBUG: ITALIC not properly closed on the same line at ['Arsène Lupin'] parsing Arsène Lupin/Aventures d'Arsène Lupin/Adaptation des aventures d'Arsène Lupin/Pièces de théâtre

Unfortunately, I couldn't reproduce this with the given script. I redownloaded the dump and rebuilt the database file (remember that you have to delete the .db file, even with --skip-extraction).

Testing with the snippet:

$ python testarsene2.py

* <span class="ouvrage" id="aranda2003">Daniel Aranda, « <cite style="font-style:noruniversitaires de France]], <abbr class="abbr" title="volume">vol.</abbr>&nbsp;103,&lrm; <time class="nowrap" data-sort-value="2003" datetime="2003">janvier-février 2003</time>, <abbr class="abbr" title="pages">p.</abbr>&nbsp;111-121 <small style="line-height:1em;">([[International Standard Book Number|ISBN]]&nbsp;[[Spécial:Ouvrages de référence/9782130534655|<span class="nowrap">9782130534655</span>]], [[International Standard Serial Number|ISSN]]&nbsp;<span class="plainlinks noarchive">[https://portal.issn.org/resource/issn/0035-2411 0035-2411]</span>, [[Digital Object Identifier|DOI]]&nbsp;<span class="plainlinks noarchive nowrap">[https://dx.doi.org/10.3917/rhlf.031.0111 10.3917/rhlf.031.0111]</span>, [http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm lire en ligne])</small><span class="Z3988" title="ctx_ver=Z39.88-2004&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Maurice+Leblanc+et+la+r%C3%A9surgence+de+la+%C2%AB+s%C3%A9rie+%C2%BB+dans+la+litt%C3%A9rature+romanesque+fran%C3%A7aise&rft.jtitle=Revue+d%27Histoire+litt%C3%A9raire+de+la+France&rft.au=Daniel+Aranda&rft.date=2003&rft.volume=103&rft.pages=111-121&rft.isbn=9782130534655&rft.issn=0035-2411&rft_id=info%3Adoi%2F10.3917%2Frhlf.031.0111&rfr_id=info%3Asid%2Ffr.wikipedia.org%3ATest+page"></span></span>.<span class="nowrap" title="Ouvrage utilisé pour la rédaction de l'article"> [[Fichier:Icon_flatdesign_plume.svg|20px|link=|alt=Ouvrage utilisé pour la rédaction de l'article]]</span>

Please try pulling the newest wikitextprocessor. EDIT: Check if the .db file is actually new, in case the .db construction was skipped.

LeMoussel commented 2 weeks ago

Hmmm... strange...

I build the database this way:

cd wiktextract
wiktwords --db-path="../fr-wiki-latest.db" --dump-file-language-code "fr" --skip-extraction ../frwiki-latest-pages-articles.xml.bz2
cd ..

And then run:

$ git pull
Updating c9bbad3..f99c758
Fast-forward
 .github/workflows/lint.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

and I still get the error.

LeMoussel commented 2 weeks ago

And in fr-wiki-latest.db for Modèle:Article,

SELECT title, namespace_id, redirect_to, need_pre_expand, body, model
FROM pages
WHERE title = 'Modèle:Article'

there is this:

title           namespace_id  need_pre_expand  body                        model
Modèle:Article  10            0                {{#invoke:Biblio|article}}  wikitext
kristian-clausal commented 2 weeks ago

The dump file I got is 6300287989 bytes, and the .db file generated from it is 23015948288 bytes. Your dump file size should be the same (checking a hash is not worth it; dumps vary enough that the size alone should always change between dumps). The .db file is probably not going to be the same size, because the process doesn't run in a deterministic order, but I'm including that figure in case it matches, since I had already copy-pasted it.

LeMoussel commented 2 weeks ago

I have the same size for the dump file (frwiki-latest-pages-articles.xml.bz2); my fr-wiki-latest.db size is 23 020 396 544.

It doesn't seem to be due to the DB, then, but rather to different code? Yet wikitextprocessor & wiktextract are up to date:

-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor
-e git+https://github.com/tatuylonen/wiktextract.git@122811ac909336d2c0fd693175e1b31f53fc6120#egg=wiktextract
kristian-clausal commented 2 weeks ago

If you have installed wiktextract or wikitextprocessor through pip, you might be running those instead of the repo versions.

@xxyzz do you think it could be feasible to add automatic version strings (based on git hashes) into the code that are automatically updated for each commit and which could be printed out ("version xyz of wiktextract, zyx of wikitextprocessor") when running wiktwords?
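Something like this could work. A rough sketch (`git_describe` is a hypothetical helper, not an existing wiktextract/wikitextprocessor API, and it assumes both packages are editable installs from git checkouts):

# Hypothetical sketch: print the installed commit of each package at startup.
import subprocess
from pathlib import Path

import wikitextprocessor
import wiktextract

def git_describe(package_dir: Path) -> str:
    """Return the commit hash of the git checkout containing package_dir."""
    try:
        return subprocess.check_output(
            ["git", "-C", str(package_dir), "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown (not a git checkout)"

for mod in (wikitextprocessor, wiktextract):
    print(f"{mod.__name__} @ {git_describe(Path(mod.__file__).parent)}")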

LeMoussel commented 2 weeks ago

It's a good idea to add automatic version strings.

For the installation, I did:

git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m pip install -e .
cd ..
# To update: cd wiktextract; git pull; cd ..

git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m pip install -e .
cd ..
# To update: cd wikitextprocessor; git pull; cd ..

I will uninstall both packages and reinstall them.

LeMoussel commented 2 weeks ago

Successfully installed wikitextprocessor-0.4.96 wiktextract-1.99.7

pip freeze
...
-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor
-e git+https://github.com/tatuylonen/wiktextract.git@7411e9f4a4fa515c0028016f7b5732b0db6ed043#egg=wiktextract
...

Alas, still the same error. Really weird.

LeMoussel commented 2 weeks ago

I get this error in wikitextprocessor\src\wikitextprocessor\core.py, in the expand function at line 1349:

            # Use the Lua sandbox to execute a Lua macro.  This will initialize
            # the Lua environment and store it in self.lua if it does not
            # already exist (it needs to be re-created for each new page).
            ret = call_lua_sandbox(self, invoke_args, expander, parent, timeout)

with

invoke_args =('Biblio', 'article')
parent =('Modèle:Article', {'auteur1': 'Daniel Aranda', 'titre': 'Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française', 'périodique': "[[Revue d'Histoire littéraire de la France]]", 'volume': '103', 'éditeur': '[[Presses universitaires de France]]', 'date': 'janvier-février 2003', 'isbn': '9782130534655', 'doi': '10.3917/rhlf.031.0111', 'lire en ligne': 'http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm', 'pages': '111-121', 'id': 'aranda2003', 'plume': 'oui', 'issn': '0035-2411'})
timeout = None
xxyzz commented 2 weeks ago

I also don't see any error on the "Arsène Lupin" page...

Some suggestions:

pip freeze and git log can already show the commit hash; I don't think we need to show the commit in the output, it'd be awkward to implement and unnecessary.

LeMoussel commented 2 weeks ago

For the community, some clarifications.

Install wiktextract from the local git repo in editable mode:

git clone https://github.com/tatuylonen/wiktextract.git
cd wiktextract
python -m pip install --force-reinstall -e .
cd ..

Check that wiktextract is installed in editable mode:

python -m pip show wiktextract

For example:

Name: wiktextract
Version: 1.99.7
Summary: Wiktionary dump file parser and multilingual data extractor
Home-page: https://github.com/tatuylonen/wiktextract
Author:
Author-email: Tatu Ylonen <ylo@clausal.com>
License: MIT License
Location: c:\users\appdata\local\programs\python\python310\lib\site-packages
Editable project location: C:\Users\Dev\Python\WikiExtractor\wiktextract
Requires: levenshtein, nltk, pydantic, wikitextprocessor
Required-by:

Install wikitextprocessor from the local git repo in editable mode:

git clone --recurse-submodules --shallow-submodules https://github.com/tatuylonen/wikitextprocessor.git
cd wikitextprocessor
python -m pip install --force-reinstall -e .
cd ..

Check that wikitextprocessor is installed in editable mode:

python -m pip show wikitextprocessor

For example:

Name: wikitextprocessor
Version: 0.4.96
Summary: Parser and expander for Wikipedia, Wiktionary etc. dump files, with Lua execution support
Home-page: https://github.com/tatuylonen/wikitextprocessor
Author:
Author-email: Tatu Ylonen <ylo@clausal.com>
License: MIT License
Location: c:\users\appdata\local\programs\python\python310\lib\site-packages
Editable project location: C:\Users\Dev\Python\WikiExtractor\wikitextprocessor
Requires: dateparser, lupa, lxml, mediawiki-langcodes, psutil, requests
Required-by: wiktextract

Show the commit hash to verify everything is up to date:

cd wiktextract
git log -1
commit b78692a725ddc06e5ce7e2cf1ab699aba54218e8 (HEAD -> master, origin/master, origin/HEAD)
Merge: 7411e9f4 0c6b7cc9
Author: xxyzz <gitpull@protonmail.com>
Date:   Wed Aug 28 13:31:26 2024 +0800

    Merge pull request #792 from xxyzz/fr

    [fr] call `parse_section()` recursively and remove "réf" template as tag data

=> commit b78692a725ddc06e5ce7e2cf1ab699aba54218e8

python -m pip freeze | grep wiktextract
-e git+https://github.com/tatuylonen/wiktextract.git@b78692a725ddc06e5ce7e2cf1ab699aba54218e8#egg=wiktextract

=> @b78692a725ddc06e5ce7e2cf1ab699aba54218e8 It's OK

cd wikitextprocessor
git log -1
commit f99c7585a16d8039f84080375f4fcc9f3244f6a5 (HEAD -> main, origin/main, origin/HEAD)
Merge: c9bbad3 3944f36
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Aug 27 00:30:47 2024 +0000

    Merge pull request #300 from tatuylonen/dependabot/github_actions/crate-ci/typos-1.24.1

=> f99c7585a16d8039f84080375f4fcc9f3244f6a5

python -m pip freeze | grep wikitextprocessor
-e git+https://github.com/tatuylonen/wikitextprocessor.git@f99c7585a16d8039f84080375f4fcc9f3244f6a5#egg=wikitextprocessor

=> @f99c7585a16d8039f84080375f4fcc9f3244f6a5 It's OK

LeMoussel commented 2 weeks ago

About process_dump and the Wikipedia namespace IDs:

process_dump(
    wtp,
    "frwiki-latest-pages-articles.xml.bz2",
    namespace_ids,  # namespace ids; they can be found at the start of the dump file
)

As noted, the namespace IDs can be found at the beginning of the frwiki-latest-pages-articles.xml file:

    <namespaces>
      <namespace key="-2" case="first-letter">Média</namespace>
      <namespace key="-1" case="first-letter">Spécial</namespace>
      <namespace key="0" case="first-letter" />
      <namespace key="1" case="first-letter">Discussion</namespace>
      <namespace key="2" case="first-letter">Utilisateur</namespace>
      <namespace key="3" case="first-letter">Discussion utilisateur</namespace>
      <namespace key="4" case="first-letter">Wikipédia</namespace>
      <namespace key="5" case="first-letter">Discussion Wikipédia</namespace>
      <namespace key="6" case="first-letter">Fichier</namespace>
      <namespace key="7" case="first-letter">Discussion fichier</namespace>
      <namespace key="8" case="first-letter">MediaWiki</namespace>
      <namespace key="9" case="first-letter">Discussion MediaWiki</namespace>
      <namespace key="10" case="first-letter">Modèle</namespace>
      <namespace key="11" case="first-letter">Discussion modèle</namespace>
      <namespace key="12" case="first-letter">Aide</namespace>
      <namespace key="13" case="first-letter">Discussion aide</namespace>
      <namespace key="14" case="first-letter">Catégorie</namespace>
      <namespace key="15" case="first-letter">Discussion catégorie</namespace>
      <namespace key="100" case="first-letter">Portail</namespace>
      <namespace key="101" case="first-letter">Discussion Portail</namespace>
      <namespace key="102" case="first-letter">Projet</namespace>
      <namespace key="103" case="first-letter">Discussion Projet</namespace>
      <namespace key="104" case="first-letter">Référence</namespace>
      <namespace key="105" case="first-letter">Discussion Référence</namespace>
      <namespace key="710" case="first-letter">TimedText</namespace>
      <namespace key="711" case="first-letter">TimedText talk</namespace>
      <namespace key="828" case="first-letter">Module</namespace>
      <namespace key="829" case="first-letter">Discussion module</namespace>
      <namespace key="2600" case="first-letter">Sujet</namespace>
    </namespaces>

I also found another method in the code:

    wtp = Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    )

    wiki_config = WiktionaryConfig()
    wiki_config.dump_file_lang_code = "fr"
    wiki_config.capture_language_codes = ["fr", "mul"]
    wxr = WiktextractContext(wtp, wiki_config)

    namespace_ids = {
        wtp.NAMESPACE_DATA.get(name, {}).get("id", 0)
        for name in wxr.config.save_ns_names
    }

which gives the following set: {0, 100, 4, 106, 10, 14, 110, 828}

What namespace ID set should I use, given that the values 110 and 106 do not exist in the dump file?

xxyzz commented 2 weeks ago

Use {0, 10, 828}; you can add other IDs if you want to process those namespaces too.

kristian-clausal commented 2 weeks ago

AFAIK, those are the pages you want to keep in the database file; so if you don't want to collect "Discussion module" pages, that namespace is left out. Modules, Modèles, and main pages. EDIT: :ninja:

LeMoussel commented 2 weeks ago
from wikitextprocessor import Wtp
from wikitextprocessor.dumpparser import process_dump

if __name__ == "__main__":
    wtp = Wtp(
        db_path="fr-wiki-latest.db",
        lang_code="fr",
        project="wikipedia",
    )

    namespace_ids = {0,10,828}

    process_dump(
        wtp,
        "frwiki-latest-pages-articles.xml.bz2",
        namespace_ids,
    )

    print(f"# Wikipedia pages collected: {wtp.saved_page_nums()}")
LeMoussel commented 2 weeks ago
....
2024-08-28 12:37:30,167 INFO:   ... 4680000 raw pages collected
2024-08-28 12:51:30,481 INFO: Analyzing which templates should be expanded before parsing
# Wikipedia pages collected: 4684570

fr-wiki-latest.db size is 23 015 948 288, the same as your database, @kristian-clausal.

Alas! After generating a new SQLite database file by calling process_dump(), I have the same error.

    wikitext= """
{{Article|auteur1=Daniel Aranda|titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française|périodique=[[Revue d'Histoire littéraire de la France]]|volume=103|éditeur=[[Presses universitaires de France]]|date=janvier-février 2003|isbn=9782130534655|doi=10.3917/rhlf.031.0111|lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm|pages=111-121|id=aranda2003|plume=oui |issn = 0035-2411 }}    """

    wiki_config = WiktionaryConfig()
    wiki_config.dump_file_lang_code = "fr"
    wiki_config.capture_language_codes = ["fr", "mul"]
    wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
        ),
        config=wiki_config,
    )

   wxr.wtp.start_page("Test page")

    wiki_nodes = wxr.wtp.parse(text=wikitext)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
    )
    print(text)
Test page: ERROR: LUA error in #invoke('Biblio', 'article') parent ('Modèle:Article', {'auteur1': 'Daniel Aranda', 'titre': 'Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française', 'périodique': "[[Revue d'Histoire littéraire de la France]]", 'volume': '103', 'éditeur': '[[Presses universitaires de France]]', 'date': 'janvier-février 2003', 'isbn': '9782130534655', 'doi': '10.3917/rhlf.031.0111', 'lire en ligne': 'http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm', 'pages': '111-121', 'id': 'aranda2003', 'plume': 'oui', 'issn': '0035-2411'}) at ['Test page', 'Article', '#invoke', '#invoke']
[string "mw_text"]:81: bad argument #1 for 'gsub' (string is not UTF-8)
Article
kristian-clausal commented 2 weeks ago

I'm sorry, besides double-checking everything, I don't know what could be the cause. I understand your frustration (I myself end up in situations like this a lot, even with wiktextract and wikitextprocessor).

I don't think these have been mentioned in the thread yet:

For me, it usually turns out to be something like this. It's the Anna Karenina principle: all working cloned repos work the same way, but every broken cloned repo is broken in its own unique way...

xxyzz commented 2 weeks ago

It's almost impossible to know the cause of the error without a traceback. I can only guess that this might be a Windows problem (the default encoding is not UTF-8); try Linux...

LeMoussel commented 2 weeks ago

I suspect this is not an encoding error, but, as you indicate, it could be a Windows problem. I will try to investigate a little more. Is it possible to enable tracebacks?

@kristian-clausal

xxyzz commented 2 weeks ago

The code can't show a Lua traceback when the error happens inside a MediaWiki Lua module, due to a Lua 5.1 API limitation.

LeMoussel commented 2 weeks ago

I think I found the reason for this error.

Test code:

    wikitext= """
    {{Article
    |titre=Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française
    |périodique=[[Revue d'Histoire littéraire de la France]]
    |date=janvier-février 2003

    |auteur1=Daniel Aranda
    |volume=103
    |éditeur=[[Presses universitaires de France]]
    |isbn=9782130534655
    |doi=10.3917/rhlf.031.0111
    |lire en ligne=http://www.cairn.info/revue-d-histoire-litteraire-de-la-france-2003-1-page-111.htm
    |pages=111-121
    |id=aranda2003
    |plume=oui
    |issn = 0035-2411
    }}
    """

    wxr.wtp.start_page("Test page")
    wiki_nodes = wxr.wtp.parse(text=wikitext)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
    )
    print(text)

produces the error.

If I replace |date=janvier-février 2003 with |date=janvier-fevrier 2003, there are no more errors, and I get the following text: Daniel Aranda, « Maurice Leblanc et la résurgence de la « série » dans la littérature romanesque française », Revue d'Histoire littéraire de la France, Presses universitaires de France, vol. 103, janvier-février 2003, p. 111-121 (ISBN 9782130534655, ISSN 0035-2411, DOI 10.3917/rhlf.031.0111, lire en ligne). [Alt: Ouvrage utilisé pour la rédaction de l'article]

Note: in that output the date is correctly formatted, with the accent: vol. 103, janvier-février 2003, p. 111-121.

Could accents be misinterpreted under Windows?

In this date field, would it be possible to replace accented characters with unaccented ones, e.g. é -> e, û -> u?
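For example, in Python (a standalone sketch of such a workaround, independent of wikitextprocessor):

import unicodedata

def strip_accents(text: str) -> str:
    # decompose (é -> e + combining accent), then drop the combining marks
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

print(strip_accents("14 décembre 2011"))  # -> 14 decembre 2011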

xxyzz commented 2 weeks ago

No idea how the hell Windows could screw up the encoding.

I have checked the dumpparser.py code and the Python, lxml, and SQLite docs, and still don't have a clue. Damn Windows.

Could you check that the text data in the SQLite db is in UTF-8 encoding, and that your Python code files are also UTF-8 encoded?

Do you have "lbzcat" or "bzcat" command installed?

Also check your terminal's encoding.

Maybe you could do us both a favor and try Linux...

xxyzz commented 2 weeks ago

Try this and see if it helps (create a new db file):

diff --git a/src/wikitextprocessor/dumpparser.py b/src/wikitextprocessor/dumpparser.py
index 232abd9..16bd340 100644
--- a/src/wikitextprocessor/dumpparser.py
+++ b/src/wikitextprocessor/dumpparser.py
@@ -25,13 +25,16 @@ def decompress_dump_file(
 ) -> Union[subprocess.Popen, bz2.BZ2File]:
     if dump_path.endswith(".bz2"):
         if shutil.which("lbzcat") is None and shutil.which("bzcat") is None:
-            return bz2.open(dump_path, "rb")
+            return bz2.open(dump_path, "rt", encoding="utf-8")

         decompress_command = (
             "lbzcat" if shutil.which("lbzcat") is not None else "bzcat"
         )
         p = subprocess.Popen(
-            [decompress_command, dump_path], stdout=subprocess.PIPE
+            [decompress_command, dump_path],
+            stdout=subprocess.PIPE,
+            text=True,
+            encoding="utf-8",
         )
         if p.stdout is not None:
             return p
kristian-clausal commented 2 weeks ago

If this turns out to be a Windows-specific encoding issue, thank you for bringing it to our attention. Hopefully xxyzz's fix will be applicable!

LeMoussel commented 2 weeks ago

Python code files are in UTF-8 encoding: yes. The text data in the SQLite db is in UTF-8 encoding: yes. I tested the encoding with the pragma PRAGMA encoding;, which returns the text encoding: UTF-8.

The command "bzcat" is installed on my system, I will test xxyzz's fix.

LeMoussel commented 2 weeks ago

With the Python bz2 fallback, using bz2.open(dump_path, "rt", encoding="utf-8"), I get this error:

  File "GenerateDB.py", line 14, in <module>
    process_dump(
  File "D:\Developpement\Python\WikiExtractor\wikitextprocessor\src\wikitextprocessor\dumpparser.py", line 122, in process_dump
    parse_dump_xml(wtp, path, namespace_ids)
  File "D:\Developpement\Python\WikiExtractor\wikitextprocessor\src\wikitextprocessor\dumpparser.py", line 54, in parse_dump_xml
    for _, page_element in etree.iterparse(
  File "src\\lxml\\iterparse.pxi", line 208, in lxml.etree.iterparse.__next__
  File "src\\lxml\\iterparse.pxi", line 193, in lxml.etree.iterparse.__next__
  File "src\\lxml\\iterparse.pxi", line 221, in lxml.etree.iterparse._read_more_events
TypeError: reading file objects must return bytes objects

Which I don't get with return bz2.open(dump_path, "rb").

xxyzz commented 2 weeks ago

Don't check PRAGMA's result; check the text data's encoding. Python's sqlite3 docs say non-UTF-8 data can still be inserted.
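For example, something like this reads the raw bytes back and checks them (a sketch, using only the pages table and columns shown earlier in this thread):

import sqlite3

conn = sqlite3.connect("fr-wiki-latest.db")
conn.text_factory = bytes  # return TEXT columns as raw bytes, skipping Python's decoding
row = conn.execute(
    "SELECT body FROM pages WHERE title = ?", ("Modèle:Article",)
).fetchone()
row[0].decode("utf-8")  # raises UnicodeDecodeError if the stored bytes are not UTF-8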

Ahh... I'm out of ideas.

LeMoussel commented 2 weeks ago

I will manually extract frwiki-latest-pages-articles.xml.bz2 and call process_dump with skip_extract_dump set to True.

xxyzz commented 2 weeks ago

Err... if you set skip_extract_dump to True, then how the .bz2 file is decompressed makes no difference; that code path is not used.

LeMoussel commented 2 weeks ago

OK. frwiki-latest-pages-articles.xml is UTF-8 encoded. I checked with the Python chardet module:

$ chardetect  frwiki-latest-pages-articles.xml
frwiki-latest-pages-articles.xml: utf-8 with confidence 0.99
LeMoussel commented 2 weeks ago

Same error... I'm out of ideas. Maybe a mistake in the Lua code?

I know a little Lua, and in some Lua modules I see the code ustring = "ustring:ustring" and local ustring = require("ustring:ustring").

I don't know this colon (:) syntax in the require argument.

Can you explain it to me?

xxyzz commented 2 weeks ago

I think you first need to confirm the encoding of the text data inserted into the SQLite db file.

LeMoussel commented 2 weeks ago

The encoding of the text data inserted into SQLite is UTF-8. I suspect there is a problem at the Python/lupa -> Lua boundary. I added a print to mw_text.trim in wikitextprocessor\src\wikitextprocessor\lua\mw_text.lua:

function mw_text.trim(s, charset)
   print(s)
   charset = charset or "\r\n\t\f "
   local ret = mw.ustring.gsub(s, "^[" .. charset .. "]*(.-)[" ..
                                  charset .. "]*$", "%1")
   return ret
end

Test-utf8.py

    wikitext = """
    {{Lien web
    |titre=Germanophobie : le retour des revanchards
    |url=http://www.slate.fr/story/47141/boches
    |date=14 decembre 2011
    |consulté le= 6 octobre 2019
    }}
    """

    wxr.wtp.start_page("Test page")
    wiki_nodes = wxr.wtp.parse(text=wikitext)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
    )
    print(text)    

Output:

14 decembre 2011
14 decembre 2011
2011
decembre
decembre
14
2011
d��cembre
Test page: ERROR: LUA error in #invoke('Biblio', 'lienWeb') parent ('Modèle:Lien web', {'titre': 'Germanophobie : le retour des revanchards', 'url': 'http://www.slate.fr/story/47141/boches', 'date': '14 decembre 2011', 'consulté le': '6 octobre 2019'}) at ['Test page', 'Lien web', '#invoke', '#invoke']
[string "mw_text"]:82: bad argument #1 for 'gsub' (string is not UTF-8)

Rhhhooo why d��cembre?

LeMoussel commented 2 weeks ago

This is not a problem at the Python/lupa -> Lua boundary, but in the Lua code itself! Without a Lua traceback, though, it's not easy to debug.

Test code:

    wxr.wtp.add_page("Modèle:test-template", 10, "{{#invoke:test|functest}}")
    wxr.wtp.add_page(
        "Module:test",
        828,
        """
        local export = {}

        -- Print contents of `tbl`, with indentation.
        -- `indent` sets the initial level of indentation.
        function tprint (tbl, indent)
            if not indent then indent = 0 end
            for k, v in pairs(tbl) do
                formatting = string.rep("  ", indent) .. k .. ": "
                if type(v) == "table" then
                    print(formatting)
                    tprint(v, indent+1)
                elseif type(v) == 'boolean' then
                    print(formatting .. tostring(v))
                else
                    print(formatting .. v)
                end
            end
        end

        function export.functest(frame)
            local args = frame:getParent().args
            tprint(args)
            return tostring(frame.args[0])
        end

        return export
        """,
        )
    wikitext = """
    {{test-template
    |titre=Germanophobie : le retour des revanchards
    |url=http://www.slate.fr/story/47141/boches
    |date=14 decembre 2011
    |consulté le= 6 octobre 2019
    }}
    """
    wxr.wtp.start_page("")
    expanded = wxr.wtp.expand(wikitext)
    print(expanded)

Output:

date: 14 decembre 2011
url: http://www.slate.fr/story/47141/boches
titre: Germanophobie : le retour des revanchards
consulté le: 6 octobre 2019
xxyzz commented 2 weeks ago

You could edit the called Lua module pages in the SQLite db and add some prints to find where the error happens. Maybe somewhere the code calls one of our Lua functions that doesn't handle the encoding properly.

LeMoussel commented 1 week ago

I think I found the reason for this error. Module:Biblio/Lien web, line 268: local dateFormatee = Commun.inscriptionDate( args )

Module:Biblio/Commun, line 488:

if date then
   date = date:lower()
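   -- [note added for this writeup, not in the original module: date:lower()
   -- is Lua's byte-wise, locale-dependent string.lower; the Unicode-safe
   -- MediaWiki equivalent would be date = mw.ustring.lower( date )]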

14 décembre 2011 to lower -> 14 d��cembre 2011

Paf! mw.ustring.match doesn't handle the encoding properly: Module:Biblio/Commun line 498 is what then causes the error bad argument #1 for 'gsub' (string is not UTF-8). string.match, on the contrary, handles the encoding properly. Wikipedia should also have this error and return the date value as-is.
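To make the mechanism concrete, here is a standalone Python sketch (illustration only, not wikitextprocessor code). Lua 5.1's string.lower applies the C tolower() byte by byte and is locale-dependent: in the default C locale, bytes >= 0x80 pass through untouched, so UTF-8 survives (as on Linux), while a Latin-1-style Windows locale remaps them and breaks the multi-byte sequence:

# Simulate a byte-wise, Latin-1-locale tolower() and show that it breaks UTF-8.
# "é" is the two bytes C3 A9; tolower(0xC3, 'Ã') gives 0xE3, 'ã'.
raw = "14 décembre 2011".encode("utf-8")   # b'14 d\xc3\xa9cembre 2011'
mangled = bytes(
    b + 0x20 if 0xC0 <= b <= 0xDE and b != 0xD7 else b  # Latin-1 upper -> lower
    for b in raw
)
print(mangled)                              # b'14 d\xe3\xa9cembre 2011'
print(mangled.decode("utf-8", "replace"))   # 14 d�cembre 2011 (no longer valid UTF-8)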

Would it be possible to do the same with wikitextprocessor? How can Lua errors be caught?

POC

    wiki_config = WiktionaryConfig()
    wiki_config.dump_file_lang_code = "fr"
    wiki_config.capture_language_codes = ["fr", "mul"]
    wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
        ),
        config=wiki_config,
    )

    wxr.wtp.add_page("Modèle:test-template", 10, "{{#invoke:test|functest}}")
    wxr.wtp.add_page(
        "Module:test",
        828,
        """
        local export = {}

        local Lien_web = require( 'Module:Biblio/Lien web' )
        local Commun = require( 'Module:Biblio/Commun' )

        -- Print contents of `tbl`, with indentation.
        -- `indent` sets the initial level of indentation.
        function tprint (tbl, indent)
            if not indent then indent = 0 end
            for k, v in pairs(tbl) do
                formatting = string.rep("  ", indent) .. k .. ": "
                if type(v) == "table" then
                    print(formatting)
                    tprint(v, indent+1)
                elseif type(v) == 'boolean' then
                    print(formatting .. tostring(v))
                else
                    print(formatting .. v)
                end
            end
        end

        function export.functest(frame)
            local args = frame:getParent().args
            tprint(args)

            local date = Commun.validTextArg( args, 'date' )
            date = string.lower(date)
            --local mois, jour, annee =  mw.ustring.match( date, '^([%a]+)%s*(%d%d?)[,%s]+(%d+)$' ) -- ERROR
            local mois, jour, annee =  string.match( date, '^([%a]+)%s*(%d%d?)[,%s]+(%d+)$' ) -- NO ERROR
        end

        return export
        """,
        )
    wikitext = """
    {{test-template
    |titre=Germanophobie : le retour des revanchards
    |url=http://www.slate.fr/story/47141/boches
    |date=14 décembre 2011
    |consulté le= 6 octobre 2019
    }}
    """
    wxr.wtp.start_page("")
    expanded = wxr.wtp.expand(wikitext)
xxyzz commented 1 week ago

We use the same mw.ustring code from Scribunto; the problem is that Lua's string.lower() can't process Unicode strings. I think you have to use Linux...

You could try manually changing all date:lower() calls to date:ulower() in the Lua code; maybe that would fix this error, but you will hit more similar errors elsewhere.
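If you want to try that, something like this might work (an untested sketch; table and column names follow the queries earlier in this thread, and back up the .db file first):

import sqlite3

conn = sqlite3.connect("fr-wiki-latest.db")
with conn:  # commits on success
    conn.execute(
        "UPDATE pages SET body = replace(body, 'date:lower()', 'date:ulower()') "
        "WHERE title LIKE 'Module:Biblio%' AND body LIKE '%date:lower()%'"
    )
conn.close()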

kristian-clausal commented 1 week ago

If Lua unicode manipulation is broken on Windows, that's a problem.

We do a lot of code manipulation (or did, at least) on Lua code to make it more compatible. Some of it is string manipulation; sometimes we replace Lua functions with our own. The problem is always that it's hard to get it all perfectly correct so that nothing breaks: if we replace string.lower() with our own in Python, can we guarantee that it returns a correct value?

xxyzz commented 1 week ago

That's not a problem. The Lua string library can't handle Unicode; that's why they use the ustring library. If we replace it, we'll be incompatible with MediaWiki.

kristian-clausal commented 1 week ago

Then why does string.lower() return correct Unicode on Linux, on our machines, and on fr.wiktionary.org? date seems to be a normal string.

xxyzz commented 1 week ago

IDK why string.lower() behaves like ustring.lower() on Linux; all Wikimedia servers run on Linux, so this problem is not noticeable anywhere.

kristian-clausal commented 1 week ago

Do you think it would be possible to replace string.lower() (and other string methods) with our own? We do replacements for Scribunto-specific libraries, but I don't remember and can't quickly find any replacements for basic Lua standard library stuff. We could put it behind a toggle, like --use-unicode-strings.

EDIT: Nevermind, it was in _sandbox_phase1.lua, we replace .gsub with our own.

EDIT: Double nevermind, string.gsub is saved into _orig_gsub for some reason, and never used?

xxyzz commented 1 week ago

I don't recommend wasting more time on this... The whole string library is not meant to handle Unicode characters; returning Unicode characters will cause more problems.

kristian-clausal commented 1 week ago

Tatu said that getting wikitextprocessor/wiktextract working on Windows is a low priority (also considering that multiprocessing doesn't work on Windows), so I guess I'll be closing this issue, unfortunately.

xxyzz commented 1 week ago

Multiprocessing works on Windows now. The problem in this issue is that the French Wikipedia Lua module uses the wrong API.

kristian-clausal commented 1 week ago

@LeMoussel if you want to try to figure something out regarding this specific error, take a look at the code in src/wikitextprocessor/luaexec.py and src/wikitextprocessor/lua/_sandbox_phase1.lua and _sandbox_phase2.lua. It might be possible to make a function wrapper around gsub (__orig_gsub being called inside a wrapper function) so that the string is converted back to UTF-8 before being fed to the original gsub. There's a bunch of these kinds of functions and wrappers (I think; they might have been removed at some point) that you can make in Python and pass into the Lua code. This is just a stop-gap measure, however, and it would be pretty messy.

xxyzz commented 1 week ago

That's not good advice... I don't think you can convert them back to UTF-8, because the string bytes have already been changed by string.lower.

The correct action is to fix the wrong Lua code on Wikipedia.

LeMoussel commented 1 week ago

I switched to Linux. No errors.