tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Checklist-1 for existing errors. #226

Closed: LeMoussel closed this issue 7 months ago

LeMoussel commented 8 months ago

Attached CSV file: wiki_errors.csv, listing, by Wikipedia article title, errors other than those reported in issues #225, #224, #223, #220 & #216:

In summary, there are the following errors:

LeMoussel commented 8 months ago

For the unimplemented parser function PROTECTIONLEVEL, I suggest this correction in parserfns.py:

....
    "PROTECTIONLEVEL": protectionlevel_fn,  # was: unimplemented_fn
....

def protectionlevel_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the PROTECTIONLEVEL magic word.

    Returns an empty string to indicate that the page is not protected.
    """
    return ""
kristian-clausal commented 8 months ago

If you don't want to make a pull request, I'll implement that. It seems the most sensible approach here, yeah.

LeMoussel commented 8 months ago

I don't know GitHub well enough to make a pull request. I'll let you do it. Thanks.

kristian-clausal commented 8 months ago

Pushed a commit to the #filepath PR (might as well lump them together).

kristian-clausal commented 8 months ago

@LeMoussel I've now gone through and committed fixes for most issues. The CSV surfaced two new issues: mw.ext.data.get not being implemented (it's an extension that isn't used on Wiktionary) and getBadges not being implemented (it's a new function introduced in 2022~2023).

PROTECTIONLEVEL and #property should also be handled now. If you could check all of these issues and see whether they work on your end, that would be grand.

LeMoussel commented 8 months ago

First of all, let me congratulate you both. You're doing a hell of a job! THANKS.

So I updated both packages and ran an analysis process on the first 1,000 articles in the database. Attached is a CSV file of the errors and/or warnings encountered: wiki_errors.csv

The number of errors/warnings by wording is summarized below:

2024-03-06 14:58:03 ERROR    1: LUA error in #invoke('Mapframe', 'main')
2024-03-06 14:58:03 ERROR    1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-06 14:58:03 ERROR    2: LUA error in #invoke('Excerpt', 'main', ' only = \U00102195', ' files = ', ' lists = ', ' templates = ', ' paragraphs = ', ' references = ', ' subsections = ', ' bold = ', ' more = ', ' hat = ', ' this = ', ' quote = ', ' inline = ')
2024-03-06 14:58:03 ERROR    2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-06 14:58:03 ERROR    16: LUA error in #invoke('Durée', 'duree')
2024-03-06 14:58:03 ERROR    10: LUA error in #invoke('Graph', 'chartWrapper')
2024-03-06 14:58:03 ERROR    13: LUA error in #invoke('Durée', 'duree', 'en année=1')
2024-03-06 14:58:03 ERROR    2: TimeOut
2024-03-06 14:58:03 WARNING  51: #tag creating non-allowed tag <maplink> - omitted
2024-03-06 14:58:03 WARNING  19: #tag creating non-allowed tag <poem> - omitted
2024-03-06 14:58:03 WARNING  10: #tag creating non-allowed tag <mapframe> - omitted
2024-03-06 14:58:03 WARNING  6: #tag creating non-allowed tag <graph> - omitted
2024-03-06 14:58:03 WARNING  1: invalid attribute format '' missing name
2024-03-06 14:58:03 WARNING  2: #tag creating non-allowed tag <timeline> - omitted

The previous errors all seem to be resolved. Good job!

LeMoussel commented 8 months ago

The TimeOut error occurs when analyzing an article takes longer than 30 seconds. This affects the articles Dreamcast & Écriture hiéroglyphique égyptienne. It should be noted that these articles are quite large. I don't know whether this long processing time (> 30 s) is normal.

LeMoussel commented 8 months ago

There seems to be a regression in the generated text: some texts now contain class=noviewer. Unless I'm mistaken, this was not present before. For example, it appears in Algèbre générale, Algèbre linéaire, and Arc de triomphe de l'Étoile.

I found why. See https://github.com/tatuylonen/wikitextprocessor/issues/225#issuecomment-1985213379

LeMoussel commented 8 months ago

What is surprising is that there are quite a few errors about the unrecognized parser function '#invoque'. Is this a typo (invoque vs. invoke) in the articles?

kristian-clausal commented 8 months ago

If the article doesn't look right, then it's probably a typo, but it is also very possible that French Wikipedia just... accepts invoque. I'm going to take a look; hopefully it's the former.

kristian-clausal commented 8 months ago

Please give data on the #invoque errors; there are none in the new CSV.

LeMoussel commented 8 months ago

invoque.csv

invoque-utf8.csv

kristian-clausal commented 8 months ago

Please encode your file in UTF-8 for maximum portability.

kristian-clausal commented 8 months ago

Just to address most of the stuff in the previous CSV (not the invoque stuff): any error about #tag creating non-allowed tag has already been 'resolved' in https://github.com/tatuylonen/wikitextprocessor/issues/209

You need to supply extension tag data (just copy, paste, and adapt the stuff seen in that thread) when you process the dump. These are extensions, not core Wikimedia stuff, so we can't 100% predict what will turn up in a tag; that's why I added a (maybe just temporary) parameter, extension_tags, that takes extension tag information and enables you to parse things like <maplink>, <poem>, <graph>, etc. Relevant code here: https://github.com/tatuylonen/wikitextprocessor/issues/209#issuecomment-1961082816

The 'HTML-like' tags will then be parsed, and you can handle them further (for example, you can ignore them) by using a node_handler function passed into other functions.
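For reference, a minimal sketch of that setup, using the tag definitions shown in the linked thread (and repeated in a later comment here); the "parents"/"content" nesting data is a guess for these extensions, not canonical, and db_path is just the value used elsewhere in this discussion:

from wikitextprocessor import Wtp

# Extension tag definitions as in issue #209; adjust per tag as needed.
extension_tags = {
    "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
    "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
    "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
}

# Passing extension_tags lets the parser treat these as known tags instead
# of emitting "#tag creating non-allowed tag" warnings.
wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
    extension_tags=extension_tags,
)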

kristian-clausal commented 8 months ago

The issue with #invoque was the more annoying one (i.e. we have to do some work): fr.wikipedia obviously allows it as an alias for #invoke. I made a PR with some simple changes; it shouldn't break anything (it's just a smidge less efficient because we're using "something in set_of_strings" instead of "something == string"), so I'll probably merge it toot sweet.

In this case, you need to either initialize Wtp with invoke_aliases, a set of strings that stand for aliases of #invoke, or you can modify Wtp.invoke_aliases like I did for the test in the pull request. Wtp.invoke_aliases is a set of strings, so you can use Wtp.invoke_aliases = Wtp.invoke_aliases | {"#invoque"} (with the #) to modify it, replacing Wtp with the name that is appropriate in the context (ctx.wtp in wiktextract, for example).
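Both options, sketched; the constructor keyword follows the description above, so treat the exact signature as an assumption:

from wikitextprocessor import Wtp

# Option 1: pass the alias set at initialization, as described above.
wtp = Wtp(lang_code="fr", project="wikipedia", invoke_aliases={"#invoque"})

# Option 2: extend the set on an existing instance (ctx.wtp in wiktextract).
wtp.invoke_aliases = wtp.invoke_aliases | {"#invoque"}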

LeMoussel commented 8 months ago

For LOCALYEAR, LOCALMONTH, LOCALDAY2 & LOCALHOUR I found this: they return the "local" date and time (Central European CET/CEST on the French-language Wikipedia).

For these, I offer the following code, Test_fn.sh:

#!/usr/bin/env python

from collections.abc import Callable
from datetime import datetime, timezone

def localyear_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALYEAR magic word."""
    utc_dt = datetime.now(timezone.utc)
    return str(utc_dt.astimezone().year)

def localmonth_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALMONTH magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%m")

def localday2_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALDAY2 magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%d")

def localhour_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALHOUR magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%H")

#
# FR Wikipedia:Sandbox https://fr.wikipedia.org/w/index.php?title=Aide:Bac_%C3%A0_sable&veaction=edit
#
# {{LOCALYEAR}} -> 2024
print(f"LOCALYEAR: {localyear_fn(None, None, None, None)}")
# {{LOCALMONTH}} -> 03
print(f"LOCALMONTH: {localmonth_fn(None, None, None, None)}")
# {{LOCALDAY2}} -> 07
print(f"LOCALDAY2: {localday2_fn(None, None, None, None)}")
# {{LOCALHOUR}} -> 10
print(f"LOCALHOUR: {localhour_fn(None, None, None, None)}")
kristian-clausal commented 8 months ago

On a minor note, do not save Python files as .sh; that's bound to cause problems down the line! I'm copying these implementations (and adding a localday_fn, where the syntax is "%-d") and naming you co-author; these look good enough.
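A sketch of that localday_fn, following the pattern of the functions above and reusing the imports from Test_fn.sh (note that "%-d", the day without a leading zero, is a glibc extension and is not available in Windows strftime):

def localday_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALDAY magic word."""
    utc_dt = datetime.now(timezone.utc)
    # "%-d" strips the leading zero (platform-dependent glibc extension).
    return utc_dt.astimezone().strftime("%-d")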

LeMoussel commented 8 months ago

OK. Next time, I'll create a pull request. That will make it easier for you.

LeMoussel commented 8 months ago

My first PR :) Let me know if I did it right.

kristian-clausal commented 8 months ago

It seems correct, thank you for your contribution!

LeMoussel commented 8 months ago

https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1983125097: the #invoque issue is corrected. Test code:

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wxr = WiktextractContext(wtp, WiktionaryConfig())

wiki_page_title = "Chobits"

wiki_page = wxr.wtp.get_page(wiki_page_title)

wxr.wtp.start_page(wiki_page.title)
wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}

wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
text = clean_node(
    wxr=wxr,
    sense_data={},
    wikinode=wiki_nodes,
    collect_links=False,
    node_handler_fn=clean_node_handler,  # handlers defined in a later comment
    template_fn=template_handler,
)
LeMoussel commented 8 months ago

https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1983077338: the issue about #tag creating non-allowed tag is corrected.

LeMoussel commented 8 months ago

With all the corrections made, I carried out a new analysis of 1,000 Wikipedia pages. Here are the remaining errors/warnings:

2024-03-08 11:12:13 ERROR    29: LUA error in #invoke('Durée', 'duree')
2024-03-08 11:12:13 ERROR    2: LUA error in #invoke('Excerpt', 'main')
2024-03-08 11:12:13 ERROR    2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-08 11:12:13 ERROR    1: LUA error in #invoke('Mapframe', 'main')
2024-03-08 11:12:13 ERROR    1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-08 11:12:13 ERROR    1: TimeOut
2024-03-08 11:12:13 WARNING  1: invalid attribute format '' missing name

Only 9 articles out of 1,000 have errors/warnings. :thumbsup: :ok_hand: Attached: wiki_errors.csv listing all errors/warnings.

kristian-clausal commented 8 months ago

That is excellent; thank you for checking for these errors, it really helps! I'll take a look at these next week. Currently I'm trying to figure out whether an error we're getting on the Wiktextract side is because of us or because of changes made (and reverted) on Wikimedia's side of things...

LeMoussel commented 8 months ago

I'm going to create another issue (https://github.com/tatuylonen/wikitextprocessor/issues/243), not related to errors/warnings, because I noticed the presence of spurious text in certain texts.

kristian-clausal commented 8 months ago

Hi, could you take a look at the current situation? We've made a lot of small changes, so many of these errors are probably affected.

LeMoussel commented 8 months ago

Here is the situation for the 1,000-page analysis:

2024-03-13 15:00:51 INFO     Traitement parallèle fonctionnant sur 12 CPU cores
2024-03-13 15:01:02 INFO     Page Wikipedia traitées: 100
2024-03-13 15:01:11 INFO     Page Wikipedia traitées: 200
2024-03-13 15:01:21 INFO     Page Wikipedia traitées: 300
2024-03-13 15:01:30 INFO     Page Wikipedia traitées: 400
2024-03-13 15:01:39 INFO     Page Wikipedia traitées: 500
2024-03-13 15:01:44 ERROR    'Choisy-le-Roi' -> 1 ERR
2024-03-13 15:01:44 ERROR    'Créteil' -> 1 ERR
2024-03-13 15:01:45 ERROR    'Droit' -> 2 ERR
2024-03-13 15:01:48 INFO     Page Wikipedia traitées: 600
2024-03-13 15:01:57 ERROR    'Ford' -> 2 ERR
2024-03-13 15:01:58 INFO     Page Wikipedia traitées: 700
2024-03-13 15:01:59 ERROR    'Fonds monétaire international' -> 16 ERR
2024-03-13 15:02:01 ERROR    'Élection présidentielle française de 1965' -> 6 ERR
2024-03-13 15:02:01 ERROR    'Élection présidentielle française de 1969' -> 7 ERR
2024-03-13 15:02:02 WARNING  'Festival de Cannes' -> 1 WARN
2024-03-13 15:02:08 INFO     Page Wikipedia traitées: 800
2024-03-13 15:02:18 INFO     Page Wikipedia traitées: 900
2024-03-13 15:02:49 INFO     Traitement terminé !

=> Of these 1,000 pages, 1 page produced a warning (WARN) and 7 pages produced errors (ERR).

Summary of error types:

2024-03-13 15:10:03 ERROR    29: LUA error in #invoke('Durée', 'duree')
2024-03-13 15:10:03 ERROR    2: LUA error in #invoke('Excerpt', 'main')
2024-03-13 15:10:03 ERROR    2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-13 15:10:03 ERROR    2: TimeOut
2024-03-13 15:10:03 ERROR    1: LUA error in #invoke('Mapframe', 'main')
2024-03-13 15:10:03 ERROR    1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-13 15:10:03 WARNING  1: invalid attribute format '' missing name

wiki_errors.csv

LeMoussel commented 8 months ago

Test invoke 'Durée'

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig

class TestDurée(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_durée(self):
        self.wxr.wtp.start_page("Test Durée")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Dur%C3%A9e
        tree = self.wxr.wtp.parse(text="{{Durée|13|3|2024}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)

[ ] Test KO with error Test Durée: ERROR: LUA error in #invoke('Durée', 'duree') parent ('Modèle:Durée', {1: '13', 2: '3', 3: '2024'}) at ['Test Durée', 'Durée', '#invoke', '#invoke'] [string "Durée"]:67: attempt to perform arithmetic on a string value

LeMoussel commented 8 months ago

Test invoke 'Titulaires'

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig

class TestTitulaires(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_titulaires(self):
        self.wxr.wtp.start_page("Test Titulaires")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Titulaires
        tree = self.wxr.wtp.parse(text="{{Titulaires}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)

[ ] Test KO with error Test Titulaires: ERROR: LUA error in #invoke('Titulaires', 'tableauDesTitulaires') parent ('Modèle:Titulaires', {}) at ['Test Titulaires', 'Titulaires', '#invoke', '#invoke'] [string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément.

LeMoussel commented 8 months ago

Test invoke 'Mapframe'

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig

class TestMapframe(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_mapframe(self):
        self.wxr.wtp.start_page("Test Mapframe")
        tree = self.wxr.wtp.parse(text="{{#invoke:Mapframe|main}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)

[ ] Test KO with error Test Mapframe: ERROR: LUA error in #invoke('Mapframe', 'main') parent None at ['Test Mapframe', '#invoke', '#invoke'] [string "Mapframe"]:997: attempt to index local 'parent' (a nil value)

LeMoussel commented 8 months ago

Test invoke 'Jumelages'

from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig

class TestJumelages(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_jumelages(self):
        self.wxr.wtp.start_page("Test Jumelages")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Jumelages
        tree = self.wxr.wtp.parse(text="{{Jumelages}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)

[ ] Test KO with error Test Jumelages: ERROR: LUA error in #invoke('Jumelages', 'tableauDesJumelages') parent ('Modèle:Jumelages', {}) at ['Test Jumelages', 'Jumelages', '#invoke', '#invoke'] [string "Jumelages"]:207: Pas d'entité Wikidata pour l'élément.

kristian-clausal commented 8 months ago

An update about what I'm currently doing regarding the error in Durée: there's an XXX comment in our implementation of formatDate that says we need to actually do the formatting. So formatDate returns a 'wrongly' formatted string, Lua's automatic string->number casting can't handle it, and the ...formatDate()/3600 evaluation fails. We just need to complete the formatDate implementation, which can either be pretty simple or a real pain. We'll have to see...

kristian-clausal commented 8 months ago

I will make a PR with an implementation of mw.language:formatDate now, which should fix the issue with 'Durée'.

I spent much, much too much time trying to roll my own implementation... and then I found out that timefn, the parser function for {{#time: ...}}, already had all the needed code!!

kristian-clausal commented 7 months ago

Test invoke 'Mapframe'

[ ] Test KO with error Test Mapframe: ERROR: LUA error in #invoke('Mapframe', 'main') parent None at ['Test Mapframe', '#invoke', '#invoke'] [string "Mapframe"]:997: attempt to index local 'parent' (a nil value)

The issue here is that you're calling Mapframe|main directly. Wiki code uses 'frames', object layers that refer to things like the article, a template call, or a module call, and frames can have parent frames that originally called them. In this case, from our code's perspective (not wiki's, though that also fails, for the same reason ours fails at the next step), there is no parent, so frame:getParent() returns a nil object that causes the error here.

Using the template {{Maplink}} means we don't get this specific error at line 997, but another one, probably related to missing map data:

[string "Mapframe"]:791: attempt to concatenate a nil value

791:

        attribs.text = '&#x202F;' .. util.getParameterValue(args, 'text') or '&#x202F;' .. L10n.defaults.text

which fails when util.getParameterValue returns nil for args.text. I think there's a bug in this code: if util.getParameterValue can return nil, then applying the .. concatenation operator to it doesn't fall through to the other option of the or; it raises an error, because .. binds more tightly than or. I think this should probably be something like

        attribs.text = '&#x202F;' .. (util.getParameterValue(args, 'text') or L10n.defaults.text)

I tested this change, and now there are no more Lua errors.

kristian-clausal commented 7 months ago

The above bug on line 791 seems to differ from the English original. I've left a message on the module talk page; I've given up on editing wikis myself, so let the dice fall where they may.

Fun fact: did you know that Caesar's famed line when he crossed the Rubicon, 'alea iacta est', was him quoting a play? It was a pop-culture reference.

kristian-clausal commented 7 months ago

The errors related to Jumelages and other "can't find args.wikidata" stuff were actually complicated by my misreading of what was happening.

For example, I tried out Créteil, and it works on this end now.

The issue is that you are testing these templates and modules out of context, so they're lacking parameters. I was convinced getParent() -> args.wikidata had to refer to the main article, but no: as xxyzz pointed out, these parents were just the frames of the template above the module level, which should have been called with a |wikidata=...| argument. There was no default page meta value being accessed from the page's frame...

Please check these errors again on your end. Remember to enable extension tags, too. Please don't call modules or templates completely without context: if a template has arguments, try to find an example and use that. (Also, when starting the page, remember that the page title is an actual variable that modules, templates, and the page itself can access, so it needs to be 'appropriate' for the context in case it is needed somewhere.)

LeMoussel commented 7 months ago

Python code to test

import re
from typing import Optional

# https://github.com/tatuylonen/wikitextprocessor/
from wikitextprocessor import (
    Wtp,
    NodeKind,
    wikidata,
    WikiNode,
)

# https://github.com/tatuylonen/wiktextract
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node

def clean_node_handler(node) -> Optional[str]:
    if node.kind == NodeKind.TEMPLATE:
        if node.largs[0][0] in [
            "Semi-protection",
            "Semi-protection longue",
            "Confusion",
            "coord",
            "Portail",
            "Voir homonymes",
        ]:
            return ""
        if re.match(r"\s*Infobox\s*", node.largs[0][0], re.I):
            return ""
        if re.match(r"\s*Article\s*", node.largs[0][0], re.I):
            return ""
        if re.match(r"\s*Référence\s*", node.largs[0][0], re.I):
            return ""

    if node.kind == NodeKind.LEVEL2:
        if node.largs[0][0] in ["Annexes", "Notes et références", "Voir aussi"]:
            return ""

    if node.kind == NodeKind.LINK:
        if re.match(r"\s*Fichier\s*:", node.largs[0][0], re.I):
            return ""

    return None

def template_handler(name, args_ht):
    if name == "Méta bandeau de note":
        if len(args_ht) > 0:
            if "icône" in args_ht:
                args_ht["icône"] = ""
    return None

if __name__ == "__main__":
    extension_tags = {
        "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
        "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
        "gallery": {"parents": ["phrasing"], "content": ["phrasing"]},
        "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
        "mapframe": {"parents": ["phrasing"], "content": ["phrasing"]},
        "timeline": {"parents": ["phrasing"], "content": ["phrasing"]},
    }
    wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
            extension_tags=extension_tags,
        ),
        config=WiktionaryConfig(),
    )

    wiki_page_title = "Créteil"

    wiki_page = wxr.wtp.get_page(wiki_page_title)

    wxr.wtp.start_page(wiki_page.title)
    wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}

    info_log = f"Analyse: '{wiki_page_title}'\n"

    wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
        collect_links=False,
        node_handler_fn=clean_node_handler,
        template_fn=template_handler,
    )

    if len(wxr.wtp.errors) > 0:
        info_log += f"# Erreurs: {len(wxr.wtp.errors)}\n"
    if len(wxr.wtp.warnings) > 0:
        info_log += f"# Warnings: {len(wxr.wtp.warnings)}"

    print(info_log)

This is after updating the Git repositories.

kristian-clausal commented 7 months ago

I will be away next week, but I will continue to look at things after that.

I made a mistake with the Jumelages stuff earlier; I thought it was fixed because I wasn't getting any error messages... The issue was that I had redirected the test output into a file (because I was also printing the output text) and then completely forgot I'd done so, so I missed the errors. As you are aware, the issues with these articles persist, and unfortunately it's a big thing to fix, because we need to implement a lot of the Wikibase extension. I've found reading the code a bit difficult: I can't even figure out which of the several (there are MANY) getEntity functions is the one I should be concerned with here. Additionally, we'd need to create a lot of Lua code in between to replicate method functions for some kind of special return-value table you get from getEntity (hopefully it's pretty much the same as the data returned from Wikidata), etc., etc. It's going to be a mess.

LeMoussel commented 7 months ago

OK. If I can help (knowing that I have little understanding of the Wikimedia template architecture), don't hesitate to ask.

xxyzz commented 7 months ago

I think it'll be quite difficult to implement mw.wikibase.getEntity with our current SQLite cache approach, because it returns nested Wikidata property tables. Even if we implemented this API, the template "Titulaires" produces a table, so the converted text would be empty or some combined garbage.

@LeMoussel I'd suggest you take a moment to read the docs for Beautiful Soup or lxml in the meantime. I believe your goal (getting the whole page text, without caring about the HTML/wikitext structure) could be achieved by using an HTML parser on the HTML dump file. For example, you could use get_text() or XPath.
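A minimal sketch of that approach with Beautiful Soup (the file name is hypothetical; any page extracted from the HTML dump would do):

from bs4 import BeautifulSoup

# Parse one page extracted from the HTML dump (hypothetical file name).
with open("Créteil.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")

# get_text() flattens the whole document to plain text, ignoring structure.
text = soup.get_text(separator=" ", strip=True)
print(text[:500])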

kristian-clausal commented 7 months ago

If we query wikidata 'through' wikibase and cache each response separately, could that work? Basically a separate system for mw.wikibase stuff... Ugh. I'm annoyed just thinking about it. More and more stuff... But it would be really nice to get something working. Might be infeasible.

xxyzz commented 7 months ago

The difficulty is the data structure. Wikidata's data are stored in an RDF (graph) database, and it's awkward to save that data structure in SQL tables (imagine tracking which data owns which property: it's a many-to-many relationship). We'd also have to use an RDF database to re-implement Wikidata's database. IMO, it's impractical...

kristian-clausal commented 7 months ago

If we save each result of a query as a copy keyed by that query, without caring about making interconnections and just "flattening" it, using the Lua table we get as a result and ignoring all of the database stuff, how about then? Caching the stuff you get from mw.wikibase.getEntity and other functions and methods.
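A rough sketch of that flattened-cache idea, fetching entity JSON from Wikidata's public Special:EntityData endpoint; the cache table and function are hypothetical, not part of wikitextprocessor:

import json
import sqlite3
import urllib.request

def get_entity_cached(conn: sqlite3.Connection, entity_id: str) -> dict:
    """Fetch one Wikidata entity as JSON, caching the raw response."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wikidata_cache (id TEXT PRIMARY KEY, body TEXT)"
    )
    row = conn.execute(
        "SELECT body FROM wikidata_cache WHERE id = ?", (entity_id,)
    ).fetchone()
    if row is None:
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{entity_id}.json"
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode()
        conn.execute(
            "INSERT INTO wikidata_cache VALUES (?, ?)", (entity_id, body)
        )
        conn.commit()
    else:
        body = row[0]
    # Each response is stored whole ("flattened"); no cross-entity links.
    return json.loads(body)["entities"][entity_id]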

xxyzz commented 7 months ago

I don't think there is a MediaWiki API that could run arbitrary Lua code, and the API could only return JSON data, not Lua objects. We would have to convert the data back into a nested Lua table if such an API existed. Creating this nested Lua property table from Wikidata query results is also difficult. IMO, the benefit of implementing mw.wikibase.getEntity is too small (Wiktionary doesn't use it) and it requires too much effort...
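For what it's worth, Lupa can at least do the raw JSON-to-Lua-table conversion; a sketch, where the sample entity fragment is made up and recursive=True is the same Lupa 2.x keyword that appears in the traceback further down:

import json
from lupa import LuaRuntime

lua = LuaRuntime()
# Made-up fragment of entity JSON, standing in for a real API response.
entity = json.loads('{"id": "Q42", "claims": {"P31": [{"rank": "normal"}]}}')

# recursive=True (Lupa >= 2.0) converts nested dicts/lists into Lua tables.
lua_entity = lua.table_from(entity, recursive=True)
print(lua_entity.claims.P31[1].rank)  # Lua tables are 1-indexed -> "normal"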

I should take a look at how MediaWiki implements this API; last time I read the Wikidata extension code, I had a hard time finding the code that actually implements it...

xxyzz commented 7 months ago

#260 and #262 should fix the Lua errors in the page Ford.

LeMoussel commented 7 months ago

After updating the wikitextprocessor & wiktextract Git repositories:

wiktextract

dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ git show
commit 45680ad3527cef40cebb53044a02a2b1d4c0463e (HEAD -> master, origin/master, origin/HEAD)
Merge: cdbac033 834d2e68
Author: xxyzz <gitpull@protonmail.com>
Date:   Wed Apr 3 16:03:05 2024 +0800

    Merge pull request #568 from xxyzz/de

    Improve de edition extract translation code

wikitextprocessor

dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git show
commit f732746175cfbd3c2916428636f7e40b02a74219 (HEAD -> main, origin/main, origin/HEAD)
Merge: 07538d6 14dda0a
Author: xxyzz <gitpull@protonmail.com>
Date:   Wed Apr 3 17:00:33 2024 +0800

    Merge pull request #264 from xxyzz/number_of_articles

    Implement "NUMBEROFPAGES" and "NUMBEROFARTICLES" magic words

I got this error:

Traceback (most recent call last):
  File "/home/dev/Python/WikiExtractor/./testPage.py", line 202, in <module>
    text = clean_node(
  File "/home/dev/Python/WikiExtractor/wiktextract/src/wiktextract/page.py", line 361, in clean_node
    v = wxr.wtp.node_to_html(
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1997, in node_to_html
    return to_html(
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/node_expand.py", line 221, in to_html
    expanded = ctx.expand(
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1713, in expand
    expanded = expand_recurse(encoded, parent, not pre_expand)
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1643, in expand_recurse
    t = expand_recurse(
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1501, in expand_recurse
    ret = expand_parserfn(
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1425, in expand_parserfn
    ret = invoke_fn(args, expander, parent)
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1310, in invoke_fn
    ret = call_lua_sandbox(self, invoke_args, expander, parent, timeout)
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/luaexec.py", line 395, in call_lua_sandbox
    initialize_lua(ctx)  # This sets ctx.lua
  File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/luaexec.py", line 334, in initialize_lua
    lua.table_from(copy.deepcopy(ctx.NAMESPACE_DATA), recursive=True),  # type: ignore
TypeError: table_from() got an unexpected keyword argument 'recursive'
xxyzz commented 7 months ago

You need to update the Lupa package.

kristian-clausal commented 7 months ago

Updating the Lupa package doesn't work. This needs better instructions; upgrading through pip didn't seem to work. Reinstalling wikitextprocessor doesn't work, because pip already has 2.1...

LeMoussel commented 7 months ago

In line with @kristian-clausal's comment, here is the Lupa package version on my system:

dev@dev-B550M-DS3H:~/Python/WikiExtractor$ pip show lupa
Name: lupa
Version: 2.1
Summary: Python wrapper around Lua and LuaJIT
Home-page: https://github.com/scoder/lupa
Author: Stefan Behnel
Author-email: stefan_ml@behnel.de
License: MIT style
Location: /home/dev/.local/lib/python3.10/site-packages
Requires: 
Required-by: wikitextprocessor