Closed: LeMoussel closed this issue 7 months ago.
For the unimplemented parser function PROTECTIONLEVEL, I suggest this correction in parserfns.py:

....
"PROTECTIONLEVEL": protectionlevel_fn,  # unimplemented_fn,
....

def protectionlevel_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the PROTECTIONLEVEL magic word.

    Returns an empty string to indicate that the page is not protected.
    """
    return ""
It seems the most sensible approach here, yeah. If you don't want to make a pull request, I'll implement that.
I don't know GitHub well enough to make a pull request. I'll let you do it. Thanks.
Pushed a commit to the #filepath PR (might as well lump them together).
@LeMoussel I've now gone through and committed fixes to most issues. The CSV had two new issues: mw.ext.data.get not being implemented (it's an extension that isn't used in Wiktionary) and getBadges not being implemented (it's a new function introduced in 2022~2023). PROTECTIONLEVEL and #property should also be handled now. If you could check out all of these issues and see if they work on your end now, that would be grand.
First of all, let me congratulate you both. You're doing a hell of a job! THANKS.
So I updated both packages and ran an analysis process on the first 1,000 articles in the database. Attached is a CSV file of the errors and/or warnings encountered: wiki_errors.csv
The number of errors/warnings by message is summarized below:
2024-03-06 14:58:03 ERROR 1: LUA error in #invoke('Mapframe', 'main')
2024-03-06 14:58:03 ERROR 1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-06 14:58:03 ERROR 2: LUA error in #invoke('Excerpt', 'main', ' only = \U00102195', ' files = ', ' lists = ', ' templates = ', ' paragraphs = ', ' references = ', ' subsections = ', ' bold = ', ' more = ', ' hat = ', ' this = ', ' quote = ', ' inline = ')
2024-03-06 14:58:03 ERROR 2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-06 14:58:03 ERROR 16: LUA error in #invoke('Durée', 'duree')
2024-03-06 14:58:03 ERROR 10: LUA error in #invoke('Graph', 'chartWrapper')
2024-03-06 14:58:03 ERROR 13: LUA error in #invoke('Durée', 'duree', 'en année=1')
2024-03-06 14:58:03 ERROR 2: TimeOut
2024-03-06 14:58:03 WARNING 51: #tag creating non-allowed tag <maplink> - omitted
2024-03-06 14:58:03 WARNING 19: #tag creating non-allowed tag <poem> - omitted
2024-03-06 14:58:03 WARNING 10: #tag creating non-allowed tag <mapframe> - omitted
2024-03-06 14:58:03 WARNING 6: #tag creating non-allowed tag <graph> - omitted
2024-03-06 14:58:03 WARNING 1: invalid attribute format '' missing name
2024-03-06 14:58:03 WARNING 2: #tag creating non-allowed tag <timeline> - omitted
The previous errors all seem to be resolved. Good job!
The TimeOut error is due to the analysis of an article taking longer than 30 seconds. This matches the articles Dreamcast and Écriture hiéroglyphique égyptienne. It should be noted that these articles are large; I don't know if this long processing time (> 30 s) is normal.
There also seems to be a regression in the generated text: in certain texts, class=noviewer is present. Unless I'm mistaken, this was not there before. For example, we find this for Algèbre générale, Algèbre linéaire, and Arc de triomphe de l'Étoile.
I found why. See https://github.com/tatuylonen/wikitextprocessor/issues/225#issuecomment-1985213379
What is surprising is that there are quite a few errors about the unrecognized parser function '#invoque'. Is this a typo (invoque vs. invoke) in the articles?
If the article doesn't look right, then it's probably a typo, but it is also very possible that French Wikipedia just... accepts invoque. I'm going to take a look; hopefully it's the former.
Please give data on the #invoque errors; there are none in the new CSV.
Please encode your file in UTF-8 for maximum portability.
Just to talk about most of the stuff in the previous CSV (not the invoque stuff): any error about "#tag creating non-allowed tag" has already been 'resolved' in https://github.com/tatuylonen/wikitextprocessor/issues/209. You need to create extension tag data (just copy, paste, and repeat the stuff seen in that thread) when you process it. These are extensions, not core Wikimedia stuff, so we can't 100% predict what will turn up in a tag, so I made a (maybe just temporary) parameter, extension_tags, that takes extension tag information and enables you to parse things like <maplink>, <poem>, <graph>, etc. Relevant code here: https://github.com/tatuylonen/wikitextprocessor/issues/209#issuecomment-1961082816
The 'HTML-like' tags will now be parsed, and you can handle them further (for example, you can ignore them) by using a node_handler function passed into other functions; see the sketch below.
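A minimal sketch of that setup, mirroring the extension_tags data used later in this thread (the "phrasing" parents/content values are the permissive placeholders from issue #209, not a definitive schema):

from wikitextprocessor import Wtp

# Declare the extension tags the parser should accept as HTML-like nodes.
extension_tags = {
    "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
    "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
    "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
}

wtp = Wtp(
    db_path="fr-wiki-latest.db",  # an existing French Wikipedia dump database
    lang_code="fr",
    project="wikipedia",
    extension_tags=extension_tags,
)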
The issue with #invoque was the more annoying one (i.e. we have to do some work), and fr.wikipedia obviously allows it as an alias for #invoke. I made a PR with some simple changes; it shouldn't break anything (it's just a smidge less efficient, because we're using "something in set_of_strings" instead of "something == string"), so I'll probably merge it toot sweet.
In this case, you need to either initialize Wtp with invoke_aliases, a set of strings that stand for aliases of #invoke, or you can modify Wtp.invoke_aliases like I did for the test in the pull request. Wtp.invoke_aliases is a set of strings, so you can use Wtp.invoke_aliases = Wtp.invoke_aliases | {"#invoque"} (with the #) to modify it, replacing Wtp with the name that is appropriate in the context (ctx.wtp in wiktextract, for example).
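Both options as a short sketch (whether the invoke_aliases constructor argument replaces or extends the default alias set is an assumption here; the union form below sidesteps the question):

# Option 1: pass the alias when constructing Wtp.
wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
    invoke_aliases={"#invoque"},
)

# Option 2: extend the alias set on an existing instance.
wtp.invoke_aliases = wtp.invoke_aliases | {"#invoque"}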
For LOCALYEAR, LOCALMONTH, LOCALDAY2 & LOCALHOUR, I found this: the "local" date and time is Central European time (CET/CEST) on the French-language Wikipedia. For this, I offer you the following code, Test_fn.sh:
#!/usr/bin/env python
from collections.abc import Callable
from datetime import datetime, timezone

# NOTE: astimezone() with no argument converts to the *system* local timezone,
# so these match CET/CEST only on a host configured for that zone.


def localyear_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALYEAR magic word."""
    utc_dt = datetime.now(timezone.utc)
    return str(utc_dt.astimezone().year)


def localmonth_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALMONTH magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%m")


def localday2_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALDAY2 magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%d")


def localhour_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALHOUR magic word."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%H")


#
# FR Wikipedia:Sandbox https://fr.wikipedia.org/w/index.php?title=Aide:Bac_%C3%A0_sable&veaction=edit
#
# {{LOCALYEAR}} -> 2024
print(f"LOCALYEAR: {localyear_fn(None, None, None, None)}")
# {{LOCALMONTH}} -> 03
print(f"LOCALMONTH: {localmonth_fn(None, None, None, None)}")
# {{LOCALDAY2}} -> 07
print(f"LOCALDAY2: {localday2_fn(None, None, None, None)}")
# {{LOCALHOUR}} -> 10
print(f"LOCALHOUR: {localhour_fn(None, None, None, None)}")
On a minor note, do not save Python files as .sh; that's bound to cause problems down the line! I'm copying these implementations (and adding a localday_fn; the strftime syntax there is "%-d") and naming you co-author, these look good enough.
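For reference, a localday_fn in the same pattern as the functions above (the "%-d" format mentioned is a glibc extension that drops the leading zero, so it is platform-dependent; this sketch just applies it):

def localday_fn(
    ctx: "Wtp", fn_name: str, args: list[str], expander: Callable[[str], str]
) -> str:
    """Implements the LOCALDAY magic word (day of month, no leading zero)."""
    utc_dt = datetime.now(timezone.utc)
    return utc_dt.astimezone().strftime("%-d")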
OK. For next time, I'm going to create a Pull Request. This will make it easier for you.
It seems correct, thank you for your contribution!
https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1983125097: the #invoque issue is corrected.
Test code:

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext

wtp = Wtp(
    db_path="fr-wiki-latest.db",
    lang_code="fr",
    project="wikipedia",
)
wxr = WiktextractContext(wtp, WiktionaryConfig())

wiki_page_title = "Chobits"
wiki_page = wxr.wtp.get_page(wiki_page_title)
wxr.wtp.start_page(wiki_page.title)
wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}
wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
# clean_node_handler and template_handler are user-defined callbacks;
# see the full script later in this thread.
text = clean_node(
    wxr=wxr,
    sense_data={},
    wikinode=wiki_nodes,
    collect_links=False,
    node_handler_fn=clean_node_handler,
    template_fn=template_handler,
)
https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1983077338: the issue about "#tag creating non-allowed tag" is corrected.
With all the corrections made, I carried out a new analysis on 1,000 Wikipedia pages. Here are the remaining errors/warnings:
2024-03-08 11:12:13 ERROR 29: LUA error in #invoke('Durée', 'duree')
2024-03-08 11:12:13 ERROR 2: LUA error in #invoke('Excerpt', 'main')
2024-03-08 11:12:13 ERROR 2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-08 11:12:13 ERROR 1: LUA error in #invoke('Mapframe', 'main')
2024-03-08 11:12:13 ERROR 1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-08 11:12:13 ERROR 1: TimeOut
2024-03-08 11:12:13 WARNING 1: invalid attribute format '' missing name
There are only 9 articles out of 1000 with errors/warnings. :thumbsup: :ok_hand: Attached is wiki_errors.csv with all errors/warnings.
That is excellent, thank you for checking for these errors, it really helps! I'll take a look at these next week, currently I'm trying to figure out if an error we're getting on the Wiktextract side is because of us or because of changes made (and reverted) on Wikimedia's side of things...
I'm going to create another issue (https://github.com/tatuylonen/wikitextprocessor/issues/243) not related to errors/warnings, because I noticed, in certain texts, the presence of spurious text.
Hi, could you take a look at the current situation? We've made a lot of small changes, so many of these errors are probably affected.
Here is the situation for the 1,000-page analysis:
2024-03-13 15:00:51 INFO Parallel processing running on 12 CPU cores
2024-03-13 15:01:02 INFO Wikipedia pages processed: 100
2024-03-13 15:01:11 INFO Wikipedia pages processed: 200
2024-03-13 15:01:21 INFO Wikipedia pages processed: 300
2024-03-13 15:01:30 INFO Wikipedia pages processed: 400
2024-03-13 15:01:39 INFO Wikipedia pages processed: 500
2024-03-13 15:01:44 ERROR 'Choisy-le-Roi' -> 1 ERR
2024-03-13 15:01:44 ERROR 'Créteil' -> 1 ERR
2024-03-13 15:01:45 ERROR 'Droit' -> 2 ERR
2024-03-13 15:01:48 INFO Wikipedia pages processed: 600
2024-03-13 15:01:57 ERROR 'Ford' -> 2 ERR
2024-03-13 15:01:58 INFO Wikipedia pages processed: 700
2024-03-13 15:01:59 ERROR 'Fonds monétaire international' -> 16 ERR
2024-03-13 15:02:01 ERROR 'Élection présidentielle française de 1965' -> 6 ERR
2024-03-13 15:02:01 ERROR 'Élection présidentielle française de 1969' -> 7 ERR
2024-03-13 15:02:02 WARNING 'Festival de Cannes' -> 1 WARN
2024-03-13 15:02:08 INFO Wikipedia pages processed: 800
2024-03-13 15:02:18 INFO Wikipedia pages processed: 900
2024-03-13 15:02:49 INFO Processing finished!
=> Of these 1000 pages, 1 page had warnings (WARN) and 7 pages had errors (ERR).
Summary of error types:
2024-03-13 15:10:03 ERROR 29: LUA error in #invoke('Durée', 'duree')
2024-03-13 15:10:03 ERROR 2: LUA error in #invoke('Excerpt', 'main')
2024-03-13 15:10:03 ERROR 2: LUA error in #invoke('Titulaires', 'tableauDesDirigeants')
2024-03-13 15:10:03 ERROR 2: TimeOut
2024-03-13 15:10:03 ERROR 1: LUA error in #invoke('Mapframe', 'main')
2024-03-13 15:10:03 ERROR 1: LUA error in #invoke('Jumelages', 'tableauDesJumelages')
2024-03-13 15:10:03 WARNING 1: invalid attribute format '' missing name
Test invoke 'Durée'
from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext


class TestDurée(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_durée(self):
        self.wxr.wtp.start_page("Test Durée")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Dur%C3%A9e
        tree = self.wxr.wtp.parse(text="{{Durée|13|3|2024}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)
[ ] Test KO with error:
Test Durée: ERROR: LUA error in #invoke('Durée', 'duree') parent ('Modèle:Durée', {1: '13', 2: '3', 3: '2024'}) at ['Test Durée', 'Durée', '#invoke', '#invoke']
[string "Durée"]:67: attempt to perform arithmetic on a string value
Test invoke 'Titulaires'
from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext


class TestTitulaires(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_titulaires(self):
        self.wxr.wtp.start_page("Test Titulaires")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Titulaires
        tree = self.wxr.wtp.parse(text="{{Titulaires}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)
[ ] Test KO with error:
Test Titulaires: ERROR: LUA error in #invoke('Titulaires', 'tableauDesTitulaires') parent ('Modèle:Titulaires', {}) at ['Test Titulaires', 'Titulaires', '#invoke', '#invoke']
[string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément. ("No Wikidata entity for the item.")
Test invoke 'Mapframe'
from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext


class TestMapframe(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_mapframe(self):
        self.wxr.wtp.start_page("Test Mapframe")
        tree = self.wxr.wtp.parse(text="{{#invoke:Mapframe|main}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)
[ ] Test KO with error:
Test Mapframe: ERROR: LUA error in #invoke('Mapframe', 'main') parent None at ['Test Mapframe', '#invoke', '#invoke']
[string "Mapframe"]:997: attempt to index local 'parent' (a nil value)
Test invoke 'Jumelages'
from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext


class TestJumelages(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_jumelages(self):
        self.wxr.wtp.start_page("Test Jumelages")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Jumelages
        tree = self.wxr.wtp.parse(text="{{Jumelages}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)
[ ] Test KO with error:
Test Jumelages: ERROR: LUA error in #invoke('Jumelages', 'tableauDesJumelages') parent ('Modèle:Jumelages', {}) at ['Test Jumelages', 'Jumelages', '#invoke', '#invoke']
[string "Jumelages"]:207: Pas d'entité Wikidata pour l'élément.
Update about what I'm currently doing regarding the error in Durée: there's an XXX comment in our implementation of formatDate that says we still need to do the actual formatting. So formatDate returns a 'wrongly' formatted string, Lua's automatic string-to-number casting can't handle it, and the ...formatDate()/3600 evaluation fails. I just need to complete the formatDate implementation, which can either be pretty simple or a real pain. We'll have to see...
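To illustrate the failure mode in plain Lua (not our code): arithmetic coerces a string operand only when the whole string parses as a number, which is exactly the "attempt to perform arithmetic on a string value" error in the Durée test above.

print("7200" / 3600)     -- 2: the numeric string is coerced, division succeeds
print("2 jours" / 3600)  -- error: attempt to perform arithmetic on a string value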
I will make a PR with an implementation of mw.language:formatDate now, which should fix the issue with 'Durée'. I spent much, much too much time trying to roll my own implementation... and then I found out that timefn, the parser function for {{#time: ...}}, already had all the needed code!
Test invoke 'Mapframe'
[ ] Test KO with error:
Test Mapframe: ERROR: LUA error in #invoke('Mapframe', 'main') parent None at ['Test Mapframe', '#invoke', '#invoke']
[string "Mapframe"]:997: attempt to index local 'parent' (a nil value)
The issue here is that you're calling Mapframe|main directly. Wiki uses 'frames': object layers that refer to things like the article, a template call, or a module call, and a frame can have a parent frame, namely whatever called it originally. In this case, from our code's perspective, there is no parent (the same is true from wiki's perspective, though there the call fails for the same reason ours fails at the next step), so frame:getParent() returns nil, which causes the error here.
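As a sketch of the frame relationship being described (standard Scribunto API; the nil guard is what a module would need to survive a direct #invoke):

local frame = mw.getCurrentFrame()
local parent = frame:getParent()  -- the template call that invoked this module
if parent ~= nil then
    local args = parent.args      -- arguments passed to the template
end
-- When the module is invoked directly, as in the test above, there is no
-- template layer, so parent is nil and indexing it raises the error seen.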
Using the template {{Maplink}} means we don't get this specific error at line 997, but another one, probably related to missing map data:

[string "Mapframe"]:791: attempt to concatenate a nil value

Line 791 is:

attribs.text = ' ' .. util.getParameterValue(args, 'text') or ' ' .. L10n.defaults.text

which fails when util.getParameterValue returns nil for args.text. I think there's a bug in this code: if util.getParameterValue can return nil, then applying the .. concatenation operator to it won't result in the other branch of the or being chosen, but in an error, because .. is evaluated before or. I think this should probably be something like

attribs.text = ' ' .. (util.getParameterValue(args, 'text') or L10n.defaults.text)
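A minimal Lua sketch of the precedence issue, separate from the module's code:

local v = nil
-- ' ' .. v or 'default'          -- error: '..' binds tighter than 'or', so the
--                                -- concatenation with nil happens first
print(' ' .. (v or 'default'))    -- " default": the fallback now works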
I tested this change, and now there are no more Lua errors.
The above bug on line 791 seems to be different from the English original. I've left a message on the module talk page; I've given up on editing wikis myself, so let the dice fall where they may.
Fun fact: did you know that Caesar's famed line when he crossed the Rubicon, 'alea iacta est', was him quoting a play? It was a pop-culture reference.
Errors related to Jumelages and other "can't find args.wikidata" stuff were actually complicated by my misreading of what was happening. I tried out Créteil, for example, and it now works on this end.
The issue is that you are testing these templates and modules out of context, so they're lacking parameters. I was convinced getParent() -> args.wikidata had to refer to the main article, but no: as xxyzz pointed out, these parents were just the frames of the template above the module level, which should have been called with a |wikidata=...| argument. There was no default page meta value being accessed from the page's frame...
Please check out these errors again. Remember to enable extension tags, too. Please don't call modules or templates completely without context: if a template has arguments, try to find an example and use that. Also, when starting the page, remember that the page title is an actual variable that modules, templates, and the page itself can access, so it needs to be 'appropriate' for the context in case it is needed somewhere.
import re
from typing import Optional

# https://github.com/tatuylonen/wikitextprocessor/
from wikitextprocessor import (
    Wtp,
    NodeKind,
    wikidata,
    WikiNode,
)

# https://github.com/tatuylonen/wiktextract
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig
from wiktextract.page import clean_node


def clean_node_handler(node) -> Optional[str]:
    # Drop maintenance banners, infoboxes, references, and file links;
    # return None to keep default handling for everything else.
    if node.kind == NodeKind.TEMPLATE:
        if node.largs[0][0] in [
            "Semi-protection",
            "Semi-protection longue",
            "Confusion",
            "coord",
            "Portail",
            "Voir homonymes",
        ]:
            return ""
        if re.match(r"\s*Infobox\s*", node.largs[0][0], re.I):
            return ""
        if re.match(r"\s*Article\s*", node.largs[0][0], re.I):
            return ""
        if re.match(r"\s*Référence\s*", node.largs[0][0], re.I):
            return ""
    if node.kind == NodeKind.LEVEL2:
        if node.largs[0][0] in ["Annexes", "Notes et références", "Voir aussi"]:
            return ""
    if node.kind == NodeKind.LINK:
        if re.match(r"\s*Fichier\s*:", node.largs[0][0], re.I):
            return ""
    return None


def template_handler(name, args_ht):
    if name == "Méta bandeau de note":
        if len(args_ht) > 0:
            if "icône" in args_ht:
                args_ht["icône"] = ""
    return None


if __name__ == "__main__":
    extension_tags = {
        "maplink": {"parents": ["phrasing"], "content": ["phrasing"]},
        "poem": {"parents": ["phrasing"], "content": ["phrasing"]},
        "gallery": {"parents": ["phrasing"], "content": ["phrasing"]},
        "graph": {"parents": ["phrasing"], "content": ["phrasing"]},
        "mapframe": {"parents": ["phrasing"], "content": ["phrasing"]},
        "timeline": {"parents": ["phrasing"], "content": ["phrasing"]},
    }
    wxr = WiktextractContext(
        wtp=Wtp(
            db_path="fr-wiki-latest.db",
            lang_code="fr",
            project="wikipedia",
            extension_tags=extension_tags,
        ),
        config=WiktionaryConfig(),
    )
    wiki_page_title = "Créteil"
    wiki_page = wxr.wtp.get_page(wiki_page_title)
    wxr.wtp.start_page(wiki_page.title)
    wxr.wtp.invoke_aliases = wxr.wtp.invoke_aliases | {"#invoque"}
    info_log = f"Analysis: '{wiki_page_title}'\n"
    wiki_nodes = wxr.wtp.parse(text=wiki_page.body)
    text = clean_node(
        wxr=wxr,
        sense_data={},
        wikinode=wiki_nodes,
        collect_links=False,
        node_handler_fn=clean_node_handler,
        template_fn=template_handler,
    )
    if len(wxr.wtp.errors) > 0:
        info_log += f"# Errors: {len(wxr.wtp.errors)}\n"
    if len(wxr.wtp.warnings) > 0:
        info_log += f"# Warnings: {len(wxr.wtp.warnings)}"
    print(info_log)
After updating the Git repositories:
Créteil: DEBUG: HTML tag <center> not properly closed at ['Créteil'] parsing Créteil/Géographie/Topographie
started on line 65, detected on line 65
Créteil: ERROR: LUA error in #invoke('Jumelages', 'tableauDesJumelages') parent ('Modèle:Jumelages', {'zoom': '1', 'titre': 'Villes jumelées avec Créteil'}) at ['Créteil', 'Jumelages', '#invoke', '#invoke']
[string "Jumelages"]:207: Pas d'entité Wikidata pour l'élément.
Analysis: 'Créteil'
# Errors: 1
Choisy-le-Roi: ERROR: LUA error in #invoke('Mapframe', 'main') parent ('Modèle:Maplink', {'type': 'shape', 'frame': 'yes', 'frame-height': '300', 'frame-width': '300', 'frame-align': 'center', 'fill': '#000000', 'fill-opacity': '0', 'stroke-color': '#99694C', 'stroke-width': '2.5', 'type2': 'point', 'marker2': 'town-hall', 'marker-size2': 'small', 'marker-color2': '#F000FF', 'text': 'Carte de la commune avec localisation de la mairie.'}) at ['Choisy-le-Roi', 'maplink', '#invoke', '#invoke']
Coordinates must be specified on Wikidata or in |coord=
Analysis: 'Choisy-le-Roi'
# Errors: 1
Droit: ERROR: LUA error in #invoke('Excerpt', 'main', ' only = \U00102185', ' files = ', ' lists = ', ' templates = ', ' paragraphs = ', ' references = ', ' subsections = ', ' bold = ', ' more = ', ' hat = ', ' this = ', ' quote = ', ' inline = ') parent ('Modèle:Extrait', {1: 'Positivisme juridique'}) at ['Droit', 'Extrait', '#invoke', '#invoke']
[string "Module:TNT"]:190: Invalid message key "error_bad_msgkey"
Droit: ERROR: LUA error in #invoke('Excerpt', 'main', ' only = \U00102185', ' files = ', ' lists = ', ' templates = ', ' paragraphs = ', ' references = ', ' subsections = ', ' bold = ', ' more = ', ' hat = ', ' this = ', ' quote = ', ' inline = ') parent ('Modèle:Extrait', {1: 'Branches du droit'}) at ['Droit', 'Extrait', '#invoke', '#invoke']
[string "Module:TNT"]:190: Invalid message key "error_bad_msgkey"
Analysis: 'Droit'
# Errors: 2
Ford: ERROR: LUA error in #invoke('Titulaires', 'tableauDesDirigeants') parent ('Modèle:Liste des dirigeants successifs', {'types': 'directeur général', 'titre': '[[Directeurs généraux]] (CEO)', 'portrait': 'oui'}) at ['Ford', 'Liste des dirigeants successifs', '#invoke', '#invoke']
[string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément.
Ford: ERROR: LUA error in #invoke('Titulaires', 'tableauDesDirigeants') parent ('Modèle:Liste des dirigeants successifs', {'types': "membre du conseil d'administration", 'titre': "Membres du [[conseil d'administration]]", 'portrait': 'oui'}) at ['Ford', 'liste des dirigeants successifs', '#invoke', '#invoke']
[string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément.
Analysis: 'Ford'
# Errors: 2
Festival de Cannes: WARNING: invalid attribute format '' missing name at ['Festival de Cannes', '#tag', '#tag']
Analysis: 'Festival de Cannes'
# Warnings: 1
I will be away next week, but I will continue to look at things after that.
I made a mistake with the Jumelages stuff earlier; I thought it was fixed because I wasn't getting any error messages... The problem was that I had redirected the output of the test into a file (because I was also printing the output text) and then completely forgot I'd done so, so I missed the errors. As you are aware, the issues with these articles persist, and unfortunately it's a big thing to fix, because we need to implement a lot of the Wikibase extension. I've found reading the code a bit difficult; I can't even figure out which of the several (there are MANY) getEntity functions is the one I should be concerned with here. Additionally, we'd need to create a lot of Lua code in between to replicate the methods on the special return value table you get from getEntity (hopefully it's pretty much the same as the data returned from Wikidata), etc., etc. It's going to be a mess.
OK. If I can help you (knowing that I have little understanding of the Wikimedia Template architecture), don't hesitate to ask me.
I think it'll be quite difficult to implement mw.wikibase.getEntity with our current SQLite cache approach, because it returns nested Wikidata property tables. Even if we implement this API, the output of a template like "Titulaires" is a table, so the converted text will be empty or some combined garbage.
@LeMoussel I'd suggest you take a moment to read the docs of Beautiful Soup or lxml in the meantime. I believe your goal (getting the whole page text without caring about the HTML/wikitext structure) could be achieved by running an HTML parser on the HTML dump file: for example, you could use get_text() or XPath, as sketched below.
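A minimal sketch of that suggestion (get_text() is real Beautiful Soup API; the file name and the assumption that each page is a separate HTML document are illustrative):

from bs4 import BeautifulSoup

# Hypothetical: one page's HTML taken from an HTML dump.
with open("Créteil.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Drop non-content elements, then extract the plain text.
for tag in soup(["script", "style", "table"]):
    tag.decompose()
page_text = soup.get_text(separator="\n", strip=True)
print(page_text)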
If we query wikidata 'through' wikibase and cache each response separately, could that work? Basically a separate system for mw.wikibase stuff... Ugh. I'm annoyed just thinking about it. More and more stuff... But it would be really nice to get something working. Might be infeasible.
The difficulty is the data structure. Wikidata's data is stored in an RDF (graph) database, and it's awkward to save this data structure in SQL tables (imagine which data owns which property; it's a many-to-many relationship). We'd also have to use an RDF database to re-implement Wikidata's database. IMO, it's impractical...
What if we save each query result keyed by the query itself, without caring about making interconnections, just "flattening" it: using the Lua table we get as a result and ignoring all of the database stuff? That is, caching the stuff you get from mw.wikibase.getEntity and other functions and methods. A sketch of what I mean follows.
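A sketch of that proposal, under heavy assumptions: entities are fetched as JSON from the public wbgetentities API (a real endpoint) rather than through Wikibase's Lua bindings, and each response is stored flat in a SQLite table keyed by entity ID. The table name and function are hypothetical:

import json
import sqlite3
import urllib.request

API = "https://www.wikidata.org/w/api.php?action=wbgetentities&format=json&ids="

def get_entity_cached(conn: sqlite3.Connection, entity_id: str) -> dict:
    """Return the raw entity JSON, fetching and caching it on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS wikibase_cache (id TEXT PRIMARY KEY, data TEXT)"
    )
    row = conn.execute(
        "SELECT data FROM wikibase_cache WHERE id = ?", (entity_id,)
    ).fetchone()
    if row is not None:
        return json.loads(row[0])
    with urllib.request.urlopen(API + entity_id) as resp:
        data = json.load(resp)["entities"][entity_id]
    conn.execute(
        "INSERT INTO wikibase_cache VALUES (?, ?)", (entity_id, json.dumps(data))
    )
    conn.commit()
    return data

Converting such JSON back into the nested Lua tables (and their methods) that mw.wikibase.getEntity callers expect would still be the hard part, as the reply below notes.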
I don't think there is a MediaWiki API that could run arbitrary Lua code, and such an API could only return JSON data, not Lua objects. We'd have to convert the data back into nested Lua tables, and creating that nested Lua property table from Wikidata query results is also difficult. IMO, the benefit of implementing mw.wikibase.getEntity is too small (Wiktionary doesn't use it) and it requires too much effort...
I should take a look at how MediaWiki implements this API; the last time I read the Wikidata extension code, I had a hard time finding the code that actually implements the API...
Updating the Git wikitextprocessor & wiktextract repositories.

wiktextract:
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wiktextract$ git show
commit 45680ad3527cef40cebb53044a02a2b1d4c0463e (HEAD -> master, origin/master, origin/HEAD)
Merge: cdbac033 834d2e68
Author: xxyzz <gitpull@protonmail.com>
Date: Wed Apr 3 16:03:05 2024 +0800
Merge pull request #568 from xxyzz/de
Improve de edition extract translation code
wikitextprocessor:
dev@dev-B550M-DS3H:~/Python/WikiExtractor/wikitextprocessor$ git show
commit f732746175cfbd3c2916428636f7e40b02a74219 (HEAD -> main, origin/main, origin/HEAD)
Merge: 07538d6 14dda0a
Author: xxyzz <gitpull@protonmail.com>
Date: Wed Apr 3 17:00:33 2024 +0800
Merge pull request #264 from xxyzz/number_of_articles
Implement "NUMBEROFPAGES" and "NUMBEROFARTICLES" magic words
I got this error:
Traceback (most recent call last):
File "/home/dev/Python/WikiExtractor/./testPage.py", line 202, in <module>
text = clean_node(
File "/home/dev/Python/WikiExtractor/wiktextract/src/wiktextract/page.py", line 361, in clean_node
v = wxr.wtp.node_to_html(
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1997, in node_to_html
return to_html(
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/node_expand.py", line 221, in to_html
expanded = ctx.expand(
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1713, in expand
expanded = expand_recurse(encoded, parent, not pre_expand)
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1643, in expand_recurse
t = expand_recurse(
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1501, in expand_recurse
ret = expand_parserfn(
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1425, in expand_parserfn
ret = invoke_fn(args, expander, parent)
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/core.py", line 1310, in invoke_fn
ret = call_lua_sandbox(self, invoke_args, expander, parent, timeout)
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/luaexec.py", line 395, in call_lua_sandbox
initialize_lua(ctx) # This sets ctx.lua
File "/home/dev/Python/WikiExtractor/wikitextprocessor/src/wikitextprocessor/luaexec.py", line 334, in initialize_lua
lua.table_from(copy.deepcopy(ctx.NAMESPACE_DATA), recursive=True), # type: ignore
TypeError: table_from() got an unexpected keyword argument 'recursive'
You need to update the Lupa package.
Updating the Lupa package doesn't seem to work; upgrading through pip didn't help, and reinstalling wikitextprocessor doesn't either, because pip already has 2.1... This needs better instructions.
In line with @kristian-clausal's comment, here is the Lupa package version on my system:
dev@dev-B550M-DS3H:~/Python/WikiExtractor$ pip show lupa
Name: lupa
Version: 2.1
Summary: Python wrapper around Lua and LuaJIT
Home-page: https://github.com/scoder/lupa
Author: Stefan Behnel
Author-email: stefan_ml@behnel.de
License: MIT style
Location: /home/dev/.local/lib/python3.10/site-packages
Requires:
Required-by: wikitextprocessor
Attached CSV file: wiki_errors.csv, listing, by Wikipedia article title, errors other than those indicated in issues #225, #224, #223, #220 & #216.
In summary, there are the following errors: