tatuylonen / wikitextprocessor

Python package for WikiMedia dump processing (Wiktionary, Wikipedia etc). Wikitext parsing, template expansion, Lua module execution. For data extraction, bulk syntax checking, error detection, and offline formatting.

Data dumps do not contain interwiki link (Wikidata) data #257

Closed kristian-clausal closed 4 months ago

kristian-clausal commented 4 months ago
**Test invoke 'Titulaires'**
from unittest import TestCase

from wikitextprocessor import Wtp
from wiktextract.page import clean_node
from wiktextract.wxr_context import WiktextractContext
from wiktextract.config import WiktionaryConfig

class TestTitulaires(TestCase):
    def setUp(self):
        self.wxr = WiktextractContext(
            wtp=Wtp(
                db_path="fr-wiki-latest.db",
                lang_code="fr",
                project="wikipedia",
            ),
            config=WiktionaryConfig(),
        )

    def tearDown(self):
        self.wxr.wtp.close_db_conn()

    def test_titulaires(self):
        self.wxr.wtp.start_page("Test Titulaires")
        # https://fr.wikipedia.org/wiki/Mod%C3%A8le:Titulaires
        tree = self.wxr.wtp.parse(text="{{Titulaires}}", expand_all=True)
        clean_node(
            wxr=self.wxr,
            sense_data={},
            wikinode=tree,
        )
        self.assertEqual(len(self.wxr.wtp.errors), 0)

[ ] Test KO with error Test Titulaires: ERROR: LUA error in #invoke('Titulaires', 'tableauDesTitulaires') parent ('Modèle:Titulaires', {}) at ['Test Titulaires', 'Titulaires', '#invoke', '#invoke'] [string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément.

Originally posted by @LeMoussel in https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1994652325

After some digging, I am pretty confident in saying that this error can't be fixed as things currently stand.

Parser functions like #property and #statement, and in this case Lua frame objects with a frame.args.wikidata value, apparently (although I couldn't find documentation for this) have access to a Wikidata Q123456789-style identifier that links a Wikipedia page to a Wikidata item. This way, they can default to using the page's own Wikidata item if no Wikidata identifier is provided.

These properties are set in the Tool menu on a Wikipedia article on the right, so they're not part of the source code. There's some indication (I think? I'm not sure I understand it) that creating a [[wikidata:...]] or [[d:...]] link can also work like this...

AFAICT, the data dumps we use don't have that metadata, so our #statement, #property and modules like French Wikipedia Module:Titulaires can't work properly.

Any work-arounds for this (polling a database somewhere to get the right Wikidata link for a page) seem costly in time. Doing it when creating the .db cache file also sounds like it would make things significantly slower there. The best solution would be to find a magical, easily parsed file somewhere among the dump files with this data... There might be some .sql.gz files like

Interwiki link tracking records frwiki-20240301-iwlinks.sql.gz 76.4 MB

but I'll leave it alone in case someone else has anything simpler to work with.

kristian-clausal commented 4 months ago

We took a look at these .sql files, and they contain SQL statements that can be used to recreate the database.

The data linking a page id and its wikidata id is https://dumps.wikimedia.org/XXwiki/latest/XXwiki-latest-page_props.sql.gz

Because SQL is a programming language and apostrophes inside strings are properly escaped as \', Tatu thinks it would be pretty simple to convert the SQL source into a CSV (or internal list form) with regex substitutions, and then extract the Wikidata references (actually ´wikibase_somethingorother´; it's on another computer) by processing that data.
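A minimal sketch of the regex idea, not code from wikitextprocessor. It assumes each row in the dump has the shape `(page_id,'propname','value',sortkey)` and that the property of interest is called `wikibase_item`; both are assumptions to verify against an actual page_props dump, and `extract_page_props` is a hypothetical helper name.

```python
import re

# One row tuple: (page_id,'propname','value',sortkey).
# Quoted strings may contain backslash escapes such as \' (assumed layout).
ROW_RE = re.compile(
    r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)',(?:NULL|[-+.\deE]+)\)"
)


def extract_page_props(sql_lines, prop_name):
    """Return {page_id: value} for every row whose property name matches."""
    result = {}
    for line in sql_lines:
        # Only the INSERT statements carry row data.
        if not line.startswith("INSERT INTO"):
            continue
        for match in ROW_RE.finditer(line):
            if match.group(2) == prop_name:
                result[int(match.group(1))] = match.group(3)
    return result
```

For a real dump the lines could come from `gzip.open(path, "rt", encoding="utf-8", errors="replace")`. A value that itself contains something shaped like a row tuple would confuse this pattern, so a proper SQL tokenizer would be safer if that ever turns out to matter.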

AFAICT, our database schema doesn't have an ID field for 'page', but that data is actually in the original article dump file, in the XML, so it should be possible to get it from there.
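As a rough illustration of reading the page id from the XML, here is a sketch using only the standard library. It assumes the usual MediaWiki export layout where `<id>` is a direct child of `<page>` (the `<id>` elements under `<revision>` and `<contributor>` are different ids); `iter_page_ids` is a hypothetical helper, not part of wikitextprocessor.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_page_ids(xml_file):
    """Yield (title, page_id) for each <page> element in a dump file object."""
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        # Look only at direct children, so revision and contributor ids
        # nested deeper in the page are ignored.
        fields = {
            local_name(child.tag): child.text
            for child in elem
            if local_name(child.tag) in ("title", "id")
        }
        yield fields["title"], int(fields["id"])
        elem.clear()  # release the finished page's subtree


# Tiny hand-written sample in the assumed export layout.
SAMPLE = (
    b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">'
    b"<page><title>Test</title><ns>0</ns><id>7</id>"
    b"<revision><id>99</id><contributor><id>5</id></contributor></revision>"
    b"</page></mediawiki>"
)
PAGE_IDS = list(iter_page_ids(io.BytesIO(SAMPLE)))
```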

By cross-referencing the id with the page-props wikibase id data, we can give each page (in this case Wikipedia page; I don't think Wiktionary pages have associated Wikidata pages, although I guess they might have Wikibase ids..?) a Wikidata reference ID that can be used by #statement, #property and added to the page's frame.args.wikidata field in make_frame.

kristian-clausal commented 4 months ago

This would mean creating a new --page-props-file (or similar) parameter for wiktwords that can be optionally used to extract stuff (maybe not even just Wikidata references) from page-props.sql, and then adding that data into our database, either as a field for page or a new table. I'm not sure which is idiomatic, but I would think that this would seem most sensible as a field for page because it's only the one reference per page.
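To make the "field for page" option concrete, here is a sketch with `sqlite3`. The `pages` table, its columns, and `add_wikidata_ids` are hypothetical stand-ins for illustration, not the real wikitextprocessor schema; pages are keyed by title here only because the cache is said to lack a page id field.

```python
import sqlite3


def add_wikidata_ids(conn, title_to_qid):
    """Add a nullable wikidata_id column and fill it for known titles."""
    conn.execute("ALTER TABLE pages ADD COLUMN wikidata_id TEXT")
    conn.executemany(
        "UPDATE pages SET wikidata_id = ? WHERE title = ?",
        [(qid, title) for title, qid in title_to_qid.items()],
    )
    conn.commit()


# Demo on an in-memory database with a made-up two-column table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [("Page A", "text a"), ("Page B", "text b")],
)
add_wikidata_ids(conn, {"Page A": "Q22"})  # placeholder id
rows = conn.execute(
    "SELECT title, wikidata_id FROM pages ORDER BY title"
).fetchall()
```

Pages without a known item simply keep NULL in the new column, which is one argument for a single nullable field over a separate table when there is at most one reference per page.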

Other files among the dump files contain other non-page data. If anyone can think of anything we could do with that data, please point it out for future consideration.

xxyzz commented 4 months ago

Is this about the unimplemented mw.wikibase.getEntity? The results of this function contain not only links to other wikis but also lots of complex Wikidata property tables (the "claims" table, which is used later at line 858). You could run mw.logObject(mw.wikibase.getEntity('Q42')) in any module edit page's Lua debug console to see the data.

This requires a Wikidata RDF database, and our simple SQLite cache is not up to the task. Since this API is marked as expensive in red text in their documentation, and it's used to create a table rather than page text (like the example sentence source text that the current code is implemented for), I'd suggest we ignore this Lua error for now.

IMO, having the current code call the Wikidata query API is the best we can do to implement these Wikidata APIs. The Wikidata dump file is over 100G and Wikidata runs an RDF database, so we simply can't re-implement Wikidata. Also, the performance bottlenecks in the code are call_lua_sandbox and re.sub; the time spent on Wikidata queries is negligible compared to them.

I also want to point out the args.wikidata at line 901 is the argument of the "Titulaires" template, not part of the Lua frame object.

kristian-clausal commented 4 months ago

This is not about getEntity. This is about the fact that page data should have a simple reference field that points that particular page towards a wikibase entry. What you are talking about is using data in modules and page sources, but this is meta-data that is directly attached to the page source code, but not part of the source code; it's a default reference for a page when a Q* code is not given.

xxyzz commented 4 months ago

But we don't need the Wikidata item id for each page (especially for Wiktionary); adding it won't solve any issue... it's not really important. And I already added some code to get the Wikidata item id for a page title.

kristian-clausal commented 4 months ago

If we can reliably get the page item id from the page title, then this is solved for #property and #statement, but frame.args.wikidata can't be a function, and we can't populate it for every page with an expensive outside call to Wikidata.

xxyzz commented 4 months ago

The current Wikidata query for a page title is reliable, and aren't the #property and #statement issues already solved? Also, #property and #statement are not called for every page; they were only added for these French Wikipedia issues. And again, args.wikidata is the "Titulaires" template argument.

The SQL file you linked only has the Wikidata item ids for titles in the dump file, but the Lua code could request a title that is not in the dump file, and then we would still need to call the Wikidata query API.

I also think both parser functions return more than just the Wikidata item id: they also need to return the Wikidata property id and value, so we would have to call the Wikidata query API again.

kristian-clausal commented 4 months ago

If you call #property or #statement without an id, it will default to the page id.

frame.args.wikidata is not the Titulaires template argument, it's taken from the parent frame which is the page itself.

Article frames have an args.wikidata field that comes from article metadata.

xxyzz commented 4 months ago

I think calling these parser functions without any argument is very rare and could be ignored...

I have checked that frame.args.wikidata is nil for code like this on a page that has the Wikidata item id Q22:

local export = {}
function export.test(frame)
    return frame.args.wikidata
end
return export

I'd say adding Wikidata item ids is kind of low priority... Even if we have to add them no matter what, I would consider loading the SQL file into SQLite or MySQL instead of using regex.

kristian-clausal commented 4 months ago

Titulaires is getting args.wikidata from somewhere, but it's not a template argument. Is this maybe a fr.wikipedia.org thing?

xxyzz commented 4 months ago

I think mw.wikibase.getEntity accepts the item id being omitted (or a nil argument); it will then use the page title to query Wikidata.

kristian-clausal commented 4 months ago

@xxyzz you are correct, and I was wrong; I'd convinced myself that the parent frame was actually the article frame when it was the template frame (with an args.wikidata field). This basically means all of this is moot, and #statement and #property can just use an extra query to get the ID (and then use the ID for whatever query they do).

kristian-clausal commented 4 months ago

@xxyzz

fr.wikipedia.org has the template

=== Jumelages ===

{{Jumelages|zoom=1|titre=Villes jumelées avec Créteil}}[[Fichier:Creteilpanneau.jpg|thumb|Panneau d'entrée de la ville, en 2006.]]{{Note|texte=La municipalité de [[Novi Beograd]] ne figure plus dans la liste actuelle.|groupe=Note}}

Modèle:Jumelages is:

<includeonly>{{#Invoke:Jumelages|tableauDesJumelages}}</includeonly><noinclude>{{Documentation}}</noinclude>

and Module:Jumelages|tableauDesJumelages breaks here:

function p.tableauDesJumelages(frame)
    local args = frame:getParent().args

    -- Entité Wikidata
    local entity = wd.getEntity(args.wikidata)
    if not entity then
        error('Pas d\'entité Wikidata pour l\'élément.')
    end

Can you figure out where it is getting the args.wikidata on the article page (because this is not throwing an error on Wikipedia's side)?

Reopening this issue again.

xxyzz commented 4 months ago

mw.wikibase.getEntity will use the page title if the Wikidata item id is not passed (or is nil); this is in the Lua API documentation.
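On the Python side, the fallback described here could look roughly like the sketch below. `resolve_item_id` and `lookup_by_title` are hypothetical names; the lookup is passed in as a callable so that it can stand for whatever title-to-id query the code already performs. This is only an illustration of the control flow, not the project's actual implementation.

```python
from typing import Callable, Optional


def resolve_item_id(
    item_id: Optional[str],
    page_title: str,
    lookup_by_title: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the explicit item id, or fall back to a lookup by page title."""
    if item_id:
        return item_id
    # No id was passed (nil on the Lua side): use the current page title.
    return lookup_by_title(page_title)
```

With a plain dict as the lookup, `resolve_item_id(None, "Some page", {"Some page": "Q22"}.get)` returns `"Q22"`, while an explicit id is returned unchanged without any lookup.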

kristian-clausal commented 4 months ago

Oh, of course, that's what you were trying to say earlier. Thanks!