Closed — kristian-clausal closed this issue 4 months ago
We took a look at these .sql files, and they contain SQL statements that can be used to recreate the database.

The data linking a page id and its Wikidata id is https://dumps.wikimedia.org/XXwiki/latest/XXwiki-latest-page_props.sql.gz

Because SQL is a programming language and the strings have properly escaped apostrophes (`\'`), Tatu thinks it would be pretty simple to convert the SQL source into a CSV (or an internal list form) with regex substitutions, and then extract the Wikidata references (the property is actually `wikibase_somethingorother`; the exact name is on another computer) by processing that data.
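For what it's worth, the regex approach could look roughly like this. This is a minimal sketch that assumes rows in page_props.sql follow the usual MySQL dump shape `(pp_page,'propname','value',sortkey)` and that the relevant property name is `wikibase_item`; both assumptions should be checked against a real dump before relying on this.

```python
import re

# Matches one row tuple from a MySQL dump INSERT statement:
#   (123,'propname','value'...
# The (?:[^'\\]|\\.)* pieces allow backslash-escaped quotes (\') inside strings.
ROW_RE = re.compile(r"\((\d+),'((?:[^'\\]|\\.)*)','((?:[^'\\]|\\.)*)'")


def extract_wikibase_items(sql_text):
    """Return {page_id: wikidata_item} for rows whose property name is
    'wikibase_item' (assumed name; verify against an actual page_props.sql)."""
    items = {}
    for m in ROW_RE.finditer(sql_text):
        page_id, propname, value = m.groups()
        if propname == "wikibase_item":
            items[int(page_id)] = value
    return items
```

This deliberately skips everything except the first three columns of each tuple, so it sidesteps most of the SQL grammar; a full parser would be overkill for this one extraction.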
AFAICT, our database schema doesn't have an id field for `page`, but that data is actually in the original article dump file, in the XML, so it should be possible to get it from there. By cross-referencing the id with the page-props wikibase id data, we can give each page (in this case each Wikipedia page; I don't think Wiktionary pages have associated Wikidata pages, although I guess they might have Wikibase ids..?) a Wikidata reference id that can be used by `#statement` and `#property`, and added to the page's `frame.args.wikidata` field in `make_frame`.
This would mean creating a new `--page-props-file` (or similar) parameter for wiktwords that can optionally be used to extract stuff (maybe not even just Wikidata references) from page_props.sql, and then adding that data to our database, either as a field on `page` or as a new table. I'm not sure which is more idiomatic, but a field on `page` seems most sensible because there's only one reference per page.
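If it does end up as a field on `page`, the database side could be as small as an `ALTER TABLE`. The sketch below uses an invented table layout and a placeholder item id, not the actual wiktextract schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # stand-in for the real cache .db file

# Hypothetical minimal "pages" table; the real schema has more columns.
con.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, body TEXT)")
con.execute("INSERT INTO pages (title, body) VALUES ('Créteil', '...')")

# Adding a nullable column is cheap in SQLite; existing rows just get NULL.
con.execute("ALTER TABLE pages ADD COLUMN wikidata TEXT")
con.execute("UPDATE pages SET wikidata = 'Q42' WHERE title = 'Créteil'")  # placeholder id

row = con.execute(
    "SELECT wikidata FROM pages WHERE title = 'Créteil'"
).fetchone()
```

Pages without a wikibase entry would simply keep a NULL `wikidata` value, which maps naturally onto "no default reference available".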
Other files among the dump files contain other non-page data; if anyone can think of anything we could do with that data, please point it out for future consideration.
Is this about the unimplemented `mw.wikibase.getEntity`? The results of this function contain not only links to other wikis but also lots of complex Wikidata property tables (the "claims" table, which is used later at line 858). You could run `mw.logObject(mw.wikibase.getEntity('Q42'))` in the Lua debug console of any module edit page to see the data.
This requires a Wikidata RDF database; our simple SQLite cache is not up to the task. Since this API is marked as expensive (in red text) in its documentation, and it's used to create a table rather than page text (like the example-sentence source text the current code is implemented for), I'd suggest we ignore this Lua error for now.

And IMO the current code that calls the Wikidata query API is the best we can do to implement these Wikidata APIs: the Wikidata dump file is over 100 GB and runs on an RDF database, so we simply can't re-implement Wikidata. Besides, the performance bottlenecks are `call_lua_sandbox` and `re.sub`; the time spent on Wikidata queries is negligible compared to them.
I also want to point out that the `args.wikidata` at line 901 is an argument of the "Titulaires" template, not part of the Lua frame object.
This is not about `getEntity`. This is about the fact that page data should have a simple reference field that points a particular page to a Wikibase entry. What you are talking about is using data in modules and page sources, but this is metadata that is directly attached to the page while not being part of its source code; it's the default reference for a page when a Q* code is not given.
But we don't need the Wikidata item id for each page (especially for Wiktionary), and adding it won't solve any issue... it's not really important. And I have already added some code to get the Wikidata item id for a page title.
If we can reliably get the page's item id from the page title, then this is solved for `#property` and `#statement`, but `frame.args.wikidata` can't be a function, and we can't populate it for every page with an expensive outside call to Wikidata.
The current Wikidata query for a page title is reliable, and aren't the issues with `#property` and `#statement` already solved? And `#property` and `#statement` are not called for every page; they are only added for these French Wikipedia issues. And again, `args.wikidata` is the "Titulaires" template argument.
The SQL file you linked only has the Wikidata item ids for the titles in the dump file, but the Lua code could request a title that isn't in the dump file, in which case we would still need to call the Wikidata query API.

And I think both parser functions return more than just the Wikidata item id; they also need to return the Wikidata property id and value, so we would have to call the Wikidata query API again.
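For reference, that extra round-trip could go through the documented `wbgetclaims` action of the Wikidata web API, which returns the claims for one property of one item. Here is a sketch that only builds the request URL; the surrounding caching/retry logic in the real code isn't shown:

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wbgetclaims_url(entity_id, property_id):
    """Build a wbgetclaims request URL for one property of one item.

    wbgetclaims is a documented Wikibase API action; how its JSON response
    is consumed (and cached) is left out of this sketch.
    """
    params = {
        "action": "wbgetclaims",
        "entity": entity_id,
        "property": property_id,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)
```

For example, `wbgetclaims_url("Q42", "P31")` yields a URL whose response contains the "instance of" claims for Douglas Adams's item.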
If you call `#property` or `#statement` without an id, it will default to the page's own id. `frame.args.wikidata` is not the Titulaires template argument; it's taken from the parent frame, which is the page itself. Article frames have an `args.wikidata` field that comes from article metadata.
I think calling these parser functions without any argument is very rare and could be ignored...
I have checked that `frame.args.wikidata` is `nil` for code like this on a page that has Wikidata item id Q22:
```lua
local export = {}

function export.test(frame)
    return frame.args.wikidata
end

return export
```
I'd say adding Wikidata item ids is kind of low priority... Even if we had to add them no matter what, I would consider loading the SQL file into SQLite or MySQL instead of using regex.
`Titulaires` is getting `args.wikidata` from somewhere, but it's not a template argument. Is this maybe a fr.wikipedia.org thing?
I think if `mw.wikibase.getEntity` is passed no item id (or a `nil` argument), it will then use the page title to query Wikidata.
@xxyzz you are correct, and I was wrong; I'd convinced myself that the parent frame was actually the article frame, when it was the template frame (with an `args.wikidata` field). This basically means all of this is moot, and `#statement` and `#property` can just use an extra query to get the id (and then use the id for whatever query they do).
@xxyzz
fr.wikipedia.org has the template

```
=== Jumelages ===
{{Jumelages|zoom=1|titre=Villes jumelées avec Créteil}}[[Fichier:Creteilpanneau.jpg|thumb|Panneau d'entrée de la ville, en 2006.]]{{Note|texte=La municipalité de [[Novi Beograd]] ne figure plus dans la liste actuelle.|groupe=Note}}
```

Modèle:Jumelages is:

```
<includeonly>{{#Invoke:Jumelages|tableauDesJumelages}}</includeonly><noinclude>{{Documentation}}</noinclude>
```
and Module:Jumelages|tableauDesJumelages breaks here:

```lua
function p.tableauDesJumelages(frame)
    local args = frame:getParent().args
    -- Entité Wikidata
    local entity = wd.getEntity(args.wikidata)
    if not entity then
        error('Pas d\'entité Wikidata pour l\'élément.')
    end
```
Can you figure out where it is getting `args.wikidata` on the article page (because this is not throwing an error on Wikipedia's side)?
Reopening this issue again.
`mw.wikibase.getEntity` will use the page title if a Wikidata item id is not passed (or is `nil`); this is in the Lua API documentation.
Oh, of course, that's what you were trying to say earlier. Thanks!
- [ ] Test KO with error

```
Test Titulaires: ERROR: LUA error in #invoke('Titulaires', 'tableauDesTitulaires') parent ('Modèle:Titulaires', {}) at ['Test Titulaires', 'Titulaires', '#invoke', '#invoke'] [string "Titulaires"]:904: Pas d'entité Wikidata pour l'élément.
```
Originally posted by @LeMoussel in https://github.com/tatuylonen/wikitextprocessor/issues/226#issuecomment-1994652325
After some digging, I am pretty confident in saying that this error can't be fixed as things currently stand.
Parser functions like `#property` and `#statement`, and in this case Lua `frame` objects with a `frame.args.wikidata` value, apparently (although I couldn't find documentation for this) have access to a Wikidata `Q123456789`-style identifier that links a Wikipedia page to a Wikidata item. This way, they can default to using the page's own Wikidata item if no Wikidata identifier is provided. These properties are set in the Tools menu on the right of a Wikipedia article, so they're not part of the source code. There's some indication (I think? I'm not sure I understand it) that creating a `[[wikidata:...]]` or `[[d:...]]` link can also work like this...

AFAICT, the data dumps we use don't have that metadata, so our `#statement`, `#property` and modules like French Wikipedia Module:Titulaires can't work properly.

Any work-arounds for this (polling a database somewhere to get the right Wikidata link for a page) seem costly in time. Doing it when creating the .db cache file also sounds like it would make things significantly slower there. The best solution would be to find a magical, easily parsed file somewhere among the dump files with this data... There might be some .sql.gz files like
> Interwiki link tracking records — `frwiki-20240301-iwlinks.sql.gz` (76.4 MB)
but I'll leave it alone in case someone else has anything simpler to work with.
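One possible (but still network-bound) work-around is the MediaWiki API's `prop=pageprops` query, which can return the `wikibase_item` page prop for a title. A sketch of the request construction and response parsing follows; no actual network call is made here, and the response shape shown in the test is a minimal hand-written example of the documented format:

```python
import json
from urllib.parse import urlencode


def pageprops_url(wiki_api, title):
    """Build a MediaWiki API request for a page's wikibase_item page prop.

    prop=pageprops with ppprop=wikibase_item is a documented MediaWiki API
    query; titles can also be batched (up to 50 per request) by joining
    them with '|', which is not shown here.
    """
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    return wiki_api + "?" + urlencode(params)


def item_id_from_response(body):
    """Pull the wikibase_item out of a query/pageprops JSON response,
    or return None if the page has no associated Wikidata item."""
    pages = json.loads(body)["query"]["pages"]
    for page in pages.values():
        props = page.get("pageprops", {})
        if "wikibase_item" in props:
            return props["wikibase_item"]
    return None
```

This still costs one HTTP round-trip per uncached title, so it doesn't remove the performance concern above; it just avoids needing a local copy of the page_props data.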