spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License
779 stars, 129 forks

resolve infobox keys #561

Open deneb2 opened 1 year ago

deneb2 commented 1 year ago

I was trying the Toronto Raptors example.

Calling let fields = doc.infobox(0).json() returns JSON in which the infobox keys are exactly those in the wikitext (for example, coach: { text: 'Darko Rajaković', links: [ [Object] ] }). With wtf-plugin-html, the rendered keys appear to be the same ones, only capitalized and cleaned up.

But on the Wikipedia page, the coach entry is actually labelled Head Coach. The missing step is the one that looks at the infobox template and adjusts the key names for rendering.
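As a rough illustration of that missing step, here is a minimal sketch that renames raw infobox keys to display labels. The `relabel` helper and the `labelsByParam` map are hypothetical and not part of wtf_wikipedia; the sketch assumes such a map has been obtained from the infobox template by some other means, and the `fields` object only mimics the shape of the JSON shown above.

```javascript
// Hypothetical map from raw wikitext parameter names to rendered labels.
// wtf_wikipedia does not provide this; it would have to come from the template.
const labelsByParam = { coach: 'Head coach' }

// Return a copy of `fields` with each key replaced by its display label.
// Keys without a known label are kept unchanged.
function relabel(fields, labels) {
  const out = {}
  for (const [key, value] of Object.entries(fields)) {
    const hasLabel = Object.prototype.hasOwnProperty.call(labels, key)
    out[hasLabel ? labels[key] : key] = value
  }
  return out
}

// Shaped like the output of doc.infobox(0).json(), trimmed for illustration.
const fields = {
  name: { text: 'Toronto Raptors' },
  coach: { text: 'Darko Rajaković' }
}

const relabelled = relabel(fields, labelsByParam)
console.log(Object.keys(relabelled)) // [ 'name', 'Head coach' ]
```

The values are passed through untouched, so links and other metadata on each field survive the renaming.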

Is this feature missing, or did I do something wrong?

spencermountain commented 1 year ago

You know what, I find this very confusing too. You can see that this library simply grabs whatever's in the wikitext, and the names of the keys are specified in the template documentation.


But yeah, there seems to be a ton of formatting that Wikipedia does at render time. These formatting rules don't appear to be anywhere in wikipedia-land, and must be in the source code of the Parsoid HTML renderer. I've looked around before and come up with nothing. Please let me know if you can find where this logic is stored, and whether it is available to be reused in projects like this one.

Yeah, as you found, wtf-plugin-html doesn't do anything clever, but it really should. Cheers

einSelbst commented 1 year ago

Out of curiosity I took a look at this. Being a noob in all of this, it seems to me that Wikipedia is a cascade of templates within templates, and the key in question comes from a "sub-template".

So on this page https://en.wikipedia.org/w/index.php?title=Toronto_Raptors&action=edit

it says "Pages transcluded onto the current version of this page" and mentions the template for the infobox of basketball clubs:

https://en.wikipedia.org/w/index.php?title=Template:Infobox_basketball_club&action=edit

which is referenced in the 5th line of the article's wikitext:

{{Infobox basketball club

In the template source, have a look for:

| label19 = Head coach{{#if:{{{coaches|}}}|es}}
| data19 = {{if empty|{{{coaches|}}}|{{{coach|}}}}}
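To show how label/data pairs like these could be turned into a lookup table, here is a minimal sketch in plain JavaScript. The function names (`stripTemplates`, `paramNames`, `labelMap`) are made up for illustration and are not part of wtf_wikipedia. It assumes the template uses numbered labelN / dataN lines as above, and it simply drops parser functions such as {{#if:...}}, so the conditional "es" plural is ignored.

```javascript
// Remove {{{params}}} and {{templates}} from a label, innermost first.
function stripTemplates(text) {
  let prev
  let out = text
  do {
    prev = out
    out = out.replace(/\{\{\{[^{}]*\}\}\}/g, '').replace(/\{\{[^{}]*\}\}/g, '')
  } while (out !== prev)
  return out.trim()
}

// Collect the parameter names used in a data line, e.g. {{{coach|}}} -> 'coach'.
function paramNames(text) {
  const matches = text.matchAll(/\{\{\{([^{}|]+)\|?[^{}]*\}\}\}/g)
  return [...matches].map((m) => m[1].trim())
}

// Build a { parameterName: label } map from infobox template source.
function labelMap(templateSource) {
  const labels = {}
  const data = {}
  for (const line of templateSource.split('\n')) {
    const m = line.match(/^\s*\|\s*(label|data)(\d+)\s*=\s*(.*)$/)
    if (!m) continue
    const [, kind, n, value] = m
    if (kind === 'label') labels[n] = stripTemplates(value)
    else data[n] = paramNames(value)
  }
  const map = {}
  for (const [n, params] of Object.entries(data)) {
    if (!labels[n]) continue // data row without a label
    for (const p of params) map[p] = labels[n]
  }
  return map
}

const source = [
  '| label19 = Head coach{{#if:{{{coaches|}}}|es}}',
  '| data19 = {{if empty|{{{coaches|}}}|{{{coach|}}}}}'
].join('\n')

const map = labelMap(source)
console.log(map) // { coaches: 'Head coach', coach: 'Head coach' }
```

A regex approach like this only covers simple rows. Labels that depend on other parameters, nested sub-templates, or Lua modules would need real template expansion.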

HTH

PS: sorry if this was already clear as the template was already mentioned in the initial question