spencermountain / wtf_wikipedia

a pretty-committed wikipedia markup parser
https://observablehq.com/@spencermountain/wtf_wikipedia
MIT License

Support collapsible list #17

Closed richard5mith closed 5 years ago

richard5mith commented 9 years ago

Now, I've tried to write my own parser, so I know how hard this is, and I constantly want to stab Wikipedia editors in the face.

But it's not parsing release dates correctly. For a page like https://en.wikipedia.org/wiki/Tomb_Raider_(2013_video_game), there's nothing about release dates in the parsed JSON at all.

DBPedia can't get it right either, so you're not alone.

spencermountain commented 9 years ago

haha, thanks richard. I'll take a look at it.

spencermountain commented 9 years ago

hey, that page works pretty well for me:

var wtf_wikipedia=require("wtf_wikipedia")
wtf_wikipedia.from_api("Tomb_Raider_(2013_video_game)", 'en', function(s) {
  console.log(wtf_wikipedia.parse(s).infobox)
})

richard5mith commented 9 years ago

Running that exact script I get this, which contains no releasedates section at all...

{ title: { text: 'Tomb Raider', links: undefined },
  image: { text: '256px', links: undefined },
  developer: { text: 'Crystal Dynamics', links: [ [Object] ] },
  publisher: 
   { text: 'Square Enix Bandai Namco Games (Australia) Feral Interactive',
     links: [ [Object], [Object], [Object] ] },
  director: 
   { text: 'Noah Hughes Daniel Chayer Daniel Neuburger',
     links: undefined },
  producer: 
   { text: 'Kyle Peschel Alexander W. Offermann',
     links: undefined },
  programmer: { text: 'Scott Krotz', links: undefined },
  artist: { text: 'Brian Horton', links: undefined },
  writer: 
   { text: 'Rhianna Pratchett Susan O\'Connor',
     links: [ [Object] ] },
  composer: { text: 'Jason Graves', links: [ [Object] ] },
  series: { text: 'Tomb Raider', links: [ [Object] ] },
  platforms: 
   { text: 'Microsoft Windows OS X PlayStation 3  PlayStation 4  Xbox 360  Xbox One',
     links: [ [Object], [Object], [Object], [Object], [Object], [Object] ] },
  genre: { text: 'Action-adventure', links: [ [Object] ] },
  modes: 
   { text: 'Single-player, multiplayer',
     links: [ [Object], [Object] ] } }

That page is unique (among the pages I've parsed, anyway) in containing a collapsible release dates section, which even seems to flummox DBPedia.

spencermountain commented 9 years ago

ah! haha - https://github.com/spencermountain/wtf_wikipedia/blob/master/src/parse/parse_infobox.js#L8

i'll take a look at what it would mean to represent the collapsible list template

spencermountain commented 7 years ago

for reference, here's the syntax

| engine = 
| released = {{collapsible list|title=5 March 2013|'''Microsoft Windows''', '''PlayStation 3''', '''Xbox 360'''{{Video game release|WW|5 March 2013}}'''OS X'''{{Video game release|WW|23 January 2014}}'''PlayStation 4''', '''Xbox One'''{{Video game release|NA|28 January 2014|EU|31 January 2014}}'''Linux'''{{Video game release|WW|27 April 2016}}}}
| genre = 
}}
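
To make the nesting in that `released` value concrete, here is a minimal sketch that pulls the title and the per-platform dates out of it with plain regular expressions. `parseCollapsibleList` is a hypothetical helper written for illustration, not part of wtf_wikipedia, and it assumes the exact shape shown above: bolded platform names followed by a `{{Video game release}}` template whose arguments alternate region and date.

```javascript
// Hypothetical helper (not wtf_wikipedia's API): extracts the title and the
// per-platform release dates from a {{collapsible list}} infobox value.
function parseCollapsibleList(wikitext) {
  const titleMatch = wikitext.match(/\|\s*title\s*=\s*([^|]*)\|/i);
  const title = titleMatch ? titleMatch[1].trim() : null;

  // One or more '''bolded''' platform names, then a {{Video game release}} template.
  const groupPattern =
    /((?:'''[^']+'''[,\s]*)+)\{\{\s*Video game release\s*\|([^{}]*)\}\}/gi;

  const releases = [];
  for (const match of wikitext.matchAll(groupPattern)) {
    const platforms = [...match[1].matchAll(/'''([^']+)'''/g)].map((m) => m[1]);
    // Assumption: template arguments alternate region, date, region, date, ...
    const args = match[2].split("|").map((s) => s.trim());
    const dates = {};
    for (let i = 0; i + 1 < args.length; i += 2) {
      dates[args[i]] = args[i + 1];
    }
    releases.push({ platforms, dates });
  }
  return { title, releases };
}

// The `released` value from the Tomb Raider infobox, split up for readability.
const sample = [
  "{{collapsible list|title=5 March 2013|",
  "'''Microsoft Windows''', '''PlayStation 3''', '''Xbox 360'''{{Video game release|WW|5 March 2013}}",
  "'''OS X'''{{Video game release|WW|23 January 2014}}",
  "'''PlayStation 4''', '''Xbox One'''{{Video game release|NA|28 January 2014|EU|31 January 2014}}",
  "'''Linux'''{{Video game release|WW|27 April 2016}}",
  "}}",
].join("");

console.log(JSON.stringify(parseCollapsibleList(sample), null, 2));
```

This only covers the one pattern in this infobox. Platform names containing apostrophes, other nested templates, or list items separated by extra pipes would need a real template parser rather than regexes.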

spencermountain commented 5 years ago

hey, collapsible lists are better supported now in our new template parsing setup, from 6.3.0. here's how to retrieve that data now, as the api changed earlier this year:

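var wtf = require('wtf_wikipedia')
// fetch the page, then print the first infobox as plain key/value pairs
wtf.fetch('Tomb_Raider_(2013_video_game)', 'en', function(err, doc) {
  if (err) throw err
  console.log(doc.infoboxes(0).keyValue());
});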