molybdenum-99 / infoboxer

Wikipedia information extraction library
MIT License

Infoboxer

Infoboxer is a pure-Ruby Wikipedia (and generic MediaWiki) client and parser, targeting information extraction (hence the name).

It can be useful in tasks like:

The whole idea is: you can have any Wikipedia page as a parsed tree with an obvious structure, you can navigate that tree easily, and you have a bunch of high-level helper methods, so typical information extraction tasks should be super-easy, one-liners in the best cases.

(For those already thinking "Why should you do this, we already have DBPedia?" -- please read the "Reasons" page in our wiki.)

Showcase

Infoboxer.wikipedia.
  get('Breaking Bad (season 1)').
  sections('Episodes').templates(name: 'Episode table').
  fetch('episodes').templates(name: /^Episode list/).
  fetch_hashes('EpisodeNumber', 'EpisodeNumber2', 'Title', 'ShortSummary')
# => [{"EpisodeNumber"=>#<Var(EpisodeNumber): 1>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 1>, "Title"=>#<Var(Title): Pilot>, "ShortSummary"=>#<Var(ShortSummary): Walter White, a 50-year old che...>},
#     {"EpisodeNumber"=>#<Var(EpisodeNumber): 2>, "EpisodeNumber2"=>#<Var(EpisodeNumber2): 2>, "Title"=>#<Var(Title): Cat's in the Bag...>, "ShortSummary"=>#<Var(ShortSummary): Walt and Jesse try to dispose o...>},
#     ...and so on

Do you feel it now?

You can also take a look at the Showcase.
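The result above is a list of hashes whose values are parsed nodes, not strings. If you need plain data (say, for CSV or JSON export), a small helper along these lines may be enough. This is only a sketch: `plain_rows` and `FakeVar` are hypothetical names, not part of Infoboxer, and it assumes each value responds to `to_text` (the method shown in "Play with page" below).

```ruby
# Hypothetical helper, not part of Infoboxer: turn rows of node-valued
# hashes into rows of plain strings.
# Assumption: every value responds to #to_text.
def plain_rows(rows)
  rows.map do |row|
    row.map { |key, node| [key, node.to_text.strip] }.to_h
  end
end

# Stand-in for a parsed node, so the sketch runs without network access.
FakeVar = Struct.new(:to_text)

rows = [
  { 'EpisodeNumber' => FakeVar.new('1'), 'Title' => FakeVar.new("Pilot\n") }
]
p plain_rows(rows)
```

With the real gem, you would pass the array returned by `fetch_hashes` instead of the stand-in `rows`.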

Usage

Install gem

Install it as usual: add gem 'infoboxer' to your Gemfile, then run bundle install.

Or just [sudo] gem install infoboxer if you prefer.

Grab the page

# From English Wikipedia
page = Infoboxer.wikipedia.get('Argentina')
# or
page = Infoboxer.wp.get('Argentina')

# From other language Wikipedia:
page = Infoboxer.wikipedia('fr').get('Argentina')

# From any wiki with the same engine:
page = Infoboxer.wiki('http://companywiki.com').get('Our Product')

See more examples and options at Retrieving pages.
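If you want the same title from several language editions, the calls above can be wrapped in a tiny helper. A sketch only: `fetch_editions` and its `client_for` parameter are hypothetical, not part of Infoboxer. The client is passed in as a callable, so the helper itself does not depend on the gem or on network access.

```ruby
# Hypothetical helper, not part of Infoboxer: fetch one title from several
# language editions. `client_for` is any callable that maps a language code
# to an object responding to #get(title).
def fetch_editions(title, langs, client_for:)
  langs.each_with_object({}) do |lang, pages|
    pages[lang] = client_for.call(lang).get(title)
  end
end

# With the gem installed, this could be wired up like so:
#   require 'infoboxer'
#   pages = fetch_editions('Argentina', %w[en fr],
#                          client_for: ->(lang) { Infoboxer.wikipedia(lang) })
#   pages['fr'] # the page from French Wikipedia
```

Keep in mind that page titles often differ between language editions, so a single title may not resolve everywhere.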

Play with page

Basically, a page is a tree of Nodes; you can think of it as some kind of DOM.

So, you can navigate it:

# Simple traversing and inspect
node = page.children.first.children.first
node.to_tree
node.to_text

# Various lookups
page.lookup(:Template, name: /^Infobox/)

See Tree navigation basics.
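To get a feeling for the "tree of Nodes" idea without touching the network, here is a toy model in plain Ruby. It is not Infoboxer's implementation: `ToyNode` and `toy` are made-up names, and the recursive `lookup` below only imitates the call style of `page.lookup(:Template, name: /^Infobox/)`.

```ruby
# Toy model of a parsed page: every node has a type, a hash of attributes,
# and child nodes. This is NOT how Infoboxer is implemented.
ToyNode = Struct.new(:type, :attrs, :children) do
  # Depth-first search for descendants of the given type whose attributes
  # match every condition (compared with ===, so regexps work).
  def lookup(type, **conditions)
    found = []
    children.each do |child|
      found << child if child.matches?(type, conditions)
      found.concat(child.lookup(type, **conditions))
    end
    found
  end

  def matches?(wanted_type, conditions)
    type == wanted_type &&
      conditions.all? { |key, pattern| pattern === attrs[key] }
  end
end

# Small constructor with defaults, to keep the tree literal readable.
def toy(type, attrs = {}, children = [])
  ToyNode.new(type, attrs, children)
end

page = toy(:Page, { title: 'Argentina' }, [
  toy(:Template, { name: 'Infobox country' }),
  toy(:Paragraph, {}, [
    toy(:Template, { name: 'lang' }),
    toy(:Text, { text: 'Argentina is a country...' })
  ])
])

infoboxes = page.lookup(:Template, name: /^Infobox/)
puts infoboxes.map { |node| node.attrs[:name] }.inspect
# prints ["Infobox country"]
```

The point of the sketch is that a lookup walks the whole subtree, so you do not need to know how deeply a template is nested to find it.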

On top of the basic navigation, Infoboxer adds some useful shortcuts for convenience and brevity, which allow things like this:

page.section('Episodes').tables.first

See Navigation shortcuts.

To put it all in one piece, also take a look at Data extraction tips and tricks.

infoboxer executable

Just try the infoboxer command.

Without any options, it starts an IRB session with infoboxer required and included into the main namespace.

With the -w option, it provides a shortcut to the MediaWiki instance you want. Like this:

$ infoboxer -w https://en.wikipedia.org/w/api.php
> get('Argentina')
 => #<Page(title: "Argentina", url: "https://en.wikipedia.org/wiki/Argentina"): ....

You can also use shortcuts like infoboxer -w wikipedia for common wikis (and, just for fun, infoboxer -wikipedia also works).

Advanced topics

Compatibility

As of now, Infoboxer is reported to be compatible with any MRI Ruby since 2.0.0 (1.9.3 was supported previously and dropped as of Infoboxer 0.2.0). In Travis CI tests, JRuby fails due to a bug in old Java 7/Java 8 SSL certificate support (see here), and Rubinius fails 3 specs out of 500 for reasons that have not been investigated yet.

Therefore, those Ruby implementations are excluded from the Travis config, though they may still work for you.

Links

License

MIT.