jpatokal / mediawiki-gateway

Ruby framework for MediaWiki API manipulation

Ability to grab sections of a Wikipedia article #15

Closed branliu0 closed 10 years ago

branliu0 commented 13 years ago

Hi,

For my own project, I'm currently writing a Ruby script built on top of this gem and Nokogiri that extracts the content of a single section of a Wikipedia article. For example, for the article on bananas (http://en.wikipedia.org/wiki/Banana), I might only want to grab the section on Taxonomy and nothing else. With my script you would just specify the page title and the section number.

I'm interested in contributing this feature to this project, but I'm wondering whether it's appropriate. The functionality isn't supported by the API, and I'm getting it to work by parsing the HTML, so Wikimedia provides no guarantee that this will always work. This feature also wouldn't work on all Wikimedia projects, since not all of them have a Table of Contents or are broken down into sections. For example, it works on Wikipedia and Wiktionary, but would not work for Wikisource.

What do you think?

Best, Brandon

jpatokal commented 13 years ago

Actually, that functionality is available in the API, through the rvsection parameter of the Query - Revisions API call:

http://www.mediawiki.org/wiki/API:Query_-_Properties#revisions_.2F_rv

So you're more than welcome to extend the get method or write a new get_section method to handle this.
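
For illustration, here is roughly what such a request could look like when built by hand. This is only a sketch using Ruby's standard library: the helper name `section_query_url` is made up for this example, and the parameter set (`action=query`, `prop=revisions`, `rvprop=content`) is an assumption about what a minimal Revisions query needs alongside `rvsection`.

```ruby
require 'uri'

# Hypothetical helper (not part of the gem): build the API URL that asks
# for the wikitext of a single section of a page via rvsection.
def section_query_url(api_url, title, section)
  params = {
    'action'    => 'query',
    'prop'      => 'revisions',
    'rvprop'    => 'content',   # return the revision text itself
    'rvsection' => section,     # section number; 0 is the lead section
    'titles'    => title,
    'format'    => 'json'
  }
  uri = URI(api_url)
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

puts section_query_url('https://en.wikipedia.org/w/api.php', 'Banana', 0)
```

Fetching that URL should return the wikitext of the requested section only, so no HTML parsing is needed.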

branliu0 commented 13 years ago

Hmm, thanks for the response! I shifted to working on a different part of my project, so I'll come back to this when I need to do some Wikipedia scraping again. I didn't get a chance to look into that API call in depth.

blackwinter commented 10 years ago

This should be possible now by specifying the rvsection to retrieve (see #61):

MediaWiki::Gateway.new('https://en.wikipedia.org/w/api.php').get('Banana', 'rvsection' => 0)

jpatokal commented 10 years ago

Closing for now, although pull requests to package this up more cleanly are welcome.
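
One possible shape for such a wrapper, as a rough sketch only: a `get_section` helper like the one suggested earlier in this thread. It is not part of the gem. It assumes the gateway's `get` accepts an options hash, as in the `get('Banana', 'rvsection' => 0)` call above, and the integer check is an added assumption about what counts as a valid section number.

```ruby
# Hypothetical helper, not part of mediawiki-gateway: fetch the wikitext
# of one section. `gateway` is any object that responds to
# get(title, options), which MediaWiki::Gateway is assumed to do here.
def get_section(gateway, title, section)
  unless section.is_a?(Integer) && section >= 0
    raise ArgumentError, "section must be a non-negative integer, got #{section.inspect}"
  end
  # Delegate to the existing get call, passing the section through as rvsection.
  gateway.get(title, 'rvsection' => section)
end
```

With the gem loaded, usage would presumably look like `get_section(MediaWiki::Gateway.new('https://en.wikipedia.org/w/api.php'), 'Banana', 0)`, where section 0 is the lead text before the first heading.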