mdoering / wikipedia-dwca

A parser for Wikipedia species pages using the taxobox template

A Wikipedia parser that generates a Darwin Core Archive for species pages using the taxobox or speciesbox templates and their derivatives. The parser currently focuses on the English, German, Spanish and French Wikipedias and works on the article XML dumps.

Multimedia, vernacular names and textual descriptions are extracted. Every section of a wiki page becomes a distinct description record, with the section title used as the description "type".
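The section-to-record mapping can be illustrated with a minimal sketch. This is not the parser's actual code: the class `SectionSplitter`, the record type `Description` and the choice to split only on level-2 headings (`== Title ==`) and to skip the lead section are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of wikipedia-dwca.
final class SectionSplitter {

    /** One description record: the section title is the "type", the section body is the text. */
    static final class Description {
        final String type;
        final String text;

        Description(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // A level-2 wikitext heading on its own line, e.g. "== Habitat ==".
    // Deeper headings ("=== Size ===") do not match and stay in the parent section's text.
    private static final Pattern HEADING =
        Pattern.compile("^==[ \\t]*([^=\\n].*?)[ \\t]*==[ \\t]*$", Pattern.MULTILINE);

    /** Splits wikitext into one record per level-2 section (assumption: the lead text is skipped). */
    static List<Description> split(String wikitext) {
        List<Description> out = new ArrayList<>();
        Matcher m = HEADING.matcher(wikitext);
        String title = null;
        int bodyStart = 0;
        while (m.find()) {
            if (title != null) {
                add(out, title, wikitext.substring(bodyStart, m.start()));
            }
            title = m.group(1);
            bodyStart = m.end();
        }
        if (title != null) {
            add(out, title, wikitext.substring(bodyStart));
        }
        return out;
    }

    // Empty sections produce no record.
    private static void add(List<Description> out, String title, String body) {
        String text = body.trim();
        if (!text.isEmpty()) {
            out.add(new Description(title, text));
        }
    }
}
```

Calling `SectionSplitter.split(wikitext)` on a page with "Description" and "Habitat" sections would yield two records whose `type` values are those two titles. A real implementation would also need to strip wiki markup from the section bodies.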

How to run it

    java -jar wikipedia.jar

Downloading and processing the entire English Wikipedia takes a long time. Depending on your network and CPU, expect the program to run for several days.

Supported Wikitext Templates

Taxon information

For automatic taxoboxes, the classification is scraped from the Taxonomy templates.
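On the English Wikipedia, each taxonomy template names a rank and a parent taxon, so a classification can be assembled by following parent links upward. The sketch below shows that idea only. The class `TaxonomyWalker`, the `Node` type and the in-memory map standing in for the already-parsed templates are hypothetical, and the parser's real data structures may differ.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of wikipedia-dwca.
final class TaxonomyWalker {

    /** The rank and parent values taken from one taxonomy template. */
    static final class Node {
        final String rank;
        final String parent; // null when the template has no parent

        Node(String rank, String parent) {
            this.rank = rank;
            this.parent = parent;
        }
    }

    /**
     * Returns "rank: name" entries from the given taxon up to the highest known ancestor.
     * The walk stops at a missing template or when a taxon repeats (cycle guard).
     */
    static List<String> classification(String taxon, Map<String, Node> templates) {
        List<String> lineage = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        String current = taxon;
        while (current != null && seen.add(current)) {
            Node node = templates.get(current);
            if (node == null) {
                break;
            }
            lineage.add(node.rank + ": " + current);
            current = node.parent;
        }
        return lineage;
    }
}
```

The cycle guard matters because the templates are community-edited, and a mistaken parent value could otherwise make the walk loop forever.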

Palaeo templates

List templates

Citation templates

General templates