A Wikipedia parser that generates a Darwin Core Archive from species pages that use the taxobox or speciesbox templates and their derivatives. The parser currently focuses on the English, German, Spanish and French Wikipedias and works on the article XML dumps.
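
As a rough illustration of how a species page might be recognised, the sketch below checks raw wikitext for a taxobox or speciesbox template. It is not the parser's actual code: the class name `TaxoboxDetector`, the method `isSpeciesPage`, and the heuristic that any template whose name ends in "taxobox" or "speciesbox" counts as a derivative are all assumptions made for this example, and only the English template names are covered.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of the parser: detects species templates in wikitext. */
public class TaxoboxDetector {

    // Assumption: a template counts as a taxobox derivative if its name (the text
    // between "{{" and the first "|", "}" or line break) ends in "taxobox" or "speciesbox".
    private static final Pattern TEMPLATE = Pattern.compile(
            "\\{\\{[^|}\\n]*?(taxobox|speciesbox)\\b",
            Pattern.CASE_INSENSITIVE);

    /** Returns true if the wikitext contains a matching template call. */
    public static boolean isSpeciesPage(String wikitext) {
        return TEMPLATE.matcher(wikitext).find();
    }
}
```

Under these assumptions, `TaxoboxDetector.isSpeciesPage("{{Speciesbox | taxon = Puma concolor}}")` returns true, while a page that only contains `{{Infobox person ...}}` is skipped.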

Multimedia, vernacular names and textual descriptions are extracted. Every section of a wiki page becomes a distinct description record, with the section title used as the description "type".
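
The section-to-record mapping can be sketched as follows. This is a minimal illustration under stated assumptions, not the parser's real implementation: the names `SectionSplitter` and `Description` are hypothetical, only level-2 headings (`== Title ==`) start a new record, and the text before the first heading is given the made-up type "general".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: turns the sections of a wiki page into description records. */
public class SectionSplitter {

    /** One description record: the section title is the type, the section body the text. */
    public static final class Description {
        public final String type;
        public final String text;

        Description(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Matches level-2 headings such as "== Habitat ==" on a line of their own.
    // Deeper headings ("=== Plumage ===") do not match and stay in the parent section.
    private static final Pattern HEADING = Pattern.compile(
            "^==[ \\t]*([^=\\n].*?)[ \\t]*==[ \\t]*$", Pattern.MULTILINE);

    public static List<Description> split(String wikitext) {
        List<Description> records = new ArrayList<>();
        Matcher m = HEADING.matcher(wikitext);
        String title = "general"; // assumed label for the lead text before the first heading
        int bodyStart = 0;
        while (m.find()) {
            // Everything between the previous heading and this one belongs to 'title'.
            addIfNotBlank(records, title, wikitext.substring(bodyStart, m.start()));
            title = m.group(1);
            bodyStart = m.end();
        }
        addIfNotBlank(records, title, wikitext.substring(bodyStart));
        return records;
    }

    // Skips sections that contain no text at all.
    private static void addIfNotBlank(List<Description> records, String type, String body) {
        String text = body.trim();
        if (!text.isEmpty()) {
            records.add(new Description(type, text));
        }
    }
}
```

Calling `SectionSplitter.split(...)` on a page with an introduction followed by "== Description ==" and "== Habitat ==" sections yields three records whose types are "general", "Description" and "Habitat".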

How to run it

    java -jar wikipedia.jar

Downloading and processing the entire English Wikipedia takes a long time. Depending on your network and CPU, expect the program to run for several days.