mediawiki-utilities / python-mwcites

MIT License

Extracting more metadata (such as authors) #10

Open wetneb opened 8 years ago

wetneb commented 8 years ago

Hi,

In some cases we might need to extract not just identifiers but also the rest of the metadata contained in {{cite}} templates. That task is less trivial (author lists, for instance, can be entered in many different ways). For this reason, I have wrapped the Lua code that parses citations on Wikipedia in a Python library; the result is here: https://github.com/dissemin/wikiciteparser
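To illustrate why author lists are the awkward part, here is a minimal standalone sketch. It is not the wikiciteparser code (which wraps the Lua module); the function name `collect_authors` is hypothetical, and it assumes the template parameters have already been read into a plain dict. It only handles the `last`/`first`, `lastN`/`firstN`, `authorN`, and free-text `authors` forms.

```python
import re


def collect_authors(params):
    """Normalize a few common author parameter styles into one list.

    `params` is assumed to be a dict of template parameter names to values,
    e.g. {"last1": "Doe", "first1": "Jane"}. Hypothetical helper, for
    illustration only.
    """
    numbered = {}
    for key, value in params.items():
        # Matches last, last1, first2, author, author3, ... but not "authors".
        m = re.fullmatch(r"(last|first|author)(\d*)", key.strip().lower())
        if not m or not value.strip():
            continue
        field = m.group(1)
        index = int(m.group(2) or 1)  # "last" is treated like "last1"
        numbered.setdefault(index, {})[field] = value.strip()

    authors = []
    for index in sorted(numbered):
        entry = numbered[index]
        if "last" in entry:
            authors.append({"last": entry["last"], "first": entry.get("first", "")})
        elif "author" in entry:
            authors.append({"name": entry["author"]})

    # Fall back to a free-text list, split on semicolons only.
    if not authors and params.get("authors", "").strip():
        authors = [
            {"name": name.strip()}
            for name in params["authors"].split(";")
            if name.strip()
        ]
    return authors
```

Even this small subset needs three code paths, and it ignores aliases, editors, and "et al." handling, which is why reusing the existing Lua logic seemed safer than reimplementing it.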

Any comments / contributions / anything welcome!

wetneb commented 8 years ago

@halfak: Just in case you are still interested in evaluating what proportion of citations do not have any identifier, I have run my citation parser on a fresh dump of the English Wikipedia.

The dump is on Zenodo.

Of course, this parser covers much more than just scholarly citations (it parses {{cite web}} for instance). It also misses a lot of citations that your method catches (all unformatted citations with an identifier matching your regular expressions). So the scope is quite different.
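As a rough illustration of the difference in scope, an identifier-based approach can pick up a DOI anywhere in the wikitext, whether or not it sits inside a template. The sketch below is an assumption for illustration: the pattern `DOI_RE` and the function `find_dois` are hypothetical and are not the regular expressions used in mwcites.

```python
import re

# Simplified DOI pattern: "10.", a 4-9 digit registrant code, "/", then a
# suffix that stops at whitespace or common wikitext delimiters.
DOI_RE = re.compile(r"\b10\.\d{4,9}/[^\s|\]}<>\"]+")


def find_dois(wikitext):
    """Return DOI-like strings found anywhere in the text."""
    # Strip trailing sentence punctuation that the suffix pattern may swallow.
    return [m.group(0).rstrip(".,;") for m in DOI_RE.finditer(wikitext)]


text = (
    "Smith, J. (2010). Some title. doi:10.1000/xyz123. "
    "See also {{cite journal|doi=10.1234/abc.def|title=T}}"
)
print(find_dois(text))
```

The first match comes from an unformatted reference that a template parser would never see, while a `{{cite web}}` with no identifier would be invisible to this kind of search.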

Here are a few quick stats:

cc @nemobis who might also be interested in this dataset

nemobis commented 8 years ago

Thanks! 30% of "cite journal" without any identifier is much better than I thought.