Extracting more metadata (such as authors)

wetneb commented 8 years ago

Hi,

In some cases we might need to extract not just identifiers but also the rest of the metadata contained in {{cite}} templates. That task is less trivial (author lists can be entered in many different ways, for instance). For this reason, I have wrapped the Lua code that parses citations on Wikipedia in a Python library, and the result is here: https://github.com/dissemin/wikiciteparser
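To illustrate why author lists are awkward, here is a minimal sketch that collects names from a few common parameter spellings (`last1`/`first1`, unnumbered `last`/`first`, and a single `author` field). It is not how wikiciteparser works, since that library delegates to the Lua module; the function name `author_names` and the plain-dict input are assumptions made for this example, and many real variants are not covered.

```python
import re


def author_names(params):
    """Collect author names from a dict of template parameters.

    Hypothetical helper for illustration only: it handles last/first
    pairs (numbered or not) and a pre-joined author field.
    """
    found = {}
    for key, value in params.items():
        match = re.fullmatch(r"(last|first|author)(\d*)", key)
        if not match or not value.strip():
            continue
        role = match.group(1)
        # An unnumbered parameter is treated as the first author.
        index = int(match.group(2) or 1)
        found.setdefault(index, {})[role] = value.strip()

    names = []
    for index in sorted(found):
        parts = found[index]
        if "author" in parts:
            # The name was entered as a single string.
            names.append(parts["author"])
        else:
            pieces = (parts.get("first"), parts.get("last"))
            names.append(" ".join(p for p in pieces if p))
    return names


numbered = {"last1": "Lambek", "first1": "Joachim",
            "last2": "Scott", "first2": "P. J."}
single = {"author": "Joachim Lambek", "title": "Some title"}
print(author_names(numbered))
print(author_names(single))
```

Even this small sketch needs a rule for unnumbered parameters and for names entered as one string, which is why reusing the existing Lua parser is attractive.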

Any comments / contributions / anything welcome!

wetneb commented 8 years ago

@halfak: Just in case you are still interested in evaluating what proportion of citations do not have any identifier, I have run my citation parser on a fresh dump of the English Wikipedia.

The dump is on Zenodo.

Of course, this parser covers much more than just scholarly citations (it parses {{cite web}} for instance). It also misses a lot of citations that your method catches (all unformatted citations with an identifier matching your regular expressions). So the scope is quite different.

Here are a few quick stats:

the total number of citations extracted:

$ wc -l enwiki_2016-06-01_CS1_citations.tsv
12743634

the number of "cite journal" instances:

$ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | wc -l
955050

"cite journal" instances without any external identifier:

$ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | grep -v "ID_list" | wc -l
309305
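For anyone who prefers a single pass over the file, the three counts above can be reproduced with a short Python sketch. It applies the same substring tests as the grep commands ("cite journal" present, "ID_list" absent) and makes no assumption about the column layout of the TSV; the function name `count_citations` and the sample rows are made up for illustration.

```python
def count_citations(lines):
    """Return (total, cite_journal, cite_journal_without_id) for an iterable of lines."""
    total = journal = journal_no_id = 0
    for line in lines:
        total += 1
        if "cite journal" in line:
            journal += 1
            # Same test as `grep -v "ID_list"`: no identifier list on this line.
            if "ID_list" not in line:
                journal_no_id += 1
    return total, journal, journal_no_id


# Made-up rows, only to show the counting logic.
sample = [
    'cite journal\t{"Title": "A", "ID_list": {"DOI": "10.1000/x"}}\n',
    'cite journal\t{"Title": "B"}\n',
    'cite web\t{"Title": "C"}\n',
]
print(count_citations(sample))

# On the real dump, one would pass the open file instead:
# with open("enwiki_2016-06-01_CS1_citations.tsv", encoding="utf-8") as f:
#     print(count_citations(f))

# Share of "cite journal" instances without identifier, from the counts above.
print(round(100 * 309305 / 955050, 1))
```

With the figures reported above, the share of "cite journal" instances without any identifier comes out at about 32.4 %.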

cc @nemobis who might also be interested in this dataset

nemobis commented 8 years ago

Thanks! 30 % of "cite journal" without any identifier is much better than I thought.

mediawiki-utilities / python-mwcites
