Open wetneb opened 8 years ago
@halfak: Just in case you are still interested in evaluating what proportion of citations do not have any identifier, I have run my citation parser on a fresh dump of the English Wikipedia.
Of course, this parser covers much more than just scholarly citations (it parses {{cite web}} for instance). It also misses a lot of citations that your method catches (all unformatted citations with an identifier matching your regular expressions). So the scope is quite different.
Here are a few quick stats:
$ wc -l enwiki_2016-06-01_CS1_citations.tsv
12743634
$ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | wc -l
955050
$ cat enwiki_2016-06-01_CS1_citations.tsv| grep "cite journal" | grep -v "ID_list" | wc -l
309305
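For anyone who prefers to get the proportion in one pass, here is a rough Python sketch of the same counts. It assumes, like the grep calls, that a line is a journal citation if it contains the string "cite journal" and that it carries an identifier if it contains "ID_list". The function name `identifier_stats` is made up for this illustration and is not part of any existing tool.

```python
def identifier_stats(lines):
    """Mirror the grep pipeline: count all lines, the "cite journal"
    lines, and the "cite journal" lines that lack "ID_list"."""
    total = journal = journal_no_id = 0
    for line in lines:
        total += 1
        if "cite journal" in line:
            journal += 1
            if "ID_list" not in line:
                journal_no_id += 1
    # Share of journal citations without any identifier (0.0 if none found).
    share = journal_no_id / journal if journal else 0.0
    return total, journal, journal_no_id, share


# Plugging in the counts reported in this thread:
print(f"{309305 / 955050:.1%}")  # prints 32.4%
```

On the real dump this would be called as `identifier_stats(open("enwiki_2016-06-01_CS1_citations.tsv", encoding="utf-8"))`, assuming the file is UTF-8.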
cc @nemobis who might also be interested in this dataset
Thanks! About 32% of "cite journal" templates without any identifier is much better than I thought.
Hi,
In some cases we might need to extract not just identifiers but also the rest of the metadata contained in {{cite}} templates. In that case the task is less trivial (author lists can be entered in many different ways, for instance). For this reason, I have wrapped the Lua code that parses citations on Wikipedia in a Python library, and the result is here: https://github.com/dissemin/wikiciteparser
Any comments / contributions / anything welcome!
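To illustrate why author lists are the awkward part, here is a small self-contained sketch that normalizes a few of the common spellings (`last`/`first`, `lastN`/`firstN`, `author`, `authorN`). This is not how wikiciteparser works (it wraps the Lua module, as said above) and it covers only a subset of the parameter names; `collect_authors` and its dict-based input are assumptions made up for the example.

```python
import re

# Matches last, last2, first, first3, author, author4, ...
AUTHOR_PARAM = re.compile(r"(last|first|author)(\d*)")


def collect_authors(params):
    """Build an ordered list of display names from a dict mapping
    template parameter names to their values (hypothetical input shape)."""
    by_index = {}
    for key, value in params.items():
        match = AUTHOR_PARAM.fullmatch(key.strip())
        value = value.strip()
        if not match or not value:
            continue
        field = match.group(1)
        # An unnumbered parameter (e.g. "last") is treated as number 1.
        index = int(match.group(2) or 1)
        by_index.setdefault(index, {})[field] = value

    names = []
    for index in sorted(by_index):
        entry = by_index[index]
        if "last" in entry:
            names.append(f"{entry.get('first', '')} {entry['last']}".strip())
        elif "author" in entry:
            names.append(entry["author"])
    return names


print(collect_authors({"last": "Smith", "first": "J.", "author2": "Jane Doe"}))
# prints ['J. Smith', 'Jane Doe']
```

Even this ignores free-text `authors=` lists, `vauthors`, and wikilinked names, which is the kind of case that makes reusing the existing Lua parser attractive.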