nemobis closed 7 years ago
Can you add some test cases to demonstrate that this does the right thing (and not some likely wrong things)?
Do you mean in the python-mwcites/datasets/mw_dump_stub.xml file? Maybe, but for now I'll focus on testing a regex that gets good output for me (on it.wiki).
Simple grepping à la pbzip2 -dc itwiki-20170620-pages-articles-multistream.xml.bz2 | grep "10\." | grep -Eo '10\.[[:digit:]]+/[^./[:space:]}?,|]+'
shows quite a few DOIs with dots in the suffix from a couple of publishers (like 10.1016/j.bcp.2007.07.045), so maybe we should ignore that part.
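To make the dot problem concrete, here is a minimal Python sketch of the same pattern. It is not the python-mwcites implementation: the names DOI_NO_DOTS, DOI_WITH_DOTS and extract are hypothetical, the POSIX classes [[:digit:]] and [[:space:]] are translated to \d and \s for Python's re module, and the second pattern is only one possible variant that allows dots and strips a trailing one afterwards.

```python
import re

# Same pattern as the grep above, with POSIX classes translated for Python's re.
# A dot in the suffix ends the match.
DOI_NO_DOTS = re.compile(r'10\.\d+/[^./\s}?,|]+')

# Hypothetical variant (an assumption, not the project's regex): dots are allowed
# in the suffix, and a trailing sentence-ending dot is stripped afterwards.
DOI_WITH_DOTS = re.compile(r'10\.\d+/[^/\s}?,|]+')


def extract(pattern, text):
    """Return all matches of pattern in text, without trailing dots."""
    return [m.group(0).rstrip('.') for m in pattern.finditer(text)]


sample = "{{cita|doi=10.1016/j.bcp.2007.07.045}} and see 10.1000/182."

truncated = extract(DOI_NO_DOTS, sample)
complete = extract(DOI_WITH_DOTS, sample)
print(truncated)
print(complete)
```

On this sample the first pattern cuts the Elsevier-style DOI down to 10.1016/j, while the variant keeps the full 10.1016/j.bcp.2007.07.045 and still drops the full stop after 10.1000/182.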
Let's continue on the issue
They are the only reserved characters, according to https://www.doi.org/doi_handbook/2_Numbering.html#2.5
Cf. https://github.com/mediawiki-utilities/python-mwcites/issues/7