mediawiki-utilities / python-mwcites

MIT License

Do not match second slash and dot in DOI #11

Closed. nemobis closed 7 years ago

nemobis commented 7 years ago

The slash and the dot are the only reserved characters, according to https://www.doi.org/doi_handbook/2_Numbering.html#2.5

Cf. https://github.com/mediawiki-utilities/python-mwcites/issues/7

halfak commented 7 years ago

Can you add some test cases to demonstrate that this does the right thing (and not some likely wrong things)?

nemobis commented 7 years ago

Do you mean in the python-mwcites/datasets/mw_dump_stub.xml file? Maybe, but for now I'll focus on testing a regex that gets good output for me (on it.wiki).

halfak commented 7 years ago

I'd test it in https://github.com/nemobis/python-mwcites/blob/618032d7b649d2910b8252270443bc85fa16029e/mwcites/extractors/tests/test_doi.py
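A minimal sketch of what such test cases could look like, covering both the intended matches and a likely wrong one. The pattern `DOI_RE` and the helper `extract_dois` are hypothetical stand-ins written for illustration; they are not the regex or the API used in mwcites, and the DOIs are made-up examples.

```python
import re

# Hypothetical pattern mirroring the PR title: the suffix stops at a
# second slash or at a dot. This is an assumption, NOT the mwcites regex.
DOI_RE = re.compile(r"10\.[0-9]+/[^./\s}?,|]+")


def extract_dois(text):
    """Return every substring of `text` that matches DOI_RE (hypothetical helper)."""
    return DOI_RE.findall(text)


def test_extract_dois():
    # DOI inside a template: the closing braces are not part of the match.
    assert extract_dois("{{doi|10.1000/182}}") == ["10.1000/182"]
    # A second slash ends the match.
    assert extract_dois("doi:10.1000/abc/def") == ["10.1000/abc"]
    # A pipe (next template parameter) ends the match.
    assert extract_dois("doi=10.1000/182|title=Foo") == ["10.1000/182"]
    # A version-like number without a slash is not a DOI.
    assert extract_dois("version 10.2 released") == []


test_extract_dois()
```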

nemobis commented 7 years ago

Simple grepping, à la pbzip2 -dc itwiki-20170620-pages-articles-multistream.xml.bz2 | grep "10\." | grep -Eo '10\.[[:digit:]]+/[^./[:space:]}?,|]+', shows quite a few DOIs with dots from a couple of publishers (like 10.1016/j.bcp.2007.07.045), so maybe we should ignore the dot part.
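To illustrate the effect on the DOI quoted above, here is a rough Python translation of the grep pattern, next to a variant that allows dots in the suffix. Both patterns and the sample wikitext are assumptions made for this sketch; neither is the regex used in mwcites.

```python
import re

# Rough translation of the grep -Eo pattern above: dots are excluded
# from the suffix, so the match stops at the first dot after the slash.
STRICT = re.compile(r"10\.[0-9]+/[^./\s}?,|]+")

# Hypothetical variant that allows dots in the suffix (an assumption
# for comparison, not a pattern proposed in the thread).
RELAXED = re.compile(r"10\.[0-9]+/[^/\s}?,|]+")

# Made-up wikitext snippet containing the DOI quoted in the comment.
sample = "{{cite journal|doi=10.1016/j.bcp.2007.07.045 }}"

strict_matches = STRICT.findall(sample)
relaxed_matches = RELAXED.findall(sample)

print(strict_matches)   # ['10.1016/j']
print(relaxed_matches)  # ['10.1016/j.bcp.2007.07.045']
```

One trade-off of the relaxed variant is that a sentence-ending period right after a DOI would be captured as part of it, so some trailing-punctuation cleanup would likely be needed.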

nemobis commented 7 years ago

Let's continue on the issue