mediawiki-utilities / python-mwcites

MIT License
<doi>+<doi> is a common pattern #4

Open halfak opened 9 years ago

halfak commented 9 years ago

Another type of failure I see looks like this: 10.1086/591526+10.1088/0004-637X/706/1/L203

I'm not sure how we'd be able to tell that a "+" is not part of the DOI.

When I searched for this exact string, I found this listing: http://arxiv.org/abs/0805.4758. It seems that both DOIs are associated with the same paper: one is the paper itself and the other is an erratum for it!

I'm thinking that we might get high fitness by having a special rule in the parser for splitting characters like "+&?". If we see one of them right before some whitespace or a new DOI_START, then we stop reading the DOI.
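
A rough sketch of that rule, to make the idea concrete. This is not the mwcites parser: the `DOI_START` regex, the `read_dois` name, and treating end of input like whitespace are all assumptions made for illustration.

```python
import re

# Assumption: a DOI starts with "10.", 4-9 digits, then "/".
# The real DOI_START token in the parser may be defined differently.
DOI_START = re.compile(r"10\.\d{4,9}/")
SPLITTERS = "+&?"  # the characters named in the comment above


def read_dois(text, splitters=SPLITTERS):
    """Return DOI-like strings from text.

    A splitter character ends the current DOI only when it is followed
    by whitespace, end of input, or the start of another DOI.
    """
    dois = []
    pos = 0
    while True:
        start = DOI_START.search(text, pos)
        if start is None:
            return dois
        i = start.end()
        while i < len(text) and not text[i].isspace():
            if text[i] in splitters:
                rest = text[i + 1:]
                if not rest or rest[0].isspace() or DOI_START.match(rest):
                    break  # splitter is a boundary, not part of the DOI
            i += 1
        dois.append(text[start.start():i])
        pos = i


print(read_dois("10.1086/591526+10.1088/0004-637X/706/1/L203"))
# ['10.1086/591526', '10.1088/0004-637X/706/1/L203']
```

A "+" that is followed by ordinary DOI characters (for example `10.1000/a+b`) is left alone, so DOIs that legitimately contain a splitter are not cut.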

halfak commented 9 years ago

I'm also seeing DOIs like this: "10.1002/ajp.22007/abstract;jsessionid=397B42DDD36E4F654BAB381E3104ABB3.d02t04". So we should include the semicolon in the list of splitting characters.
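
One wrinkle: in this example the ";" is followed by "jsessionid", not by whitespace or a new DOI_START, so it would probably need its own condition. A possible sketch, assuming we only cut at a ";" that begins a `name=value` path parameter (the `PATH_PARAM` regex and `strip_path_params` name are made up for illustration):

```python
import re

# Assumption: a ';' that begins a "name=value" path parameter (such as
# ";jsessionid=...") is URL residue rather than part of the DOI.
PATH_PARAM = re.compile(r";[A-Za-z_][A-Za-z0-9_]*=")


def strip_path_params(candidate):
    """Cut a DOI candidate at the first ';name=' path parameter, if any."""
    match = PATH_PARAM.search(candidate)
    return candidate if match is None else candidate[:match.start()]


raw = "10.1002/ajp.22007/abstract;jsessionid=397B42DDD36E4F654BAB381E3104ABB3.d02t04"
print(strip_path_params(raw))  # 10.1002/ajp.22007/abstract
```

This still leaves the trailing "/abstract", which would need separate handling. The narrower condition is there because some older SICI-style DOIs contain a literal ";" (the ones ending in "CO;2-" plus a check character), and an unconditional split would truncate them.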