Another type of failure I see looks like this: 10.1086/591526+10.1088/0004-637X/706/1/L203
I'm not sure how we'd be able to tell that a "+" is not part of the DOI.
When I searched for this exact string, I found this listing: http://arxiv.org/abs/0805.4758 It seems that both DOIs are associated with the same paper. One is for the paper itself and the other is for an erratum to it!
I'm thinking that we might get high fitness by having a special rule in the parser for splitting characters like "+&?". If we see one of them right before some whitespace or a new DOI_START, then we stop reading the DOI.
I'm also seeing DOIs like this: "10.1002/ajp.22007/abstract;jsessionid=397B42DDD36E4F654BAB381E3104ABB3.d02t04" So we should include the semicolon in the list of splitting characters.
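Here is a rough sketch of how that rule might look, just to make the idea concrete. It is not the actual parser: the DOI_START regex, the function names read_doi and extract_dois, and the HARD_STOP_CHARS constant are all made up for illustration. One assumption worth flagging: in the jsessionid example the semicolon is followed by neither whitespace nor a new DOI, so the sketch treats ";" as an unconditional stop rather than applying the lookahead rule to it.

```python
import re

# Hypothetical stand-in for the parser's DOI_START token:
# "10." followed by a 4-9 digit registrant code and a slash.
DOI_START = re.compile(r"10\.\d{4,9}/")

# Characters that end the DOI only when followed by whitespace,
# end of input, or the start of another DOI.
SPLIT_CHARS = "+&?"

# Assumption: ";" always ends the DOI (see the jsessionid example).
HARD_STOP_CHARS = ";"


def read_doi(text, start):
    """Read one DOI beginning at text[start]; return (doi, stop_index)."""
    i = start
    n = len(text)
    while i < n:
        ch = text[i]
        if ch.isspace() or ch in HARD_STOP_CHARS:
            break
        if ch in SPLIT_CHARS:
            nxt = i + 1
            # Stop only if the splitter sits right before whitespace,
            # the end of the text, or a new DOI_START.
            if nxt >= n or text[nxt].isspace() or DOI_START.match(text, nxt):
                break
        i += 1
    return text[start:i], i


def extract_dois(text):
    """Collect every DOI found in text, applying the splitting rule."""
    dois = []
    pos = 0
    while True:
        m = DOI_START.search(text, pos)
        if m is None:
            break
        doi, end = read_doi(text, m.start())
        dois.append(doi)
        pos = max(end, m.end())  # always move forward
    return dois
```

With this, the "+"-joined string should come back as two separate DOIs, while a "+" in the middle of a DOI (followed by ordinary characters) is left alone. Note that the second example would still keep its "/abstract" suffix; stripping that would need a separate rule.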