mediawiki-utilities / python-mwcites

MIT License
38 stars 11 forks source link

Correctly parse ISBNs with space #12

Closed kodchi closed 6 years ago

kodchi commented 6 years ago

Sometimes, e.g. [1], ISBN numbers are typed erroneously as "isbn=2 906700-09-6" (notice the space after 2). This patch recognizes such ISBNs as valid.

[1] https://fr.wikipedia.org/w/index.php?diff=prev&oldid=83778617

halfak commented 6 years ago

I think we need some tests to see how this works in practice. E.g. does this cause more mistakes than it fixes? Have you already run some extractions to test it out?

kodchi commented 6 years ago

Tests are a good idea and I haven't run any. Let's wait on merging until we show this works better than the existing code then.

kodchi commented 6 years ago

I've generated citations for frwiki dumps dated 01/01/2018. Before the patch was applied, there were 389,264 ISBNs; while after the patch is applied there are 389,669 ISBNs, an increase of 405. @halfak do you see any need for further analysis?

halfak commented 6 years ago

Hmm. This change should not have altered the total number of ISBNs that are matched, should it have?

kodchi commented 6 years ago

I think the difference is coming from cases when multipe ISBNs start with the same number. For example, if two ISBNs are 2 07 and 2 10, then before the patch, they would have been considered as one ISBN — 2.

I also noticed that the patch should not be as strict as it's now because there are some cases when multiple spaces are used instead of hyphens. For example, https://fr.wikipedia.org/wiki/Mahmoud_Sami-Ali has ISBN 2 10 004179 7.

halfak commented 6 years ago

OK good enough for me.

halfak commented 6 years ago

Woops. Missed the second half of your comment. Want to submit a follow-up?

kodchi commented 6 years ago

Here's a follow up patch: https://github.com/mediawiki-utilities/python-mwcites/pull/13