Closed kodchi closed 6 years ago
I think we need some tests to see how this works in practice. E.g. does this cause more mistakes than it fixes? Have you already run some extractions to test it out?
Tests are a good idea and I haven't run any. Let's wait on merging until we show this works better than the existing code then.
I've generated citations for frwiki dumps dated 01/01/2018. Before the patch was applied, there were 389,264 ISBNs; while after the patch is applied there are 389,669 ISBNs, an increase of 405. @halfak do you see any need for further analysis?
Hmm. This change should not have altered the total number of ISBNs that are matched, should it have?
I think the difference is coming from cases when multipe ISBNs start with the same number. For example, if two ISBNs are 2 07 and 2 10, then before the patch, they would have been considered as one ISBN — 2.
I also noticed that the patch should not be as strict as it's now because there are some cases when multiple spaces are used instead of hyphens. For example, https://fr.wikipedia.org/wiki/Mahmoud_Sami-Ali has ISBN 2 10 004179 7.
OK good enough for me.
Woops. Missed the second half of your comment. Want to submit a follow-up?
Here's a follow up patch: https://github.com/mediawiki-utilities/python-mwcites/pull/13
Sometimes, e.g. [1], ISBN numbers are typed erroneously as "isbn=2 906700-09-6" (notice the space after 2). This patch recognizes such ISBNs as valid.
[1] https://fr.wikipedia.org/w/index.php?diff=prev&oldid=83778617