Closed kodchi closed 6 years ago
Testing the patch on real data. I'll add the results here.
The patch seems to have introduced a bug. I'm inspecting the results.
I've updated the patch and re-run it against frwiki. Before the patch, we had 389,669 ISBNs, and after the patch, we have 373,672, a reduction of 15,997. The reduction comes from correctly discarding non-ISBNs. For example, on https://fr.wikipedia.org/w/index.php?title=Mohamed_Seghir_Boushaki&action=edit&oldid=126284035, the pre-patch script found 3 ISBNs. Here they are:
10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 9782707173263
10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 2
10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 3
After the patch, the bottom two are ignored, so we have one ISBN instead of the previous three.
Some pages, such as [1], contain ISBNs written with spaces, e.g. 2 10 004179 7. The patch recognizes these as valid ISBNs.
[1] https://fr.wikipedia.org/wiki/Mahmoud_Sami-Ali?oldid=145625233
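For anyone reproducing the two behaviours described above (dropping stray values like "2" and "3", and accepting ISBNs written with spaces), here is a minimal sketch. It is not the patch itself: it assumes the filtering can be done with the standard ISBN-10 and ISBN-13 checksums after stripping spaces and hyphens, and the function names normalize_isbn and is_valid_isbn are hypothetical.

```python
import re


def normalize_isbn(raw):
    """Remove spaces and hyphens and uppercase a trailing 'x' check digit."""
    return re.sub(r"[\s-]", "", raw).upper()


def is_valid_isbn(raw):
    """Return True if raw is a checksum-valid ISBN-10 or ISBN-13.

    Assumption: checksum validation is used here for illustration; the
    actual patch may apply different rules.
    """
    s = normalize_isbn(raw)
    if re.fullmatch(r"[0-9]{9}[0-9X]", s):
        # ISBN-10: weights 10 down to 1, 'X' counts as 10, sum divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s)
        )
        return total % 11 == 0
    if re.fullmatch(r"[0-9]{13}", s):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum divisible by 10.
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(s))
        return total % 10 == 0
    return False


# Candidate identifiers taken from the examples in this thread.
candidates = ["9782707173263", "2", "3", "2 10 004179 7"]
kept = [normalize_isbn(c) for c in candidates if is_valid_isbn(c)]
print(kept)  # ['9782707173263', '2100041797']
```

Normalizing before validating means the spaced form and the compact form of the same ISBN end up as a single value, which should also help when counting distinct ISBNs.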