mediawiki-utilities / python-mwcites

Recognize ISBNs with spaces as valid numbers #13

Closed: kodchi closed this 6 years ago

kodchi commented 6 years ago

Some pages, e.g. [1], contain ISBNs written with spaces, such as 2 10 004179 7. This patch recognizes such ISBNs as valid numbers.

[1] https://fr.wikipedia.org/wiki/Mahmoud_Sami-Ali?oldid=145625233
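
To illustrate the idea, here is a minimal sketch of matching ISBNs whose digit groups are separated by spaces or hyphens. It is not the code from this patch: the pattern `ISBN_CANDIDATE` and the helpers `normalize_isbn` and `find_isbn_candidates` are hypothetical names chosen for this example, and the real extractor in mwcites may work differently.

```python
import re

# Hypothetical pattern (an assumption, not the patch's regex):
# 10 to 13 characters' worth of digits, each optionally followed by a single
# space or hyphen, ending in a digit or an "X" check character.
ISBN_CANDIDATE = re.compile(r"(?<![\dXx])(?:\d[ -]?){9,12}[\dXx](?![\dXx])")


def normalize_isbn(raw):
    """Strip spaces and hyphens and upper-case the check character."""
    return re.sub(r"[ -]", "", raw).upper()


def find_isbn_candidates(text):
    """Return normalized 10- or 13-character ISBN candidates found in text."""
    candidates = []
    for match in ISBN_CANDIDATE.finditer(text):
        normalized = normalize_isbn(match.group(0))
        # Only lengths 10 and 13 can be ISBNs; drop 11- and 12-digit runs.
        if len(normalized) in (10, 13):
            candidates.append(normalized)
    return candidates


print(find_isbn_candidates("ISBN 2 10 004179 7"))
print(find_isbn_candidates("isbn 978-2-7071-7326-3"))
```

Normalizing before any further check means the spaced form and the compact form of the same ISBN end up as the same identifier.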

kodchi commented 6 years ago

Testing the patch on real data. I'll add the results here.

kodchi commented 6 years ago

The patch seems to have introduced a bug. I'm inspecting the results.

kodchi commented 6 years ago

I've updated the patch and rerun it against frwiki. Before the patch, we had 389,669 ISBNs; after the patch, we have 373,672. The reduction comes from correctly discarding non-ISBNs. For example, on https://fr.wikipedia.org/w/index.php?title=Mohamed_Seghir_Boushaki&action=edit&oldid=126284035, the pre-patch script found 3 ISBNs. Here they are:

10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 9782707173263
10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 2
10019855 Mohamed Seghir Boushaki 126284035 2016-05-18T14:04:56Z isbn 3

After the patch, the bottom two are ignored, so we have one ISBN instead of the previous three.
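
One way to discard fragments like `2` and `3` is to require a valid length and check digit. The sketch below uses the standard ISBN-10 and ISBN-13 checksum rules; whether the patch itself validates checksums is an assumption on my part, and `is_valid_isbn` is a hypothetical helper, not a function from mwcites.

```python
ASCII_DIGITS = set("0123456789")


def is_valid_isbn(candidate):
    """Check length and check digit of an ISBN-10 or ISBN-13 candidate.

    Hypothetical helper for illustration; spaces and hyphens are ignored.
    """
    s = candidate.replace(" ", "").replace("-", "").upper()

    if len(s) == 10:
        # First nine characters must be digits; the last may be a digit or "X".
        if not all(ch in ASCII_DIGITS for ch in s[:9]):
            return False
        if s[9] not in ASCII_DIGITS and s[9] != "X":
            return False
        # ISBN-10: weights 10 down to 1, "X" counts as 10, sum divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s)
        )
        return total % 11 == 0

    if len(s) == 13:
        if not all(ch in ASCII_DIGITS for ch in s):
            return False
        # ISBN-13: alternating weights 1 and 3, sum divisible by 10.
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
        return total % 10 == 0

    # Any other length, such as the single digits "2" and "3", is rejected.
    return False


for value in ["9782707173263", "2", "3", "2 10 004179 7"]:
    print(value, is_valid_isbn(value))
```

Under these rules the first row from the example above passes, the two single-digit rows fail on length alone, and the spaced ISBN from the original description also passes.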