openzim / python-scraperlib

Collection of Python code to re-use across Python-based scrapers
GNU General Public License v3.0

Metadata length validation is buggy for unicode strings #158

Closed: benoit74 closed this issue 23 hours ago

benoit74 commented 2 months ago

The specification says that we must validate the number of characters (it looks like "graphemes" would be an even more correct term).

Currently scraperlib uses the len function, which counts code points rather than graphemes. Graphemes are what we want to validate, since they are the visually perceived characters; code points are not.

It looks like (according to ChatGPT, let's be honest) we could use the grapheme library. I am not sure this is the appropriate choice, since this lib seems barely maintained and is not released in a proper manner.

import grapheme

print(len("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 41 => Wrong
print(grapheme.length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # Outputs: 25 => Correct
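
For context, the gap between the two counts comes from Devanagari combining marks: a dependent vowel sign is a separate code point but is rendered together with the preceding consonant as a single grapheme. A quick way to see this with the standard library only:

import unicodedata

# "वि" is perceived as one character but is made of two code points:
# the consonant व (U+0935) followed by the dependent vowel sign ि (U+093F)
for char in "वि":
    print(f"U+{ord(char):04X} {unicodedata.name(char)}")
# U+0935 DEVANAGARI LETTER VA
# U+093F DEVANAGARI VOWEL SIGN I

print(len("वि"))  # Outputs: 2 => len counts code points, not perceived characters

This is why len returns 41 for the full title above while the grapheme count is 25.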
benoit74 commented 2 months ago

Another alternative (ChatGPT again) which seems a bit better maintained is the regex library (not the re standard library):

import regex

text = "विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"
print(len(regex.findall(r'\X', text)))  # \X matches one grapheme cluster; Outputs: 25 => Correct

https://pypi.org/project/regex/
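
For illustration, a minimal sketch of how the check could look with the regex approach; check_metadata_length and the maximum value are hypothetical, not the actual scraperlib API:

import regex

def grapheme_length(text: str) -> int:
    # \X matches one extended grapheme cluster, i.e. one visually perceived character
    return len(regex.findall(r"\X", text))

def check_metadata_length(value: str, max_length: int) -> None:
    # hypothetical validation helper: reject values longer than max_length graphemes
    count = grapheme_length(value)
    if count > max_length:
        raise ValueError(f"value is {count} graphemes long, maximum is {max_length}")

check_metadata_length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में", 30)  # passes: 25 graphemes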

rgaudin commented 2 months ago

We've used the regex approach and an ICU-based one in the past. I think it's mandatory to clarify the spec requirement first. This matches one of the topics of the hackathon as well!
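
For reference, a sketch of the ICU-based counting, assuming PyICU is installed; the class and method names (BreakIterator.createCharacterInstance, setText) are recalled from PyICU's wrapping of ICU and should be double-checked:

from icu import BreakIterator, Locale

def icu_grapheme_length(text: str) -> int:
    # a "character" break iterator yields grapheme cluster boundaries in ICU terms;
    # the locale has little effect on grapheme segmentation
    breaker = BreakIterator.createCharacterInstance(Locale("en"))
    breaker.setText(text)
    # iterating yields each boundary after position 0, one per grapheme cluster
    return sum(1 for _ in breaker)

print(icu_grapheme_length("विकी मेड मेडिकल इनसाइक्लोपीडिया हिंदी में"))  # expected: 25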

benoit74 commented 2 months ago

Yesterday's discussions at the hackathon were quite clear, but I can only agree that the requirement must be clear first ^^

rgaudin commented 2 months ago

Then that's OK. It would be good to add that clarification to the spec wiki as well.

benoit74 commented 3 days ago

The spec has been updated during the hackathon to specify that we need to count graphemes: https://wiki.openzim.org/wiki/Metadata