Unicode's word-segmentation rules include a fiddly little rule about splitting between an apostrophe and a following vowel, which Python's regex module faithfully implements. But in languages where this matters, such as French, the rule seems to have overlooked splitting between an apostrophe and a silent "h".
This branch adds another tokenization path that fixes up words such as "l'heure".
It also keeps the apostrophe attached to the preceding token when `include_punctuation=True`, as the option's name suggests it should.
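
To make the intended behavior concrete, here is a rough sketch of the idea using only the standard-library `re` module. It is not the code in this branch: the pattern `APOSTROPHE_SPLIT_RE` and the helper `split_elision` are hypothetical names invented for illustration, and the set of letters treated as "vowel or silent h" is an assumption.

```python
import re

# Hypothetical pattern (an assumption, not the branch's actual regex):
# an apostrophe that follows a word character and precedes a vowel or "h".
APOSTROPHE_SPLIT_RE = re.compile(
    r"(?<=\w)(['’])(?=[aeiouyhàâéèêîïôùûœ])", re.IGNORECASE
)


def split_elision(word, include_punctuation=False):
    """Split an elided word such as "l'heure" at its first matching apostrophe.

    When include_punctuation is true, the apostrophe stays attached to the
    first piece; otherwise it is dropped.
    """
    match = APOSTROPHE_SPLIT_RE.search(word)
    if match is None:
        # No apostrophe followed by a vowel or "h": leave the word alone.
        return [word]
    head = word[:match.start()]
    tail = word[match.end():]
    if include_punctuation:
        head += match.group(1)
    return [head, tail]


print(split_elision("l'heure"))                            # ['l', 'heure']
print(split_elision("l'heure", include_punctuation=True))  # ["l'", 'heure']
print(split_elision("don't"))                              # ["don't"]
```

The sketch only handles a single pre-split word; the real tokenizer presumably applies the equivalent logic as part of its overall token pattern.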