Closed by GoogleCodeExporter 9 years ago
Have you tried set operations? You'll need to use the VERSION1 flag ("(?V1)").
This:
"+utf8::P\n+utf8::S\n-002D\n-002E"
is the character set:
[\p{Punct}\p{Space}--\-.]
(I think!).
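A minimal sketch of that set in use, assuming the third-party regex module is installed (it is not the standard library's re). The sample string is made up for illustration, and the mapping from the Perl-style list to \p{Punct} and \p{Space} is only the tentative reading given above, not something checked against Perl:

```python
import regex  # third-party module (pip install regex), not the stdlib re

# (?V1) turns on VERSION1 behaviour, which enables set operators such as
# "--" (set difference).
# The set reads: punctuation or whitespace, minus hyphen-minus and full stop.
pattern = regex.compile(r'(?V1)[\p{Punct}\p{Space}--\-.]')

# Hypothetical sample text, chosen only to exercise the set.
sample = "a-b.c,d e!"
matches = pattern.findall(sample)

# The comma, the space and the exclamation mark are found;
# the hyphen and the full stop are excluded by the set difference.
print(matches)
```

Implicit union (simple juxtaposition inside the brackets) binds more tightly than "--", so the two excluded characters are subtracted from the combined punctuation-and-space set.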
Also, have a look at the WORD flag ("(?w)") which will make \b match on default
Unicode word boundaries instead of the usual simple word boundaries. You'll
need to use the VERSION1 flag:
>>> text = "can't aujourd'hui l'objectif"
>>> print(regex.split(r'(?V1w)\b', text))
['', "can't", ' ', "aujourd'hui", ' ', "l'", 'objectif', '']
The VERSION1 flag is needed because, by default, the regex module tries to be
backwards compatible with the re module, which supports only simple sets and
(before Python 3.7) won't split on zero-width matches such as word boundaries.
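To see what the WORD flag changes, here is a small comparison sketch on the same text, again assuming the third-party regex module is installed. The variable names are only illustrative:

```python
import regex  # third-party module (pip install regex), not the stdlib re

text = "can't aujourd'hui l'objectif"

# Simple word boundaries: \b falls wherever a word character meets a
# non-word character, so the apostrophe cuts "can't" into pieces.
simple_parts = regex.split(r'(?V1)\b', text)

# WORD flag ("w"): \b follows the default Unicode word-boundary rules,
# which keep an apostrophe between letters inside the word.
unicode_parts = regex.split(r'(?V1w)\b', text)

print(simple_parts)
print(unicode_parts)
```

With simple boundaries "can't" comes back as separate pieces around the apostrophe, whereas with the WORD flag it survives as a single item, as in the output shown above.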
Original comment by re...@mrabarnett.plus.com
on 10 Apr 2014 at 2:02
You are right, sets do the job! Actually, they work even without the V1 flag.
Thank you very much for your help.
Best,
--Nabil Hathout
Original comment by nabil.ha...@gmail.com
on 10 Apr 2014 at 2:24
Glad to be of help!
Original comment by re...@mrabarnett.plus.com
on 10 Apr 2014 at 2:55
Original issue reported on code.google.com by
nabil.ha...@gmail.com
on 10 Apr 2014 at 10:26