custom character properties

GoogleCodeExporter commented 9 years ago

Why do you want this feature? What is your use-case?

I want to segment French text into words

What should the syntax or call look like?

Something like Perl's custom character properties:

    "+utf8::P\n+utf8::S\n-002D\n-002E"

Do any other regex implementations have something like this?

Yes, Perl's regex engine does.

Please provide any additional information below.

I would be very happy if I could define my own Unicode properties, such as 
"punctuation and spaces, but not the dash or the full stop".

Thank you for this very useful module.

Original issue reported on code.google.com by nabil.ha...@gmail.com on 10 Apr 2014 at 10:26

GoogleCodeExporter commented 9 years ago

Have you tried set operations? You'll need to use the VERSION1 flag ("(?V1)").

This:

    "+utf8::P\n+utf8::S\n-002D\n-002E"

is the character set:

    [\p{Punct}\p{Space}--\-.]

(I think!).
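
To make the meaning of that set concrete, here is a minimal sketch that spells out the same rule with the standard library only. It is an approximation, not what the regex module does internally: `str.isspace()` and `unicodedata.category()` stand in for `\p{Space}` and `\p{Punct}`, and the names `KEPT`, `is_separator`, `segment` and `sample` are made up for illustration. If the regex module happens to be installed, the last lines run the V1 set on the same text for comparison.

```python
import unicodedata
from itertools import groupby

# Characters the request wants to keep inside words:
# dash (U+002D) and point (U+002E).
KEPT = {"-", "."}


def is_separator(ch):
    # Rough stand-in for the set [\p{Punct}\p{Space}--\-.]:
    # any punctuation (general category P*) or whitespace,
    # minus the two kept characters.
    if ch in KEPT:
        return False
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def segment(text):
    # Group consecutive characters by separator status and
    # keep only the runs that are not separators.
    return ["".join(run) for sep, run in groupby(text, key=is_separator) if not sep]


sample = "Bonjour, peut-être: 3.5 fois !"
print(segment(sample))

try:
    import regex  # third-party module discussed in this thread; optional here
except ImportError:
    regex = None

if regex is not None:
    # The set from the comment above, used as a separator pattern.
    separators = regex.compile(r"(?V1)[\p{Punct}\p{Space}--\-.]+")
    print([tok for tok in separators.split(sample) if tok])
```

The standard-library part prints `['Bonjour', 'peut-être', '3.5', 'fois']`: the comma, colon, spaces and exclamation mark separate words, while the dash and the point stay inside them.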

Also, have a look at the WORD flag ("(?w)") which will make \b match on default 
Unicode word boundaries instead of the usual simple word boundaries. You'll 
need to use the VERSION1 flag:

    >>> import regex
    >>> text = "can't aujourd'hui l'objectif"
    >>> print(regex.split(r'(?V1w)\b', text))
    ['', "can't", ' ', "aujourd'hui", ' ', "l'", 'objectif', '']

The VERSION1 flag is needed because the regex module tries to be backwards 
compatible with the re module which supports only simple sets and won't split 
on zero-width matches such as word boundaries.
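
For contrast, here is a small sketch of what the simple word boundaries do to the same text, using the standard library re module. It assumes Python 3.7 or later, where `re.split` does split on empty matches (it did not when this comment was written). `simple_boundary_split` is only an illustrative name.

```python
import re


def simple_boundary_split(text):
    # \b in re is the simple boundary: any position between a word
    # character (\w) and a non-word character, or the string edge.
    return re.split(r"\b", text)


text = "can't aujourd'hui l'objectif"
print(simple_boundary_split(text))
```

With simple boundaries every apostrophe becomes its own piece, so "can't" comes back as `'can'`, `"'"`, `'t'` instead of the single token shown in the (?V1w) output.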

Original comment by re...@mrabarnett.plus.com on 10 Apr 2014 at 2:02

GoogleCodeExporter commented 9 years ago

You are right, sets do the job! Actually, they work even without the V1 flag. 
Thank you very much for your help.

Best,

--Nabil Hathout

Original comment by nabil.ha...@gmail.com on 10 Apr 2014 at 2:24

GoogleCodeExporter commented 9 years ago

Glad to be of help!

Original comment by re...@mrabarnett.plus.com on 10 Apr 2014 at 2:55

Changed state: Done

jamadden / mrab-regex-hg

custom character properties #110