Closed. rotsix closed this issue 4 years ago
Yep, I think that's a reasonable change. What class(es) would you use? Maybe Letter and Punctuation?
On Mon, 8 Jun 2020 at 03:15, Victor Franzi notifications@github.com wrote:
Hey there.
The regular expression in charge of removing invalid characters (this one https://github.com/jpillora/media-sort/blob/master/sort/strings.go#L22) is a bit too aggressive for some characters, such as accented letters, the cedilla, the ampersand, etc.
A crude workaround would be to include these characters in the invalidChars regexp. A prettier solution is to use the Unicode character property http://www.regular-expressions.info/unicode.html#prop class, which seems to be supported https://github.com/google/re2/wiki/Syntax in Go's regexp module.
What do you think about it?
Considering the Go doc, it seems to be \pN, but the Unicode doc says \p{L}.
It says you can use class names as well, it seems, e.g. "\p{Greek}". Just thinking of the ampersand case you mentioned.
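For reference, a small sketch of the two spellings discussed above. In the RE2 syntax table, \pN is the one-letter shorthand form (with N standing in for the class letter), while \p{...} is the braced form that also takes longer names such as script names. The variable names here are only illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative names only; none of these exist in media-sort.
var (
	// One-letter shorthand: \pL means "any Unicode letter".
	shortLetter = regexp.MustCompile(`^\pL+$`)
	// Braced form of the same class.
	bracedLetter = regexp.MustCompile(`^\p{L}+$`)
	// Script names need the braced form.
	greek = regexp.MustCompile(`^\p{Greek}+$`)
)

func main() {
	for _, s := range []string{"été", "λόγος", "a&b"} {
		fmt.Println(s,
			shortLetter.MatchString(s),
			bracedLetter.MatchString(s),
			greek.MatchString(s))
	}
}
```

Both letter patterns accept "été" and "λόγος" and reject "a&b", since the ampersand is not a letter; only the Greek pattern is limited to a single script.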
On Mon, 8 Jun 2020 at 03:30, Victor Franzi notifications@github.com wrote:
Considering the go doc https://github.com/google/re2/wiki/Syntax, it seems to be \pN, but the unicode doc http://www.regular-expressions.info/unicode.html#prop says \p{L}.
According to the Go doc, using \p{Greek} and \p{L} should be enough for most characters, whilst & may need to be listed explicitly.
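A minimal sketch of what that could look like, assuming the cleanup is a negated character class whose matches are deleted. The pattern, the extra allowed characters (digits, apostrophe, space, hyphen) and the clean helper are hypothetical and are not copied from sort/strings.go.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// invalidChars is a hypothetical stand-in for the pattern in media-sort.
// It matches runs of anything that is NOT a Unicode letter (\p{L}),
// a Unicode number (\p{N}), an ampersand, an apostrophe, a space or a hyphen.
var invalidChars = regexp.MustCompile(`[^\p{L}\p{N}&' -]+`)

// clean is an illustrative helper: it deletes invalid runs and trims
// surrounding whitespace. The real code may replace them differently.
func clean(s string) string {
	return strings.TrimSpace(invalidChars.ReplaceAllString(s, ""))
}

func main() {
	fmt.Println(clean("Amélie & François (2001)!")) // Amélie & François 2001
}
```

Note that \p{L} covers letters from every script, Greek included, so in this sketch \p{Greek} is not listed separately; the ampersand does have to be spelled out because it belongs to neither the letter nor the number classes.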
Hey there.
The regular expression in charge of removing invalid characters (this one) is a bit too aggressive for some characters, such as accented letters, the cedilla, the ampersand, etc.
A crude workaround would be to include these characters in the invalidChars regexp. A prettier solution is to use the Unicode character property class, which seems to be supported in Go's regexp module.
What do you think about it?
Edit: the related class is \p{L}