mediasort.invalidChars is too aggressive

rotsix commented 4 years ago

Hey there.

The regular expression in charge of invalid characters removal (this one) is a bit too aggressive for some characters such as accentuated ones, cedilla, amperstand, etc.

A dummy way around would be to include these characters in the invalidChars regexp. A prettier solution is to use the unicode character property class which seems to be included in Go regexp module.

What do you think about it?

Edit: related class is \p{L}

jpillora commented 4 years ago

Yep, I think that's a reasonable change - what class(es) would you use? maybe Letter and Punctuation?

On Mon, 8 Jun 2020 at 03:15, Victor Franzi notifications@github.com wrote:

Hey there.

The regular expression in charge of invalid characters removal (this one https://github.com/jpillora/media-sort/blob/master/sort/strings.go#L22) is a bit too aggressive for some characters such as accentuated ones, cedilla, amperstand, etc.

A dummy way around would be to include these characters in the invalidChars regexp. A prettier solution is to use the unicode character property http://www.regular-expressions.info/unicode.html#prop class which seems to be included https://github.com/google/re2/wiki/Syntax in Go regexp module.

What do you think about it?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/jpillora/media-sort/issues/24, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE2X45QBSHGGHFRKMUXG6TRVPDKBANCNFSM4NXODOFQ .

rotsix commented 4 years ago

Considering the go doc, it seems to be \pN, but the unicode doc says \p{L}.

jpillora commented 4 years ago

says you can use class names as well it seems "\p{Greek}" just thinking of the ampersand case you mentioned

On Mon, 8 Jun 2020 at 03:30, Victor Franzi notifications@github.com wrote:

Considering the go doc https://github.com/google/re2/wiki/Syntax, it seems to be \pN, but the unicode doc http://www.regular-expressions.info/unicode.html#prop says \p{L}.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/jpillora/media-sort/issues/24#issuecomment-640252523, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE2X4ZWZR6NEQISZJ5FSHLRVPFEZANCNFSM4NXODOFQ .

rotsix commented 4 years ago

According to the go doc, using \p{Greek} and \p{L} should be enough for most characters, whilst & may require to be explicited.

jpillora / media-sort

mediasort.invalidChars is too aggressive #24