thombashi / pathvalidate

A Python library to sanitize/validate a string such as filenames/file-paths/etc.
https://pathvalidate.rtfd.io/
MIT License

Unicode en dash (u"\u2013") Is Not Replaced By sanitize_filename #30

Closed: kenlerner closed this issue 1 year ago

kenlerner commented 1 year ago

When running the following: sanitized = sanitize_filename(txt, platform="Windows")

If the variable txt contains a Unicode en dash, the returned filename still contains it: the dash is not replaced. An error then occurs when a file is opened using the sanitized filename.

The following change works: sanitized = sanitize_filename(re.sub(u"\u2013", "-", txt), platform="Windows")

I think the function should replace the Unicode en dash with an ASCII hyphen.
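A minimal sketch of the pre-processing workaround described above, generalized to a few other dash-like characters. The helper name `replace_unicode_dashes` and the choice of code points are assumptions for illustration; this is not part of pathvalidate.

```python
# Hypothetical helper (not part of pathvalidate): map common Unicode
# dash-like characters to the ASCII hyphen-minus before sanitizing.
_DASH_TABLE = {
    0x2010: "-",  # HYPHEN
    0x2011: "-",  # NON-BREAKING HYPHEN
    0x2012: "-",  # FIGURE DASH
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2015: "-",  # HORIZONTAL BAR
    0x2212: "-",  # MINUS SIGN
}


def replace_unicode_dashes(text: str) -> str:
    """Return text with the dash-like characters above replaced by '-'."""
    return text.translate(_DASH_TABLE)


name = replace_unicode_dashes("report 2019\u20132020.txt")
print(name)  # report 2019-2020.txt
```

The result could then be passed on, for example as sanitize_filename(replace_unicode_dashes(txt), platform="Windows").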

thombashi commented 1 year ago

@kenlerner Thank you for your feedback.

Could I ask what made you think a Unicode dash is an invalid character for a filename? Unicode normalization (NFC, NFKC, NFD, NFKD) leaves Unicode dashes as they are.
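The normalization point can be checked with the standard library alone. This is a small sketch that only covers the en dash (U+2013); other dash characters may behave differently under the compatibility forms.

```python
import unicodedata

EN_DASH = "\u2013"

# Record, for each normalization form, whether the en dash survives unchanged.
unchanged = {
    form: unicodedata.normalize(form, EN_DASH) == EN_DASH
    for form in ("NFC", "NFKC", "NFD", "NFKD")
}
print(unchanged)
```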

I understand that Unicode dashes can be confusing in file names, but they are still valid characters for file names.

kenlerner commented 1 year ago

Python raised an exception when trying to create a file whose filename contained a Unicode dash. The error was the same as the one reported here: https://stackoverflow.com/questions/55867822/when-running-python-script-i-get-%C3%A2%E2%82%AC-instead-of-a-hyphen

thombashi commented 1 year ago

I can create files whose names include a Unicode dash with Python. If that exception happens only with a specific Python version, please upgrade Python or report the problem to the official Python team.
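A quick way to try this locally is sketched below. It assumes Python 3 and a filesystem and locale encoding that can represent U+2013; the file name is made up for the test.

```python
import tempfile
from pathlib import Path

# Create, read back, and clean up a file whose name contains an en dash.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "draft \u2013 final.txt"
    path.write_text("ok", encoding="utf-8")
    created = path.exists()
    content = path.read_text(encoding="utf-8")

print(created, content)
```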

Also, the question at that link does not seem to be a filename problem: they simply mixed ASCII dashes and Unicode dashes as dictionary keys.
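The kind of mismatch described here can be reproduced in a few lines. The dictionary and its keys are invented for illustration and are not taken from the linked question.

```python
# The stored key uses an ASCII hyphen-minus (U+002D).
prices = {"pre-order": 10}

ascii_key = "pre-order"
en_dash_key = "pre\u2013order"  # looks similar, but contains U+2013

found_ascii = ascii_key in prices
found_en_dash = en_dash_key in prices
print(found_ascii, found_en_dash)  # True False
```

The two strings render almost identically, but they are different keys, so a lookup with the en dash variant raises KeyError.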