thombashi / pathvalidate

A Python library to sanitize/validate a string such as filenames/file-paths/etc.
https://pathvalidate.rtfd.io/
MIT License
210 stars 12 forks source link

max_len with sanitize_filename behaves incorrectly with multi-byte unicode chars #47

Closed 7x11x13 closed 2 weeks ago

7x11x13 commented 2 weeks ago

Code to reproduce:

from pathvalidate import sanitize_filename, validate_filename

filename = "図彌見視御未味尾微身実箕論學識我遠不及他段齉籲颧饕掱麒麟魑魅魍魉麤𪚥龘 爨馕龘龘憂鬱龘國龘əəəəəəəəə иߤ-кߎ߹𝒴-𝓃߫߯ р✁ту✁ть.mp3"

print("length:", len(filename))
print("bytes:", len(filename.encode()))
sanitized = sanitize_filename(filename, max_len=255)
print("sanitized length:", len(sanitized))
print("sanitized bytes:", len(sanitized.encode()))
validate_filename(sanitized, max_len=255)

gives output:

length: 114
bytes: 270
sanitized length: 114
sanitized bytes: 270
Traceback (most recent call last):
  ...
pathvalidate.error.ValidationError: [PV1101] found an invalid string length: filename is too long: expected<=255 bytes, actual=270 bytes, platform=universal, fs_encoding=utf-8, byte_count=270