Open thezoggy opened 21 hours ago
OK, with him using 200 it worked fine... which supports my assumption that sab counted it as 240, not 269, and so didn't sanitize (as 240 < 246).. but was forced to truncate when using 200..
Attaching the user log showing that (and it's debug level this time): 2024-12-02--sabnzbd.log
I expected this part to fix that: https://github.com/sabnzbd/sabnzbd/blob/cc402148187216622aa44bc5a8bc06354cce01b8/sabnzbd/filesystem.py#L244-L261
It seems that code is also triggered, but it didn't shorten the name enough. @sanderjo
Oh scratch that. The problem isn't the filename, it's the folder name. We don't sanitize the folder name like this. I guess we should?
> so guessing just fallout of checking char length? as him trying `max_foldername_length` of 200 it worked fine..
Yeah, the 15 Japanese characters in the title have Unicode code points that take 3 bytes each when encoded as UTF-8, so the length in bytes will be 30 more than the character count of the string.
> I guess we should?
Yep.
Couldn't help but notice the code linked above, dealing with long filenames, makes rather short work of anything that isn't ASCII. A filename written entirely in Japanese could even end up getting nixed completely at https://github.com/sabnzbd/sabnzbd/blame/cc402148187216622aa44bc5a8bc06354cce01b8/sabnzbd/filesystem.py#L249?
Yeah... Maybe we should use `replace` instead of `ignore`?
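As a sketch of the difference between the two error handlers (a hypothetical helper, not SABnzbd's actual code): with `errors="ignore"` an all-Japanese name encodes down to an empty string, while `errors="replace"` at least keeps one `?` per character, so the name's length survives:

```python
# Hypothetical helper illustrating the two ASCII error handlers.
def to_ascii(name: str, errors: str) -> str:
    # "ignore" silently drops unencodable characters; "replace" turns
    # each of them into a literal "?" byte.
    return name.encode("ascii", errors=errors).decode("ascii", errors="ignore")

title = "帝帝帝帝帝"
print(to_ascii(title, "ignore"))   # -> "" (the name vanishes entirely)
print(to_ascii(title, "replace"))  # -> "?????" (length preserved)
```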
@jcfp like this?
```python
>>> a = "帝帝帝帝帝"
>>> len(a)
5
>>> str.encode(a)
b'\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d'
>>> len(str.encode(a))
15
```
15 bytes is correct: the size towards the filesystem.
This affects garbled strings too. Probably it's seen/encoded as UTF-8 in the code below. Maybe inside SABnzbd it's treated differently?
```python
>>> a = "Season 3 - 06 Re:ゼãƒã�‹ã‚‰å§‹ã‚�る異世界生活 - 第56話"
>>> len(a)
67
>>> len(str.encode(a))
126
```
> Yeah... Maybe we should use `replace` instead of `ignore`?
That would avoid the name being reduced to nothing, although an all-questionmarks filename isn't ideal either.
Any reason for forcing ASCII here instead of `name.encode("utf-8", errors="replace")[:max_len].decode(errors="ignore")`, which keeps the filename, enforces the limit in bytes, and discards the multi-byte character at the end in case it got cut off?
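A minimal sketch of that expression (hypothetical `truncate_utf8` helper and `max_len`; not the actual SABnzbd code). The slice is applied to the encoded bytes, so a multi-byte character split at the boundary is silently dropped by the lenient decode:

```python
def truncate_utf8(name: str, max_len: int) -> str:
    # Enforce the limit in *bytes*, not characters; errors="ignore" on
    # decode discards a multi-byte character the slice may have cut in half.
    return name.encode("utf-8", errors="replace")[:max_len].decode("utf-8", errors="ignore")

title = "帝" * 10                         # 10 characters, 30 bytes in UTF-8
short = truncate_utf8(title, 8)
print(short)                              # "帝帝": 8 bytes would split the 3rd char
print(len(short.encode("utf-8")))         # 6, safely under the 8-byte limit
```

Note `bytes.decode` defaults to UTF-8, so `decode(errors="ignore")` in the quoted expression is the same as the explicit `decode("utf-8", errors="ignore")` above.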
@sanderjo

> Maybe inside SABnzbd it's treated differently?
Just unicode, try for yourself:

```python
a = "Season 3 - 06 Re:ゼãƒã�‹ã‚‰å§‹ã‚�る異世界生活 - 第56話"
for item in a:
    print(item, ":", len(item.encode("utf-8")), ord(item), item.encode("utf-8"))
```
**SABnzbd version**
4.3.3

**Operating system**
Debian

**Using Docker image**
linuxserver

**Description**
A person on Discord reported running into a file-length issue on Debian with an ext4 drive using sab.
logs (info level, requested debug): https://privatebin.net/?94612123a31d9fbd#GKFHewMTiLL2LDvAkMDCMf1MSGSmDLgpCFsasAUkj4Ys
Was able to get a debug snippet of a retry:
Looking at the nzb he mentioned he grabbed (provided via Discord), I see per the nzb contents:
so while it is 240 characters, it is 269 bytes long (due to Unicode).
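As an illustrative sketch of that mismatch (a synthetic name, not the actual nzb title): a handful of three-byte Japanese characters plus ASCII padding gives 240 characters but well over 255 bytes, which is ext4's limit on a single name component:

```python
# Synthetic example: a character-count check passes a 255 limit,
# but the UTF-8 byte count does not.
name = "帝" * 15 + "a" * 225          # 240 characters total
byte_len = len(name.encode("utf-8"))  # 15 * 3 + 225 = 270 bytes
print(len(name), byte_len)            # 240 270
print(len(name) <= 255)               # True:  a character-count check passes
print(byte_len <= 255)                # False: ext4 would reject the name
```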
The subject length is even worse:
And per config he has:
so guessing just fallout of checking char length? as him trying `max_foldername_length` of 200 it worked fine..