Open thezoggy opened 21 hours ago
OK, with him using 200 it worked fine... which supports my assumption that sab counted it as 240, not 269, and so didn't sanitize (as 240 < 246).. but was forced to truncate when using 200..
Attaching the user log showing that (and it's debug level this time): 2024-12-02--sabnzbd.log
I expected this part to fix that: https://github.com/sabnzbd/sabnzbd/blob/cc402148187216622aa44bc5a8bc06354cce01b8/sabnzbd/filesystem.py#L244-L261
It seems that code is also triggered, but it didn't shorten the name enough. @sanderjo
Oh scratch that. The problem isn't the filename, it's the folder name. We don't sanitize the folder name like this. I guess we should?
> so guessing just fallout of checking char length? as him trying `max_foldername_length` of 200 it worked fine..
Yeah, the 15 Japanese characters in the title have Unicode code points that take 3 bytes each when encoded as UTF-8, so the length in bytes will be 30 more than the character count of the string.
> I guess we should?
Yep.
Couldn't help but notice the code linked above, dealing with long filenames, makes rather short work of anything that isn't ASCII. A filename written entirely in Japanese could even end up getting nixed completely at https://github.com/sabnzbd/sabnzbd/blame/cc402148187216622aa44bc5a8bc06354cce01b8/sabnzbd/filesystem.py#L249?
Yeah... Maybe we should use `replace` instead of `ignore`?
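As a sketch of the difference between the two error handlers (a hypothetical helper, not SABnzbd's actual code): with `errors="ignore"` an all-Japanese name encodes down to an empty string, while `errors="replace"` at least keeps one `?` per character, so the name's length survives:

```python
# Hypothetical helper illustrating the two ASCII error handlers.
def to_ascii(name: str, errors: str) -> str:
    # "ignore" silently drops unencodable characters; "replace" turns
    # each of them into a literal "?" byte.
    return name.encode("ascii", errors=errors).decode("ascii", errors="ignore")

title = "帝帝帝帝帝"
print(to_ascii(title, "ignore"))   # -> "" (the name vanishes entirely)
print(to_ascii(title, "replace"))  # -> "?????" (length preserved)
```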
@jcfp like this?
```python
>>> a = "帝帝帝帝帝"
>>> len(a)
5
>>> str.encode(a)
b'\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d\xe5\xb8\x9d'
>>> len(str.encode(a))
15
```
15 bytes is correct: the size towards the filesystem.
This affects garbled strings too. Probably it's seen/encoded as UTF-8 in the code below. Maybe inside SABnzbd it's treated differently?
```python
>>> a = "Season 3 - 06 Re:ゼãƒã�‹ã‚‰å§‹ã‚�る異世界生活 - 第56話"
>>> len(a)
67
>>> len(str.encode(a))
126
```
> Yeah... Maybe we should use `replace` instead of `ignore`?
That would avoid the name being reduced to nothing, although an all-questionmarks filename isn't ideal either.
Any reason for forcing ASCII here instead of `name.encode("utf-8", errors="replace")[:max_len].decode(errors="ignore")`, which keeps the filename, enforces the limit in bytes, and discards the multi-byte character at the end in case it got cut off?
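A minimal sketch of that expression (hypothetical `truncate_utf8` helper and `max_len`; not the actual SABnzbd code). The slice is applied to the encoded bytes, so a multi-byte character split at the boundary is silently dropped by the lenient decode:

```python
def truncate_utf8(name: str, max_len: int) -> str:
    # Enforce the limit in *bytes*, not characters; errors="ignore" on
    # decode discards a multi-byte character the slice may have cut in half.
    return name.encode("utf-8", errors="replace")[:max_len].decode("utf-8", errors="ignore")

title = "帝" * 10                         # 10 characters, 30 bytes in UTF-8
short = truncate_utf8(title, 8)
print(short)                              # "帝帝": 8 bytes would split the 3rd char
print(len(short.encode("utf-8")))         # 6, safely under the 8-byte limit
```

Note `bytes.decode` defaults to UTF-8, so `decode(errors="ignore")` in the quoted expression is the same as the explicit `decode("utf-8", errors="ignore")` above.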
@sanderjo

> Maybe inside SABnzbd it's treated differently?
Just unicode, try for yourself:

```python
a = "Season 3 - 06 Re:ゼãƒã�‹ã‚‰å§‹ã‚�る異世界生活 - 第56話"
for item in a:
    print(item, ":", len(item.encode("utf-8")), ord(item), item.encode("utf-8"))
```
**SABnzbd version**
4.3.3

**Operating system**
Debian

**Using Docker image**
linuxserver

**Description**
A person on Discord reported running into a file-length issue on Debian with an ext4 drive using sab.
logs (info level, requested debug): https://privatebin.net/?94612123a31d9fbd#GKFHewMTiLL2LDvAkMDCMf1MSGSmDLgpCFsasAUkj4Ys
Was able to get a debug snippet of a retry:
Looking at the nzb he mentioned he grabbed (provided via Discord), I see per the nzb contents:
so while it is 240 characters, it is 269 bytes long (due to Unicode).
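As an illustrative sketch of that mismatch (a synthetic name, not the actual nzb title): a handful of three-byte Japanese characters plus ASCII padding gives 240 characters but well over 255 bytes, which is ext4's limit on a single name component:

```python
# Synthetic example: a character-count check passes a 255 limit,
# but the UTF-8 byte count does not.
name = "帝" * 15 + "a" * 225          # 240 characters total
byte_len = len(name.encode("utf-8"))  # 15 * 3 + 225 = 270 bytes
print(len(name), byte_len)            # 240 270
print(len(name) <= 255)               # True:  a character-count check passes
print(byte_len <= 255)                # False: ext4 would reject the name
```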
The subject length is even worse:
And per config he has:
so guessing just fallout of checking char length? as him trying `max_foldername_length` of 200 it worked fine..