This blog post has some interesting pointers on how to deal with non-UTF-8 file names: https://beets.io/blog/paths.html
Related:
https://news.ycombinator.com/item?id=16991263
https://vstinner.github.io/python37-new-utf8-mode.html
https://stackoverflow.com/questions/51001150/what-are-surrogate-characters-in-utf-8/51051607#51051607
This is not a trivial issue; latest update:
Python automatically converts undecodable bytes 0x80..0xff in file paths to surrogate code points in the range 0xdc80..0xdcff, and performs the reverse conversion when the path is passed back to the OS. The functions os.fsdecode() and os.fsencode() accomplish this.
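A minimal round trip demonstrating this mapping (assuming a UTF-8 locale, as on a typical Linux system):

```python
import os

raw = b"hello \x88 world"        # on-disk name, not valid UTF-8
name = os.fsdecode(raw)          # byte 0x88 becomes the surrogate U+DC88
assert name == "hello \udc88 world"
assert os.fsencode(name) == raw  # reverse conversion restores the exact bytes
```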
However, surrogate code points are only meaningful in UTF-16, where they come in pairs to encode values larger than 0xffff; the variable-width encoding UTF-8 does not need them, so surrogate code points are not allowed in UTF-8. Consequently, the often-cited UTF-8 decoder by Bjoern Hoehrmann rejects surrogates, and that is the validator used by websocketpp and thus by us.
Even if we extended the validator to allow surrogates in UTF-8, which is what Python does (PEP 383), there is no guarantee that all WebSocket implementations our UI will encounter will accept these per-spec invalid UTF-8 sequences. So essentially, we have to find a different range of characters to map 0x80..0xff to; that range must yield valid UTF-8 and be unlikely to collide with any other uses or apps in practice.
What if we just converted every byte separately for transfer over websocketpp? I.e.
"hello 0x88 world" => "hello \u0088 world"
That would preserve the bytes exactly as they are.
> What if we just converted every byte separately for transfer over websocketpp? I.e.
> "hello 0x88 world" => "hello \u0088 world"
> That would preserve the bytes exactly as they are.
A "byte" is not valid unicode. The sequence you use above "\u0088" is not a byte but a unicode code point that gets UTF-8 encoded as "\xed\xb2\x88". But per spec, the encoding only works for valid code points, the surrogate used by Python would be 0xdc88, which using the UFT-8 encoding would give "\xed\xb2\x88", but that is not valid UTF-8 because 0xdc88 is not a valid code point outside of UTF-16. Btw, "\u0088" compiles fine in C++, but "\udc88" yields "error: invalid universal character" for the same reason (the C++ compiler generates valid UTF-8 sequences from \uNNNN).
Also, we don't control what the receiving end of the websocket does; that is up to the browser implementations and the browsers' JSON parsers. See also: UTF-8 Codepage Layout
Ok, maybe my description of what I am suggesting was not great. A filename on Linux is a sequence of bytes, and every combination is allowed, so a filename that contains a "\x88" is allowed. If a filename happens to be valid UTF-8, that is great, but it doesn't have to be.
So what we want is a pair of functions: one that maps a filename to a string that is guaranteed to be valid UTF-8, and one that maps the result back to the original filename. Note that these functions do not necessarily need to map valid UTF-8 input to the same valid UTF-8 output, and I don't think that is necessary for solving our problem here.
My suggestion was simply to map every byte B of the filename (between 0x01 and 0xff) to the Unicode code point B. Of course, for a UTF-8 input string this would not preserve the string, but would map every part of the string that really uses characters >= 0x80 due to UTF-8 encoding to something crappy. But on the other hand, at the point where we need the filename again, we can reverse the process and get the exact same sequence of bytes. And we could add some third "display name" function that undoes the mapping in the UI.
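A minimal sketch of this byte-to-code-point mapping (function names are illustrative, not part of any existing API); in Python terms it is exactly a Latin-1 decode/encode pair:

```python
def encode_filename(raw: bytes) -> str:
    # Map each byte B (0x01..0xff) to code point U+00B; this is
    # precisely what decoding as Latin-1 does, and it always succeeds.
    return raw.decode("latin-1")

def decode_filename(name: str) -> bytes:
    # Reverse mapping: code points U+0001..U+00FF back to single bytes.
    return name.encode("latin-1")

raw = b"caf\xc3\xa9 \x88"                            # mixed valid/invalid UTF-8
assert decode_filename(encode_filename(raw)) == raw  # lossless round trip
```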
Basically, it is a variant of the encoding the browsers use for URIs. You could also use "hello%20%88%20world" for the filename "hello 0x88 world", and this would also work, because you could convert back and forth without loss, and both valid and invalid UTF-8 would be representable. This is also an encoding we could use and which would solve the problem.
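For illustration, the percent-encoding variant, here sketched with Python's byte-oriented URI helpers:

```python
import urllib.parse

raw = b"hello \x88 world"
quoted = urllib.parse.quote_from_bytes(raw, safe="")   # percent-encode every reserved byte
assert quoted == "hello%20%88%20world"
assert urllib.parse.unquote_to_bytes(quoted) == raw    # lossless round trip
```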
> Also, we don't control what the receiving end of the websocket does; that is up to the browser implementations and the browsers' JSON parsers.
We do not need to control what the receiving end of the websocket does, because both suggestions map all possible filenames to a sequence of valid UTF-8 code points (and back). A websocket implementation that kills such chars would also kill every code point between 0x80 and 0xff, and would thus be broken.
Btw, JavaScript uses UTF-16, which is another reason the 0xdc80+ abuse could never have worked. The UTF-8 Wikipedia page mentions the MirBSD OPTU-8/16 encoding, which converts such bytes to U+EF80..U+EFFF in a Private Use Area.
That actually works for our purposes, it is just painful to adapt all functions that do file IO (which are fairly few in Anklang due to the IPC abstraction we have, but still). I managed to implement a displayfs(filename) function in pure JavaScript via Regexp that maps 0xEF80+ escapes back to Latin-1. This makes things significantly easier, as we don't have to pass a filename and a display string through the IPC API now. I'll probably get rid of Resource.label entirely in the IPC API and just use Resource.uri everywhere, which is proper Unicode and already used by the UI anyway for simple path manipulations. The actual label text can be generated via basename() and displayfs().
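A sketch of the OPTU-8/16-style mapping plus a displayfs() counterpart, written in Python for brevity (the real displayfs() is JavaScript; the other helper names here are illustrative, and the scheme assumes on-disk names never legitimately contain U+EF80..U+EFFF):

```python
import codecs, re

def _optu_escape(exc):
    # Decode error handler: map each undecodable byte B to the
    # private-use code point U+EF00 + B (0x80..0xff -> U+EF80..U+EFFF).
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(0xEF00 + b) for b in bad), exc.end

codecs.register_error("optu", _optu_escape)

def fs_encode(raw: bytes) -> str:
    # Valid UTF-8 passes through unchanged, stray bytes become U+EF80..U+EFFF.
    return raw.decode("utf-8", errors="optu")

def fs_decode(name: str) -> bytes:
    # Reverse: fold U+EF80..U+EFFF back to raw bytes, encode the rest as UTF-8.
    return b"".join(bytes([ord(c) - 0xEF00]) if 0xEF80 <= ord(c) <= 0xEFFF
                    else c.encode("utf-8") for c in name)

def displayfs(name: str) -> str:
    # Display-only variant: map the escapes to their Latin-1 look-alikes.
    return re.sub("[\uEF80-\uEFFF]",
                  lambda m: chr(ord(m.group()) - 0xEF00), name)

raw = b"caf\xc3\xa9 \x88.wav"            # valid UTF-8 plus one stray byte
assert fs_decode(fs_encode(raw)) == raw  # lossless round trip
```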
Using the file browser, browsing directories that contain files with invalid UTF-8 in the file name causes websocketpp to close the connection with error::invalid_payload.
Note that in MODE=devel this simply restarts the GUI, so it can look like a "UI reload" on directory changes. Debugging therefore needs a MODE=prod build, where the UI looks like it takes forever to "load" directory contents, even though the connection has died.
AFAICS, WebSocketConnection::send_text() is getting called with a JSON message that contains invalid UTF-8. Example follows.
There are several issues at play here:
1) We need to catch connection shutdown in the UIs; electron should probably print a message and exit, web browsers could display <body>Connection lost</body> or similar.
2) The file sizes are negative, probably a signedness bug that needs fixing.
3) We need a way to deal with invalid UTF-8 in file names. There is no way to marshal such invalid UTF-8 names through JSON+WebSocket, and we are already using binary messages for DSP telemetry updates, so marshalling the JSON messages as binary is out of the question. Just ignoring non-UTF-8 file names on disk reads without feedback for the user would also suck; it would look like Anklang cannot see all files/dirs.
I don't know of any established de-facto mapping for non-UTF-8 ondisk-name -> valid-UTF-8 -> ondisk-name that would preserve basic file IO browsing functionality for users. What do other applications do?