qntm / base32768

Binary-to-text encoding highly optimised for UTF-16
MIT License
138 stars 5 forks source link

Case folding in base32768 #21

Closed ncw closed 1 year ago

ncw commented 1 year ago

We've been using your base32768 encoding in rclone via @Max-Sum 's go port https://github.com/Max-Sum/base32768 as a way to encode encrypted file names onto cloud storage systems. This seems particularly effective on OneDrive which seems to use UTF-16 internally.

However we noticed in this issue https://github.com/rclone/rclone/issues/6803 that there are some characters which can be case folded in the set of 32768 characters.

I wrote a little Go program to demonstrate this here: https://go.dev/play/p/SK5G4dnHM6T

This prints stuff like

Duplicate case folded rune ƃ into Ƃ
Duplicate case folded rune ƅ into Ƅ

and comes up with the summary

Found 521 case folding and 199 duplicate case folding characters out of 32896

Which means that there are 521 characters which have a case folded variant, but more importantly there are 199 characters which have both the upper case and lower case variants in the 32768 characters.

This is important because rclone generates file names with these characters and OneDrive is case insensitive. So there is a small chance that two different encrypted file names map to two strings which are the same when compared case insensitively.

Now, I think the probability of this is quite small. The minimum length of a file name is 16 bytes (un-encoded) so 128 bits which makes 9 base32768 characters.

So there is about a 0.006 chance any given character can be case folded. What we'd like to know is how many of these filenames would we have to put in a directory in order to have a 50% chance of having a case folded collision. I've had a few goes at working this out and I'm coming out with an answer of the order of 10²¹. I'm not sure I trust my maths here but its a big number, that I'm sure of.

We've been thinking about making a variant of base32768 which does not include both the upper and lower case versions of any characters which can be case folded.

Any thoughts?

qntm commented 1 year ago

Case sensitivity was not one of the things I took into consideration when selecting the character repertoire for Base32768 (or any of my other related encodings, come to that). Maybe it should have been, but this wouldn't be the first such oversight.

A Base32768 text can be as short as a single character, and its content is anything but random in most cases. Given what you've discovered, I would generally consider it unsuitable for use in case-insensitive scenarios like this one.

In your specific case, though, I think the real question is how you're handling collisions already. If Base32768 was totally robust there would be approximately 327689 ~= 4.36E40 possible unique filenames, yielding a likely collision after... some number of randomly-generated filenames. With what we've learned, we can assume instead that some characters are considered identical, so there are really more like (32768 - 199)9 ~= 4.12E40 possible unique filenames, yielding a likely collision after... some slightly lower number of randomly-generated filenames. I don't think those numbers are substantially different. If you consider the first case to be acceptable and you're not adding code to handle the possibility of collisions, then the second case doesn't look substantially worse, so I wouldn't be inclined to change anything. However, if these are encrypted versions of user-supplied filenames, then, in my opinion, the user has enough theoretical control over the encrypted output filename that they cannot be considered random. Someone could attempt to actively engineer a collision. I think that the possibility of a collision is something rclone needs to handle, regardless of Base32768's robustness.

Assuming that that code is already in place, my recommendation would be to e.g. systematically flatten all your Base32768 filenames to lower case before using them, then trust your existing filename collision handling code.

I will think more about the possibility of creating a Base32768 variant, although I would prefer to avoid that for now. If you do decide to go ahead with a project along those lines, be my guest, but please do not use the name "Base32768" or any derivation of it - avoiding confusion among versions of encodings is something I consider important.

qntm commented 1 year ago

Unless all the filenames are encrypted with the same routine, in which case a collision is theoretically impossible, so you won't have collision handling, in which case you should add some.

ncw commented 1 year ago

Case sensitivity was not one of the things I took into consideration when selecting the character repertoire for Base32768 (or any of my other related encodings, come to that). Maybe it should have been, but this wouldn't be the first such oversight.

I read your article on choosing a safe set of unicode characters - that is a lot of constraints!

A Base32768 text can be as short as a single character, and its content is anything but random in most cases. Given what you've discovered, I would generally consider it unsuitable for use in case-insensitive scenarios like this one.

I'd agree. However the fact that our minimum size of output is 9 characters probably means we can get away with it.

In your specific case, though, I think the real question is how you're handling collisions already.

I don't think collisions are possible with the way we are using base32768. We take a user filename in UTF-8, pad it up to a multiple of 16 bytes using PKCS7 padding, encrypt it using a wide ECB cipher then base32768 encode it. This has no randomness so given the same key, the same filename produces the same encoded output every time. This output is a minimum of 16 bytes long which encodes into 9 base32768 digits.

This is a property we rely on - we can open the file with the encrypted file name directly rather than decoding all the files in the directory to see if we get the right one.

I had real difficulty working out the exact probability of a collision with the 199 characters that case fold. I think it would take around 10²¹ file names (of length < 16 bytes) to create a collision. So I think for rclone, this is probably only a theoretical problem.

Perhaps all is needed is a note in the base32768 docs to say it isn't invariant under case folding.

I will think more about the possibility of creating a Base32768 variant, although I would prefer to avoid that for now. If you do decide to go ahead with a project along those lines, be my guest, but please do not use the name "Base32768" or any derivation of it - avoiding confusion among versions of encodings is something I consider important.

:+1:

PS I enjoyed your books very much :-)

qntm commented 1 year ago

I think the way to go is to add a check for collisions, which throws a big ol' exception, and don't expect it to ever be hit in practice.

PS I enjoyed your books very much :-)

Thanks!