Null terminated conversion

palako commented 3 years ago

When using the decode method, I'm passing a UCS-2 encoded buffer that contains a 0x0000 (u16) indicating the end of the string. The buffer is fixed size, but whatever happens after the 0x0000 is undefined, and is normally 0xFFFF. This results in the 0xFFFF getting decoded into a 3 bytes character, which eventually leads to an overflow in the output buffer (output buffer is an u8 array of twice the length of the input. For context, this scenario happens when reading the directory entries in a FAT16 filesystem.

To mitigate this I've had to write code to check if there's a null byte in the input buffer and if so, pass to decode only the slice that doesn't contain it. I'm not entirely unhappy with this solution but it took me many many hours of debugging and eventually stepping into the library code to find out what the problem was, so I thought it'd be nice for the library to either take this into account or provide another method that does. Something like decode_to_end_of_string or something like that. The result count of bytes would be the numbers of bytes converted till the null byte. Actually, I would do this the default and have the option of go past the null bites to decode the buffer in full, but I don't know enough about UCS-2 or UTF-8 to know which one should be the default behaviour.

I'm happy to write a pull request moving the code I wrote and a test if that helps.

Thanks for your crate!

GabrielMajeri commented 3 years ago

Thanks for the bug report!

First, regarding the overflow situation: in theory, a directory entry could be made up completely of 3-byte characters, so twice the length of the input might not be enough anyway :thinking:

Regarding the portion after the null terminator, it contains undefined characters (0xFFFF is not valid Unicode) and could be considered an invalid input by the function. A sensible solution would be to emulate the behavior of the standard Rust function str::from_utf8, which returns an error if the input string is invalid. We could update the decode function to handle the 0xFFFF invalid character.

palako commented 3 years ago

Yes, you're absolutely right that my output buffer should be at least three times the input! Maybe something to be added to the method documentation since it's not common knowledge and a potential source of buffer overruns problems, which was my case. I also agree with the str::from_utf8 idea. That's actually what I'm doing in my code, I'm reading ucs2 with decode, going from buffer of u16 to buffer of u8, only to then have the contents in a string slice by str::from_utf8. It was cool to learn everything I did in the progress of finding my problem, but from a being productive point of view, it would have been a lot more convenient to have a method that returns a &str from a ucs-2 encoded array of u16. I don't even think what I reported is a bug, there could be a case when someone just wants everything in the input decoded into utf-8 regardless of string termination characters. I would treat this more like an extra feature and maybe some documentation clarification, or even a restriction, forcing that input array to be at least 3 times the input.

GabrielMajeri commented 3 years ago

I've created a PR which updates the documentation of decode/decode_with to describe the maximum potential size of the output.

As for providing a convenience function which converts an array of u16 to a &str, if you open a PR with your proposed implementation, then we can check it out and see how it fits with the rest of the API :)

rust-osdev / ucs2-rs

Null terminated conversion #9