Support for UTF-8 encoding

holzschu / ios_system

Drop-in replacement for system() in iOS programs

BSD 3-Clause "New" or "Revised" License


Support for UTF-8 encoding #21

Closed ian-mcdowell closed 6 years ago

ian-mcdowell commented 6 years ago

OpenTerm passes UTF-8 commands to ios_system and expects UTF-8 output from commands. However, argument parsing and command parsing convert the string to ASCII, which causes data loss.

Try: echo 😀 in OpenTerm to see results.

holzschu commented 6 years ago

That is a very good question. I'll look into it, but most Unix commands expect a C-string as input. There might not be a way to keep UTF-8 characters through the entire pipeline.

At some point, the arguments have to be (int argc, char **argv). I do not know of a way to do UTF-8 -> char -> UTF-8 while keeping UTF-8 information. I'll have to learn.

ian-mcdowell commented 6 years ago

-[NSString cStringUsingEncoding:NSUTF8StringEncoding]?

holzschu commented 6 years ago

Sorry, strike my previous answer. ios_system() operates on char*, because system() operates on char*. There is no operation on the encoding inside ios_system(). The conversion was done in CommandExecutor, where there is the line ios_system(command.utf8CString) which is equivalent to [command cStringUsingEncoding:NSUTF8StringEncoding] (but shorter).

We need to track the result of this command.utf8CString conversion, but my guess is that it removed the UTF-8 content to make it compatible with C-like strings. This conversion cannot be reversed.

ian-mcdowell commented 6 years ago

Take a look at parseArgument. You’re converting the arguments to ASCII.

If you take a UTF-8 C string (char *), then call -[NSString initWithCString:encoding:] with NSASCIIStringEncoding to convert it to an NSString, you have lost the UTF-8 characters. Even if you later call -[NSString cStringUsingEncoding:NSUTF8StringEncoding], they will be gone.
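To illustrate why an ASCII-only decode cannot represent the emoji, here is a minimal C sketch. It is not code from ios_system; the helper names `is_7bit_ascii` and `count_high_bytes` are hypothetical. It only shows that every byte of a multi-byte UTF-8 sequence lies outside the 7-bit ASCII range, while the `char *` itself carries those bytes unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true when every byte of s fits in 7-bit ASCII. */
bool is_7bit_ascii(const char *s) {
    assert(s != NULL);
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p > 0x7F) {
            return false;
        }
    }
    return true;
}

/* Hypothetical helper: counts the bytes of s that have the high bit set.
   Every byte of a multi-byte UTF-8 sequence has it set. */
size_t count_high_bytes(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if (*p & 0x80) {
            n++;
        }
    }
    return n;
}
```

U+1F600 is encoded in UTF-8 as the four bytes F0 9F 98 80, so `count_high_bytes` reports 4 for `echo 😀` and `is_7bit_ascii` reports false. The information is present in the C string; it is the ASCII interpretation that has no place for it.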

ian-mcdowell commented 6 years ago

All that should be necessary to fix this (I've verified locally) is to change all occurrences of NSASCIIStringEncoding to NSUTF8StringEncoding in ios_system.m

diff.txt

However, if we want to support encodings other than UTF-8 in the future, more work would be needed.

holzschu commented 6 years ago

You're absolutely right. It works now. Gonna make a short PR for that. The effects are... interesting. Not all commands work, but most of them do.

ian-mcdowell commented 6 years ago

Looks like ls fails, that's it? Overall, looks like a nice improvement to what we currently have.

ian-mcdowell commented 6 years ago

Investigating ls. Looks like the relevant code is at file_cmds_ios/ls/util.c line 113. iswprint(wc) fails, so it prints a ?

Removing that iswprint check allows emoji to display, but I'm sure there's a reason that check is there.

holzschu commented 6 years ago

According to the doc, it should return a nonzero value if the character is printable, according to the currently installed locale: http://en.cppreference.com/w/c/string/wide/iswprint
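The pattern under discussion can be sketched roughly as follows. This is not the code from file_cmds_ios/ls/util.c, only an illustration of a decode-then-`iswprint` check; `first_display_char` is a hypothetical name. Both `mbrtowc` and `iswprint` consult the current LC_CTYPE locale, so the result for non-ASCII input depends on which locale is active.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: decodes the first multibyte character of s using
   the current locale and returns it if the locale classifies it as
   printable. Returns L'?' when decoding fails, when s is empty, or when
   the character is not printable in the current locale. */
wchar_t first_display_char(const char *s) {
    assert(s != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);
    wchar_t wc = L'\0';
    size_t n = mbrtowc(&wc, s, strlen(s), &st);
    if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
        return L'?';
    }
    return iswprint((wint_t)wc) ? wc : L'?';
}
```

ASCII letters pass this check in any locale and control characters such as a newline do not. Whether an emoji passes depends on the locale in effect, which is why the locale questions that follow matter.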

ian-mcdowell commented 6 years ago

Do we need to define LANG, LC_ALL, etc in the environment?

holzschu commented 6 years ago

Defining LC_ALL to en_US.UTF-8: doesn't work. Calling setlocale(LC_ALL, "en_US.UTF-8") in ios_system.m: doesn't work. Calling setlocale(LC_ALL, "en_US.UTF-8") in ls.c: doesn't work. It does work on my Mac.

ian-mcdowell commented 6 years ago

What does setlocale(LC_ALL, NULL) return? That should get you the current locale that C is using.
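The query form mentioned above, together with a check of the return value when switching locales, can be sketched in C as follows. The helper names `try_set_locale` and `current_locale_name` are hypothetical. `setlocale` returns NULL and leaves the locale unchanged when the requested locale is unavailable, so the return value is worth checking.

```c
#include <assert.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: tries to switch LC_ALL to the named locale.
   Returns false when setlocale refuses the request; in that case the
   previously active locale stays in effect. */
bool try_set_locale(const char *name) {
    assert(name != NULL);
    return setlocale(LC_ALL, name) != NULL;
}

/* Hypothetical helper: returns the name of the locale currently in
   effect. Passing NULL as the locale makes setlocale a pure query. */
const char *current_locale_name(void) {
    return setlocale(LC_ALL, NULL);
}
```

A caller could log `current_locale_name()` before and after `try_set_locale("en_US.UTF-8")` to see whether the switch took effect. A C program starts in the "C" locale until `setlocale` succeeds with something else, regardless of what the LC_ALL environment variable says.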

holzschu commented 6 years ago

It says "C", which explains the behaviour of iswprint but is inconsistent with LC_ALL, because setlocale doesn't accept en_US.UTF-8.

Update: found a similar question on stackoverflow: https://stackoverflow.com/questions/32594377/setlocale-works-in-ios-simulator-but-fails-on-device

setlocale depends on access to /usr/share/locale, which is not possible in the sandbox. I'm going to deactivate all iswprint calls.
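As an aside, one locale-independent alternative to dropping the check entirely would be to mask only ASCII control bytes and pass everything else through. The sketch below is an assumption for illustration, not what the fix in this thread does; `mask_ascii_controls` is a hypothetical name, and the code assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: replaces ASCII control bytes (0x00-0x1F and 0x7F)
   with '?' in place and leaves every byte >= 0x80 untouched, so
   multi-byte UTF-8 sequences survive. It does not validate UTF-8 and
   never consults the locale. Returns the number of bytes replaced. */
size_t mask_ascii_controls(char *s) {
    assert(s != NULL);
    size_t replaced = 0;
    for (unsigned char *p = (unsigned char *)s; *p != '\0'; p++) {
        if (*p < 0x20 || *p == 0x7F) {
            *p = '?';
            replaced++;
        }
    }
    return replaced;
}
```

The trade-off is that non-ASCII control characters, such as the C1 range U+0080 to U+009F, are not filtered, since recognising them would require decoding the UTF-8 sequences.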

holzschu commented 6 years ago

Fixed with 6c22c28. I guess I can close this issue now?