quixdb / squash

Compression abstraction library and utilities
https://quixdb.github.io/squash/
MIT License
406 stars 53 forks source link

Provide wide character versions of functions which use strings #158

Closed nemequ closed 9 years ago

nemequ commented 9 years ago

Squash doesn't actually use strings too heavily, but there are still a lot of places where we apparently need to expose wchar_t versions to be nice to people writing internationalized Windows programs.

Obviously that last one is the major issue. These are just supposed to be convenience wrappers for programmers, but if we have to provide wide-char versions it is a major headache… It may be better to just remove them and move the SquashCodec* codec versions to the short ones. It's not like it's particularly difficult to call squash_get_codec once, it encourages people to do the right thing instead of wasting cycles on implicit lookups from passing around codecs as strings. Need to make a determination soon, though, since it would be a pretty big API/ABI break.

For functions which will have wchar_t versions, we should provide text-generic mappings.

Dr-Emann commented 9 years ago

I vote for the removal of the convenience versions.

nemequ commented 9 years ago

I vote for the removal of the convenience versions.

Good to know, thanks. One of the reservations I had about this was the extra work such a change would create for you.

jibsen commented 9 years ago

I think if you can avoid this, it will probably save you a lot of heacache.

Allowing the codec names to be more than ASCII is unnecessary. Also, tchar.h is Microsoft specific.

I have a wchar_t version of parg, but did not release it because the use seemed limited.

nemequ commented 9 years ago

I think if you can avoid this, it will probably save you a lot of heacache.

I think the best way to (mostly) avoid the issue is to drop the convenience versions. Then you have one function which needs to accept wide char versions (squash_get_codec) instead of scattering them all over the API.

Allowing the codec names to be more than ASCII is unnecessary.

I agree, but allowing people to look up codec names in the charset for their current locale makes sense.

Also, tchar.h is Microsoft specific.

RIght, we would have to provide a text-generic header for Squash just like Windows provides tchar.h for the Windows API.

Another interesting possibility is to use C11 generic selections to overload the short versions (i.e., have squash_foo invoke either squash_foo(SquashCodec*), squash_foo_name(const char*), or squash_foo_wname(wchar_t*) depending on how it is called.

I'm pretty confident MSVC doesn't support generic selections right now, but it only took them until VS 2015 to support C99, so maybe in 2027 they'll support C11…

I have a wchar_t version of parg, but did not release it because the use seemed limited.

I'm not sure about that; it would be useful for anyone using the command line with a 16-bit charset for their locale…

jibsen commented 9 years ago

I'm not sure about that; it would be useful for anyone using the command line with a 16-bit charset for their locale…

On Windows there are API calls to get the unicode argv, or you can use wmain. I don't know how that would work on linux.

There are some SO answers here and here related to this question. It seems a little confusing with the difference between tab-completion and direct entry, but it sounds like you can just use regular char argv and it will mostly work for filenames.

nemequ commented 9 years ago

I haven't ever written an application with an internationalized CLI, but my understanding is that argv is in the encoding specified by the system locale. You can use iconv to convert to UTF-8 if you need (which is typically stuffed in a char*). That said, I have no idea what the Windows equivalent of iconv is…

My main concern isn't the CLI, it is that the library be easily accessible for internationalized applications. AFAICT we basically have two choices; first, for for the few functions which accept or return a string to convert between the locale's encoding and UTF-8, which we would use internally. The second possibility is to simply document that all functions only accept/return UTF-8, and leave it up the the caller to convert if necessary.

My current inclination is to make everything UTF-8, but TBH I'm not really sure which is considered to be a better practice. In some places, this would mean that we would have to convert back to the locale's encoding before passing the value to another function (e.g., in squash_file_open_codec_with_options before calling fopen).

For Windows, my understanding (and, just a reminder, I'm pretty ignorant when it comes to Windows) is that, as a library, the right thing to do would be to provide a variant of the function which takes/returns wchar_t* instead of char*. These would generally just be wrappers which convert between UCS-2 (for the wchar_t* versions) and UTF-8 (for the char* versions) and call the char* versions…

nemequ commented 9 years ago

That said, I have no idea what the Windows equivalent of iconv is…

Thank you glib. Due to the Google Code shutdown the project is now on GitHub, should work well as a submodule.

nemequ commented 9 years ago

This is probably true, if they can be looked up from some list, static pointers, constants, or defined macros. But, if they are user defined they may as well contain non-ASCII characters. In this case, the library would have to require those strings to be UTF-8 encoded or accept wchar_t strings. This is up to the designer/author of the library.

I think it is safe to require codec names be ASCII. They are determined by people writing plugins, which really isn't an end-user activity. Right now, codec names are restricted to alphanumeric and '-', and I think that's a pretty reasonable restriction.

The reason this is currently a big issue is we have tons of functions which look like this:

void squash_do_something_codec (SquashCodec* codec) {
  // …
}

void squash_do_something (const char* codec) {
  squash_do_something_codec (squash_get_codec (codec));
}

I've pretty much decided to just get rid of the wrappers, and have squash_do_something (SquashCodec* codec) instead. So the only place this is really an issue is the squash_get_codec function.

That said, I have no idea what the Windows equivalent of iconv is...

These are mbstowcs() and wcstombs(). Or, MultiByteToWideChar() and WideCharToMultiByte().

I rather like the idea of just using win-iconv. It should help keep the code a bit cleaner since we wouldn't have #ifdefs sprinkled everywhere, and the interface is a lot simpler.

My current inclination is to make everything UTF-8, but TBH I'm not really sure which is considered to be a better practice.

I would not recommend this. I would rather recommend to bite the bullet and provide ANSI (char) and UNICODE (wchar_t) versions for proper Windows support, especially as a library.

Sorry, looking back I realize I wasn't very clear about this. What I was talking about is whether the char* interface should assume you're giving it UTF-8 data or data in the locale's encoding. There would still be wchar_t* wrappers for those functions, the question is just whether those wrappers convert from UCS-2 to UTF-8 or the locale's encoding when they call the char* versions.

nemequ commented 9 years ago

I've been working on this in the wip/api branch. It's mostly finished, but the current version is causing a failure on OS X. My current theory is that the problem is in the charset conversion code, but debugging it is tricky without access to OS X; I'm hoping #164 will make it a bit easier, so I plan to work on that next.

nemequ commented 9 years ago

Got this working on OS X (probably only because I bypassed squash_file_vwprintf, that function still definitely needs some unit tests) and pushed to master. If anyone has any objection to the API please speak up soon.