Filesystem path UTF-8 handing is inconsistent / UTF-8 interfaces aren't clearly documented

tehsausage commented 4 years ago

al_draw_text and related functions are not clear about the fact that they treat input strings as UTF-8

Although the documentation page makes many references to unicode, its not specified that the char*s passed to text drawing functions will be interpreted as UTF-8 strings, as opposed to an 8-bit character set.

Likewise, all functions that accept a filename are unclear about the fact that the filename string is treated as UTF-8, which is important on Windows where a utf8->utf16 conversion is done on the filename.

tehsausage commented 4 years ago

List of Allegro interfaces taking filenames as const char*:

* al_load_config_file
* al_save_config_file
* al_fopen
* al_fopen_interface
* al_make_temp_file

* al_create_path
* al_create_path_for_directory
* al_set_path_drive
* al_append_path_component
* al_insert_path_component
* al_replace_path_component
* al_set_path_filename
* al_set_path_extension

* al_create_fs_entry
* al_filename_exists
* al_remove_filename
* al_change_directory
* al_make_directory

* al_load_bitmap
* al_load_bitmap_flags
* al_save_bitmap
* al_identify_bitmap
* al_register_bitmap_loader (extension only)
* al_register_bitmap_saver (extension only)
* al_register_bitmap_loader_f (extension only)
* al_register_bitmap_saver_f (extension only)

* al_attach_shader_source_file

* al_register_sample_loader (extension only-ish)
* al_register_sample_loader_f (extension only)
* al_register_sample_saver (extension only-ish)
* al_register_sample_saver_f (extension only)
* al_register_audio_stream_loader (extension only-ish)
* al_register_audio_stream_loader_f (extension only)
* al_load_sample
* al_load_sample_f (extension only)
* al_load_audio_stream
* al_load_audio_stream_f (extension only)
* al_save_sample
* al_save_sample_f (extension_only)

* al_load_bitmap_font
* al_load_bitmap_font_flags
* al_load_ttf_font
* al_load_ttf_font_f
* al_load_ttf_font_stretch
* al_load_ttf_font_stretch_f

* al_open_video

* al_create_native_file_dialog (also filename related)
* al_get_native_file_dialog_path (also filename related)

List of Allegro interfaces returning filenames as as const char*:

* al_get_path_drive
* al_get_path_tail
* al_get_path_filename
* al_get_path_basename
* al_get_path_extension
* al_path_cstr

* al_get_fs_entry_name (UTF-8 is documented! ^1)
* al_get_current_directory

List of Allegro functions interpreting const char* strings as UTF-8 for rendering text:

* al_get_text_width
* al_draw_text
* al_draw_justified_text
* al_draw_textf
* al_draw_justified_textf
* al_draw_multiline_text
* al_draw_multiline_textf
* al_do_multiline_text

^1 al_get_fs_entry_name interestingly mentions issues where the filesystem might not be UTF-8, presumably on Linux when the locale is not set to a UTF-8 one:

Note that the filesystem encoding may not be known and the conversion to UTF-8 could in very rare cases cause this to return an invalid path.

Window titles are also another point where the encoding of strings inside of the program are relevant to the external environment.

As implemented, like with filenames, the Windows code-path is assuming the strings are UTF-8.

List of Allegro functions interpeting const char* as platform-dependant strings:

* al_set_window_title
* al_set_new_window_title
* al_get_new_window_title

And a similar story for other strings presented using native UI components:

* al_create_native_file_dialog
* al_show_native_message_box
* al_open_native_text_log
* al_append_native_text_log
* al_append_menu_item
* al_insert_menu_item
* al_get_menu_item_caption
* al_set_menu_item_caption

There are a lot of functions, so I think this could all be covered by a general statement on the highest level documentation page (currently the "Getting started guide").

As it is, the state of Allegro seems to be that: "All text drawing functions treat strings as UTF-8. All filesystem, and native UI functions treat strings as UTF-8 most of the time -- but sometimes not on some platforms.".

As far as I am aware, all of the current platforms Allegro targets generally use UTF-8 for all of these filesystem and native dialog strings, except for Windows, where Allegro uses a conversion routine to convert between UTF-8 and the native encoding.

That would make the only exception to the UTF-8 rule... Linux with a non-UTF-8 locale set?

Does it make sense to commit to a consistent UTF-8 interface for all platforms, i.e., Allegro doing native<->UTF-8 conversions wherever required for all current and future platforms -- which would probably mean adding appropriate conversions for non-UTF-8 Linux systems?

I think this would make sense, as there are functions in the library with both a "ustr" and "cstr" variant (e.g. al_draw_text and al_draw_ustr, al_path_cstr and al_path_ustr) -- and the "cstr" variant simply calls the "ustr" variant, implying that const char*s are already being generally treated identically to something that is explicitly tagged a UTF-8 string.

Another alternative to making UTF-8 the gold standard for Allegro's filesystem interface could be to add functions to query if const char * file paths are being interpreted as UTF-8 or something else:

e.g. bool al_file_paths_are_utf8(void)

And maybe adding a set of functions to convert between the native filesystem character set and UTF-8? Or changing ALLEGRO_PATH to always represent the native filesystem character set, similar to the role of std::filesystem::path in C++.

e.g.:

document that file extensions and file picker pattern strings should only use the 7-bit ASCII character set
deprecate al_get_fs_entry_name(), and add al_get_fs_entry_path() to replace it
add al_create_path_ustr() to create paths from utf-8 strings
add al_[set/append/replace]path[drive/component/filename/extension]_ustr() to operate with ALLEGRO_PATH using utf-8 strings
have al_path_ustr() actually do the conversion from native charset to UTF-8

This would give a reliable way to interact with the filesystem using UTF-8, but would leave the usage of all the interfaces in Allegro taking const char *, such as al_load_bitmap(), quite garbage to use:

ALLEGRO_BITMAP* bmp

if (al_file_paths_are_utf8())
{
    bmp = al_load_bitmap("💩.jpg");
}
else
{
    ALLEGRO_PATH* path = al_create_path_ustr(al_ref_cstr("💩.jpg"));
    bmp = al_load_bitmap(al_path_cstr(path));
    al_destroy_path(path);
}

beoran commented 4 years ago

Generally, I think that the idea is that the allegro api should be consistent and easy to use. Therefore, ideally, all API that use char * are supposed to be used with utf-8, since that is the universal character encoding. If there are functions that do not use utf-8 but some local encoding, then this should be changed to also use utf-8.

Allegro should then automatically convert the utf-8 to whatever is needed on the underlying platform, like how it is doing today for Windows.

A PR documenting this would be welcome.

tehsausage commented 4 years ago

I'm going to switch the title of the bug around since the problem that needs to be addressed is the lack of consistency in filename handling on different platforms.

Some discussion happened on IRC and here's a basic summary of how filenames in Allegro work on different platforms right now:

Filenames on Linux in Allegro are not treated specially as UTF-8 strings, however as most Linux systems are running with a UTF-8 charset locale, filenames are effectively UTF-8 as far as most people are concerned, and thus:

A filename obtained using the Allegro filesystem API is UTF-8, and passing it to al_draw_text will display the file path the way the user would see it in their file browser.
Filenames can be processed using the UTF-8 text functions in Allegro.
Source code is often written in UTF-8, and copy-pasting a file path from the system in to UTF-8 source code will just work.
Filenames received through external sources (argv, getenv) should all be UTF-8 encoded as well, and "just work" regardless of if you pass them to fopen() or al_fopen().

It is possible to have a Linux system using a non UTF-8 charset locale, however, and the result is:

A filename obtained using the Allegro filesystem API is not necessarily UTF-8, and passing it to al_draw_text will display a mangled representation of any non-ASCII characters.
Filenames cannot be processed using the UTF-8 text functions in Allegro.
Source code is often written in UTF-8, and copy-pasting a file path from the system in to UTF-8 source code will NOT work.
Filenames received through external sources (argv, getenv) should all be native locale encoded, and "just work" regardless of if you pass them to fopen() or al_fopen().

Filenames on Windows in Allegro ARE treated as if they were as UTF-8 strings (despite the system charset never actually being UTF-8), but Allegro does conversion to/from UTF-8 transparently, and the result is:

A filename obtained using the Allegro filesystem API is UTF-8 (because Allegro converts it to UTF-8), and passing it to al_draw_text will display the file path the way the user would see it in their file browser.
Filenames can be processed using the UTF-8 text functions in Allegro.
Source code is often written in UTF-8, and copy-pasting a file path from the system in to UTF-8 source code will just work.
How filenames are received through external sources will depend on how the user application is an "ANSI" or "UNICODE" executable:
- An ANSI program will receive system charset encoded filenames through argv, getenv, etc. This means a filename received through argv will "just work" if you pass it to fopen() (which interprets paths as native charset), but NOT work when passed to al_fopen() (which interprets paths as UTF-8).
- A UNICODE program will receive utf-16 / "wchar" encoded filenames through the unicode argv, getenv, etc. equivalents. A filename received from the system will "just work" if you pass it to the unicode equivalent of fopen: _wfopen() (which interprets paths as UTF-16) but be incompatible with al_fopen() (which cannot accept a wchar_t string).

The things that would break if Allegro's filenames were to always be interpreted as UTF-8 strings via conversion where required:

Firstly: An Allegro program might be taking filenames from external sources (argv, for example); and passing them through to al_fopen():

On a Windows system (where file paths are already always treated as UTF-8 by Allegro), al_fopen(argv[1]) doesn't work, as argv[1] will be encoded in the system's 8-bit character set (e.g. CP1252), and al_fopen() assumes UTF-8 file paths on Windows. Windows users already have to cope with doing a mbstring -> utf-8 conversion in order to use al_fopen safely with strings obtained from the environment.
On Linux systems with a UTF-8 locale, al_fopen(argv[1]) already just works, and would continue to just work (unless al_fopen were to also require valid UTF-8), and continue to be a bad idea, since its not portable to Windows.
On Linux systems with a non-UTF-8 locale, al_fopen(argv[1]) would break, despite presently working due to al_fopen() presently being "transparent" to the encoding of filenames.

Secondly: An Allegro program built for Linux will potentially change behavior depending on the current locale's character set when it is executed. This is a big deal because a program extracted on a system with a UTF-8 locale becomes potentially incompatible with a program extracted on a non-UTF-8 locale, and vice-versa. This is something that "just works" currently because of al_fopen()'s current "transparency" to filename encoding.

Thirdly: If the UTF-8 strings had to be valid, then Allegro programs running on Linux with a UTF-8 locale would become unable to use any files containing filenames which are invalid UTF-8 (such as a user drag+dropping a weirdly named image file in to an Allegro program that wishes to use al_load_bitmap or something on it). Most programs on Linux do not have this issue, because they treat filenames transparently.

The third one is the only real regression that a non-obscure system might experience. It could be avoided by not requiring that filenames necessarily be valid UTF-8 strings, but then philosophically, that would break the desired situation that character arrays passed to Allegro should always be (valid) UTF-8.

My current opinion:

As the "majority platform" (Windows) already cannot safely pass externally acquired filenames (e.g. argv[1]) directly through to al_fopen. Breaking the equivalent code on non-UTF-8 Linux systems (by having filenames passed to Allegro converted from UTF-8 to native strings) is probably not such a big deal, as that code was never portable to begin with. It would definitely be worthy of a release notes mention and a version bump, though.

However, going out of the way to break the ability to open some files on UTF-8 Linux systems (where its possible for filenames to exist that do not have a valid UTF-8 representation) by explicitly making al_fopen non-transparent, should be avoided if it can be determined that "all strings passed in/out of Allegro are valid UTF-8 strings" is not a desirable requirement. This would really just be a "theoretically broken according to the rules" kind of deal, unless someone adds assert(is_valid_utf8(filename)) to various parts of Allegro.

By poking a hole in the UTF-8-ness requirements of Allegro, and saying that strings do not necessarily have to be valid UTF-8, we would instead document that certain functions may have unexpected/unpredictable behavior if passed invalid UTF-8 strings. But it allows for UTF-8 locale Linux to continue to be able to work with all possible filenames.

Note: This is not an issue on Windows, because filesystem paths on Windows are always addressable by a sequence of Unicode characters.

My proposed changes to Allegro are:

Document that all strings passed in and out of allegro are to be interpreted as UTF-8, [and, optionally, that not all strings are necessarily valid UTF-8].
For non-UTF-8 POSIX systems, add transparent conversions from UTF-8 to native charset strings in all of the same places that Windows currently does a UTF-8 to wide string conversion, and vice-versa.
Optionally: Add two new functions that complement al_set_standard_file_interface() and al_set_standard_fs_interface(): al_set_native_file_interface() and al_set_native_fs_interface(). These would be identical to the standard interface, but bypass any UTF-8 conversions, for example:
- On Windows, with the standard file interface: al_fopen("Hello") is functionally: _wfopen(utf8toutf16("Hello"))
- On Windows, with the proposed native file interface: al_fopen("Hello") is functionally fopen("Hello") (note: Unfortunately, Windows wide character paths are using wchar_t instead of char, so cannot be passed through al_fopen. The "native" interface on Windows would be limited to working with filenames which are representable in the system's 8-bit character set)
- On Linux, with the standard file interface: al_fopen("Hello") is functionally fopen(utf8tombstring("Hello") (note: on utf-8 locales, utf8tombstring would be a no-op)
- On Linux, with the proposed native file interface: al_fopen("Hello") is functionally fopen("Hello")

Note that having al_setnative[file/fs]_interface() functions would imply that invalid UTF-8 strings should be allowed to pass through the Allegro library.

Allowing invalid UTF-8 strings probably has implications for things like Rust/Python wrappers.

Alternatively, if invalid UTF-8 is to be entirely disallowed:

Document that some files on the filesystem may be inaccessible to Allegro by string path names. al_get_fs_entry_name() already documents that some filenames may not be able to be returned due to them not being representable in UTF-8, and makes an allowance for a return value of NULL in that case. As implemented, though, the function never fails for encoding reasons, I believe.
Properly implement al_open_fs_entry() as the one magic feature that is able to open filenames that are invalid UTF-8 strings. This would require a change to its implementation, which currently effectively is equivalent to calling al_fopen() with the path name. Possible solutions: 1) The standard filesystem interface becomes "magic" in terms of coping with non-UTF-8 filenames on Linux in a way that allows al_open_fs_entry() to work. 2) ALLEGRO_FILE_INTERFACE changes to specify that fi_fopen's path parameter is not necessarily valid UTF-8 3) ALLEGRO_FILE_INTERFACE changes to add an optional fi_fopen_native function which takes a non UTF-8 pathname

PhysFS probably should be considered as well, since it is able to provide a view of the filesystem to Allegro, as well as provide access to archive formats where filenames may not necessarily be Unicode, even on Windows. I do not believe that the "All strings flowing through Allegro must be UTF-8" idea does hold, or should hold, when considering PhysFS or other potential custom file and filesystem interfaces.

beoran commented 4 years ago

First of all, about filenames, or other strings received through external sources: I think it's up to the programmer to make sure they are converted to UTF-8, since allegro can't know that a char * is a valid utf-8 string or not.

Linux system using a non UTF-8 charset locale are something that we didn't consider, in that case conversions in the underlying implementation of Allegro functions might be needed, although I assume nowadays few people are using such a set up of allegro5 + Linux + non-UTF-8 locale.

If you want to work on this I suggest you start with documenting the current behaviour. As a second step would be fixing the Allegro API so it works correctly on non UTF-8 Linux.

rmbeer commented 3 years ago

It works perfectly on Linux, UTF-8 was designed to be compatible with ASCII, or size in byte, and especially with "char*".

This is more of a Windows problem, I suggest changing the label of the issue.

liballeg / allegro5

Filesystem path UTF-8 handing is inconsistent / UTF-8 interfaces aren't clearly documented #1180