liballeg / allegro5

The official Allegro 5 git repository. Pull requests welcome!
https://liballeg.org
Other
1.89k stars 285 forks source link

Filesystem path UTF-8 handing is inconsistent / UTF-8 interfaces aren't clearly documented #1180

Open tehsausage opened 4 years ago

tehsausage commented 4 years ago

al_draw_text and related functions are not clear about the fact that they treat input strings as UTF-8

Although the documentation page makes many references to unicode, its not specified that the char*s passed to text drawing functions will be interpreted as UTF-8 strings, as opposed to an 8-bit character set.

Likewise, all functions that accept a filename are unclear about the fact that the filename string is treated as UTF-8, which is important on Windows where a utf8->utf16 conversion is done on the filename.

tehsausage commented 4 years ago

List of Allegro interfaces taking filenames as const char*:

* al_load_config_file
* al_save_config_file
* al_fopen
* al_fopen_interface
* al_make_temp_file

* al_create_path
* al_create_path_for_directory
* al_set_path_drive
* al_append_path_component
* al_insert_path_component
* al_replace_path_component
* al_set_path_filename
* al_set_path_extension

* al_create_fs_entry
* al_filename_exists
* al_remove_filename
* al_change_directory
* al_make_directory

* al_load_bitmap
* al_load_bitmap_flags
* al_save_bitmap
* al_identify_bitmap
* al_register_bitmap_loader (extension only)
* al_register_bitmap_saver (extension only)
* al_register_bitmap_loader_f (extension only)
* al_register_bitmap_saver_f (extension only)

* al_attach_shader_source_file

* al_register_sample_loader (extension only-ish)
* al_register_sample_loader_f (extension only)
* al_register_sample_saver (extension only-ish)
* al_register_sample_saver_f (extension only)
* al_register_audio_stream_loader (extension only-ish)
* al_register_audio_stream_loader_f (extension only)
* al_load_sample
* al_load_sample_f (extension only)
* al_load_audio_stream
* al_load_audio_stream_f (extension only)
* al_save_sample
* al_save_sample_f (extension_only)

* al_load_bitmap_font
* al_load_bitmap_font_flags
* al_load_ttf_font
* al_load_ttf_font_f
* al_load_ttf_font_stretch
* al_load_ttf_font_stretch_f

* al_open_video

* al_create_native_file_dialog (also filename related)
* al_get_native_file_dialog_path (also filename related)

List of Allegro interfaces returning filenames as as const char*:

* al_get_path_drive
* al_get_path_tail
* al_get_path_filename
* al_get_path_basename
* al_get_path_extension
* al_path_cstr

* al_get_fs_entry_name (UTF-8 is documented! ^1)
* al_get_current_directory 

List of Allegro functions interpreting const char* strings as UTF-8 for rendering text:

* al_get_text_width
* al_draw_text
* al_draw_justified_text
* al_draw_textf
* al_draw_justified_textf
* al_draw_multiline_text
* al_draw_multiline_textf
* al_do_multiline_text

^1 al_get_fs_entry_name interestingly mentions issues where the filesystem might not be UTF-8, presumably on Linux when the locale is not set to a UTF-8 one:

Note that the filesystem encoding may not be known and the conversion to UTF-8 could in very rare cases cause this to return an invalid path.


Window titles are also another point where the encoding of strings inside of the program are relevant to the external environment.

As implemented, like with filenames, the Windows code-path is assuming the strings are UTF-8.

List of Allegro functions interpeting const char* as platform-dependant strings:

* al_set_window_title
* al_set_new_window_title
* al_get_new_window_title

And a similar story for other strings presented using native UI components:

* al_create_native_file_dialog
* al_show_native_message_box
* al_open_native_text_log
* al_append_native_text_log
* al_append_menu_item
* al_insert_menu_item
* al_get_menu_item_caption
* al_set_menu_item_caption

There are a lot of functions, so I think this could all be covered by a general statement on the highest level documentation page (currently the "Getting started guide").

As it is, the state of Allegro seems to be that: "All text drawing functions treat strings as UTF-8. All filesystem, and native UI functions treat strings as UTF-8 most of the time -- but sometimes not on some platforms.".


As far as I am aware, all of the current platforms Allegro targets generally use UTF-8 for all of these filesystem and native dialog strings, except for Windows, where Allegro uses a conversion routine to convert between UTF-8 and the native encoding.

That would make the only exception to the UTF-8 rule... Linux with a non-UTF-8 locale set?

Does it make sense to commit to a consistent UTF-8 interface for all platforms, i.e., Allegro doing native<->UTF-8 conversions wherever required for all current and future platforms -- which would probably mean adding appropriate conversions for non-UTF-8 Linux systems?

I think this would make sense, as there are functions in the library with both a "ustr" and "cstr" variant (e.g. al_draw_text and al_draw_ustr, al_path_cstr and al_path_ustr) -- and the "cstr" variant simply calls the "ustr" variant, implying that const char*s are already being generally treated identically to something that is explicitly tagged a UTF-8 string.


Another alternative to making UTF-8 the gold standard for Allegro's filesystem interface could be to add functions to query if const char * file paths are being interpreted as UTF-8 or something else:

e.g. bool al_file_paths_are_utf8(void)

And maybe adding a set of functions to convert between the native filesystem character set and UTF-8? Or changing ALLEGRO_PATH to always represent the native filesystem character set, similar to the role of std::filesystem::path in C++.

e.g.:

This would give a reliable way to interact with the filesystem using UTF-8, but would leave the usage of all the interfaces in Allegro taking const char *, such as al_load_bitmap(), quite garbage to use:

ALLEGRO_BITMAP* bmp

if (al_file_paths_are_utf8())
{
    bmp = al_load_bitmap("💩.jpg");
}
else
{
    ALLEGRO_PATH* path = al_create_path_ustr(al_ref_cstr("💩.jpg"));
    bmp = al_load_bitmap(al_path_cstr(path));
    al_destroy_path(path);
}
beoran commented 4 years ago

Generally, I think that the idea is that the allegro api should be consistent and easy to use. Therefore, ideally, all API that use char * are supposed to be used with utf-8, since that is the universal character encoding. If there are functions that do not use utf-8 but some local encoding, then this should be changed to also use utf-8.

Allegro should then automatically convert the utf-8 to whatever is needed on the underlying platform, like how it is doing today for Windows.

A PR documenting this would be welcome.

tehsausage commented 4 years ago

I'm going to switch the title of the bug around since the problem that needs to be addressed is the lack of consistency in filename handling on different platforms.

Some discussion happened on IRC and here's a basic summary of how filenames in Allegro work on different platforms right now:

Filenames on Linux in Allegro are not treated specially as UTF-8 strings, however as most Linux systems are running with a UTF-8 charset locale, filenames are effectively UTF-8 as far as most people are concerned, and thus:

It is possible to have a Linux system using a non UTF-8 charset locale, however, and the result is:

Filenames on Windows in Allegro ARE treated as if they were as UTF-8 strings (despite the system charset never actually being UTF-8), but Allegro does conversion to/from UTF-8 transparently, and the result is:


The things that would break if Allegro's filenames were to always be interpreted as UTF-8 strings via conversion where required:

Firstly: An Allegro program might be taking filenames from external sources (argv, for example); and passing them through to al_fopen():

Secondly: An Allegro program built for Linux will potentially change behavior depending on the current locale's character set when it is executed. This is a big deal because a program extracted on a system with a UTF-8 locale becomes potentially incompatible with a program extracted on a non-UTF-8 locale, and vice-versa. This is something that "just works" currently because of al_fopen()'s current "transparency" to filename encoding.

Thirdly: If the UTF-8 strings had to be valid, then Allegro programs running on Linux with a UTF-8 locale would become unable to use any files containing filenames which are invalid UTF-8 (such as a user drag+dropping a weirdly named image file in to an Allegro program that wishes to use al_load_bitmap or something on it). Most programs on Linux do not have this issue, because they treat filenames transparently.

The third one is the only real regression that a non-obscure system might experience. It could be avoided by not requiring that filenames necessarily be valid UTF-8 strings, but then philosophically, that would break the desired situation that character arrays passed to Allegro should always be (valid) UTF-8.


My current opinion:

As the "majority platform" (Windows) already cannot safely pass externally acquired filenames (e.g. argv[1]) directly through to al_fopen. Breaking the equivalent code on non-UTF-8 Linux systems (by having filenames passed to Allegro converted from UTF-8 to native strings) is probably not such a big deal, as that code was never portable to begin with. It would definitely be worthy of a release notes mention and a version bump, though.

However, going out of the way to break the ability to open some files on UTF-8 Linux systems (where its possible for filenames to exist that do not have a valid UTF-8 representation) by explicitly making al_fopen non-transparent, should be avoided if it can be determined that "all strings passed in/out of Allegro are valid UTF-8 strings" is not a desirable requirement. This would really just be a "theoretically broken according to the rules" kind of deal, unless someone adds assert(is_valid_utf8(filename)) to various parts of Allegro.

By poking a hole in the UTF-8-ness requirements of Allegro, and saying that strings do not necessarily have to be valid UTF-8, we would instead document that certain functions may have unexpected/unpredictable behavior if passed invalid UTF-8 strings. But it allows for UTF-8 locale Linux to continue to be able to work with all possible filenames.

Note: This is not an issue on Windows, because filesystem paths on Windows are always addressable by a sequence of Unicode characters.


My proposed changes to Allegro are:

Note that having al_setnative[file/fs]_interface() functions would imply that invalid UTF-8 strings should be allowed to pass through the Allegro library.

Allowing invalid UTF-8 strings probably has implications for things like Rust/Python wrappers.

Alternatively, if invalid UTF-8 is to be entirely disallowed:

PhysFS probably should be considered as well, since it is able to provide a view of the filesystem to Allegro, as well as provide access to archive formats where filenames may not necessarily be Unicode, even on Windows. I do not believe that the "All strings flowing through Allegro must be UTF-8" idea does hold, or should hold, when considering PhysFS or other potential custom file and filesystem interfaces.

beoran commented 4 years ago

First of all, about filenames, or other strings received through external sources: I think it's up to the programmer to make sure they are converted to UTF-8, since allegro can't know that a char * is a valid utf-8 string or not.

Linux system using a non UTF-8 charset locale are something that we didn't consider, in that case conversions in the underlying implementation of Allegro functions might be needed, although I assume nowadays few people are using such a set up of allegro5 + Linux + non-UTF-8 locale.

If you want to work on this I suggest you start with documenting the current behaviour. As a second step would be fixing the Allegro API so it works correctly on non UTF-8 Linux.

rmbeer commented 3 years ago

It works perfectly on Linux, UTF-8 was designed to be compatible with ASCII, or size in byte, and especially with "char*".

This is more of a Windows problem, I suggest changing the label of the issue.