Character encoding conversion for OS3

BSzili commented 1 year ago

Since OS3 can't deal with UTF-8 file names it would be useful to have some kind of conversion inside filesysbox.library to handle at least a subset of international characters. I thought about using codesets.library, but it's probably too heavy for low end systems, and ISO-8859-1 <-> UTF-8 conversion should be sufficient in 90% of the cases. For example this simplified strlcpy variant (void return) that also does a Latin-1 to UTF-8 conversion:

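static void strlcpy_lat1_utf8(char *dst, const char *src, size_t size)
{
    size_t len = 0;

    /* no room for even the terminator */
    if (size == 0)
        return;

    for (; *src; src++)
    {
        unsigned char c = (unsigned char)*src;

        if (!(c & 0x80))
        {
            /* ASCII is copied unchanged */
            len++;
            if (len >= size)
                break;
            *dst++ = (char)c;
        }
        else
        {
            /* 0x80..0xff becomes a two-byte sequence: 110000xx 10xxxxxx */
            len += 2;
            if (len >= size)
                break;
            *dst++ = (char)(0xc0 | (c >> 6));
            *dst++ = (char)(0x80 | (c & 0x3f));
        }
    }
    *dst = '\0';
}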

What do you think? I could implement the reverse as well, and do the conversions when FBXF_ENABLE_UTF8_NAMES is enabled.
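For illustration, the reverse direction mentioned above could look roughly like the sketch below. It is not code from filesysbox: the name strlcpy_utf8_lat1 and the decision to replace unmappable characters and malformed sequences with '?' are assumptions, and the '?' fallback is lossy, so the original UTF-8 name cannot be recovered from the result.

```c
#include <stddef.h>

/* Hypothetical counterpart to strlcpy_lat1_utf8(): copies the UTF-8 string
 * src into dst as Latin-1 and NUL-terminates it whenever size > 0.
 * Characters outside U+0000..U+00FF and malformed sequences become '?'. */
static void strlcpy_utf8_lat1(char *dst, const char *src, size_t size)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t len = 0;

    if (size == 0)
        return;

    while (*s != '\0' && len + 1 < size)
    {
        unsigned char c = *s++;

        if (c < 0x80)
        {
            /* plain ASCII, copied unchanged */
        }
        else if ((c == 0xC2 || c == 0xC3) && (*s & 0xC0) == 0x80)
        {
            /* two-byte sequence encoding U+0080..U+00FF */
            c = (unsigned char)(((c & 0x03) << 6) | (*s++ & 0x3F));
        }
        else
        {
            /* unmappable or malformed: skip any continuation bytes */
            while ((*s & 0xC0) == 0x80)
                s++;
            c = '?';
        }
        dst[len++] = (char)c;
    }
    dst[len] = '\0';
}
```

With this sketch, "á" (bytes C3 A1) becomes the single byte E1, while "ű" (U+0171) has no Latin-1 equivalent and ends up as '?'.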

salass00 commented 1 year ago

The problem does not come from the local codeset to utf-8 conversion but the conversion the other way, as there you will need to find some way to deal with the characters that do not have an equivalent in the local codeset and it needs to be reversible (you need to be able to convert it back into the original utf-8).
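To make the reversibility requirement concrete, here is a minimal sketch of one possible escape scheme. The format ('%' followed by six uppercase hex digits of the code point) and the names escape_cp and unescape_cp are invented for this example and are not the format used in utf8.c. For the round trip to be unambiguous, a literal '%' in a name would have to be passed through escape_cp as well.

```c
#include <stddef.h>

#define ESC_LEN 7 /* '%' plus six hex digits */

/* Writes the hypothetical escape for code point cp into dst, which must
 * have room for ESC_LEN + 1 bytes. Returns the number of characters
 * written, not counting the terminator. */
static size_t escape_cp(char *dst, unsigned long cp)
{
    static const char hex[] = "0123456789ABCDEF";
    int i;

    dst[0] = '%';
    for (i = 0; i < 6; i++)
        dst[1 + i] = hex[(cp >> (20 - 4 * i)) & 0xF];
    dst[ESC_LEN] = '\0';
    return ESC_LEN;
}

/* Parses an escape at the start of src. On success stores the code point
 * in *cp and returns the number of characters consumed; returns 0 if src
 * does not start with a well-formed escape. */
static size_t unescape_cp(const char *src, unsigned long *cp)
{
    unsigned long value = 0;
    int i;

    if (src[0] != '%')
        return 0;

    for (i = 1; i < ESC_LEN; i++)
    {
        char c = src[i];
        unsigned long digit;

        if (c >= '0' && c <= '9')
            digit = (unsigned long)(c - '0');
        else if (c >= 'A' && c <= 'F')
            digit = (unsigned long)(c - 'A' + 10);
        else
            return 0; /* also catches a premature end of string */

        value = (value << 4) | digit;
    }
    *cp = value;
    return ESC_LEN;
}
```

A fixed-width escape keeps the decoder trivial at the cost of longer names; a more compact encoding would save space in file names with many unmappable characters.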

I've attempted to tackle this problem with the local_to_utf8() and utf8_to_local() functions you can find here (utf8.c from current filesysbox sources): https://www.dropbox.com/s/mr378nnfq6rz3bj/utf8.c?dl=0

They are neither tested nor in use yet. As with your conversion function the API is similar to strlcpy().

BSzili commented 1 year ago

I was also thinking about this, as many functions in the FUSE API expect paths, and the conversion from UTF-8 to the local charset is potentially destructive. One solution suggested to me on EAB was to maintain the original UTF-8 paths internally, which would at least allow files with international names to be accessed on the share with a truncated name. Anyway, your solution with escaping the unmappable chars looks more elegant and should be more memory friendly.

BSzili commented 1 year ago

I did some testing for utf8_to_local / local_to_utf8 and it seems to work well. For 68k it would be useful to have a special case when maptable is NULL for fast(er) remapping into ISO-8859-1. E.g. I tested it with these modifications:

if (maptable == NULL)
{
    if (unicode >= 0x80 && unicode < 0x100)
    {
        local = unicode;
    }
}
else
{
    for (i = 0x80; i < 0x100; i++)
    {
        if (maptable[i] == unicode)
        {
            local = i;
            break;
        }
    }
}

and

if (maptable == NULL)
    unicode = (unsigned char)local;
else
    unicode = maptable[local];
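Put together as standalone helpers, the two fragments above might look like the following sketch. The function names, the unsigned long table type, the -1 return value for unmappable code points and the ASCII pass-through are assumptions made for this example; the real functions in utf8.c may be structured differently.

```c
#include <stddef.h>

/* Maps a Unicode code point to a local 8-bit character.
 * maptable is assumed to hold 256 code points indexed by local character;
 * NULL selects the ISO-8859-1 fast path. Returns -1 if there is no
 * equivalent in the local charset. */
static int unicode_to_local(unsigned long unicode, const unsigned long *maptable)
{
    int i;

    if (unicode < 0x80)
        return (int)unicode; /* ASCII is assumed identical in all charsets */

    if (maptable == NULL)
        return unicode < 0x100 ? (int)unicode : -1;

    /* linear search over the upper half of the table */
    for (i = 0x80; i < 0x100; i++)
    {
        if (maptable[i] == unicode)
            return i;
    }
    return -1;
}

/* Maps a local 8-bit character to its Unicode code point. */
static unsigned long local_to_unicode(unsigned char local, const unsigned long *maptable)
{
    if (local < 0x80 || maptable == NULL)
        return local;
    return maptable[local];
}
```

The NULL case avoids the 128-entry linear search altogether, which is the speedup suggested above for 68k systems.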

salass00 commented 1 year ago

Good to hear that they are working well for you.

I've just finished adding charset conversion in all file system operations that need it in the AmigaOS 4 filesysbox. The way it works is that internally filesysbox still uses utf-8 everywhere and conversion only happens when strings are passed to or from AmigaDOS. I haven't tested it yet but hopefully it should be working fine.

salass00 commented 1 year ago

The encode_unicode() function generated the first byte wrong in 3 and 4 byte utf-8 sequences (lazy copy/paste error on my part).

It should be changed to

seq[0] = 0xE0 | ((unicode >> 12) & 0x0F);

for 3-byte sequences and

seq[0] = 0xF0 | ((unicode >> 18) & 0x07);

for 4-byte sequences.
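For context, a complete UTF-8 encoder using these lead-byte expressions could be sketched as follows. This is not the encode_unicode() from filesysbox; the name encode_codepoint, the signature and the convention of returning 0 for code points that cannot be encoded are assumptions.

```c
#include <stddef.h>

/* Encodes one Unicode code point as UTF-8 into seq (room for 4 bytes).
 * Returns the number of bytes written, or 0 for surrogates and values
 * above U+10FFFF. */
static size_t encode_codepoint(unsigned long unicode, unsigned char *seq)
{
    if (unicode < 0x80)
    {
        seq[0] = (unsigned char)unicode;
        return 1;
    }
    if (unicode < 0x800)
    {
        seq[0] = (unsigned char)(0xC0 | (unicode >> 6));
        seq[1] = (unsigned char)(0x80 | (unicode & 0x3F));
        return 2;
    }
    if (unicode < 0x10000)
    {
        if (unicode >= 0xD800 && unicode <= 0xDFFF)
            return 0; /* UTF-16 surrogates are not valid code points */
        seq[0] = (unsigned char)(0xE0 | ((unicode >> 12) & 0x0F));
        seq[1] = (unsigned char)(0x80 | ((unicode >> 6) & 0x3F));
        seq[2] = (unsigned char)(0x80 | (unicode & 0x3F));
        return 3;
    }
    if (unicode < 0x110000)
    {
        seq[0] = (unsigned char)(0xF0 | ((unicode >> 18) & 0x07));
        seq[1] = (unsigned char)(0x80 | ((unicode >> 12) & 0x3F));
        seq[2] = (unsigned char)(0x80 | ((unicode >> 6) & 0x3F));
        seq[3] = (unsigned char)(0x80 | (unicode & 0x3F));
        return 4;
    }
    return 0;
}
```

A test string needs at least one character at or above U+0800 to exercise the 3-byte branch; the euro sign (U+20AC) encodes to the bytes E2 82 AC.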

BSzili commented 1 year ago

Oops! Indeed, I used "árvíztűrő tükörfúrógép" as my test string and that only has 1- and 2-byte characters. Using UTF-8 internally makes sense, it's better to avoid conversions unless they are absolutely necessary.

salass00 commented 1 year ago

I encountered the encode_unicode() bug because I tried to create a file name containing a euro sign in addition to some other non-ASCII symbols (the euro sign is U+20AC in Unicode).

I've started working on charset conversion for the AROS/AmigaOS 3 port of filesysbox in the newly created cconv_enabled branch: https://github.com/salass00/filesysbox/commits/cconv_enabled

salass00 commented 1 year ago

Charset conversion should now be fully implemented in the cconv_enabled branch. I did some basic testing with an OS3.1 installation under fs-uae and it seemed to work (as the euro sign doesn't exist in Latin-1 it was escaped into something like %CGA0).

What still remains is code to detect the system charset and get a mapping table to unicode (this should be done in the FbxSetupFS() function).

BSzili commented 1 year ago

Very cool, I'll do some testing tomorrow! For the OS3 system charset detection I think codesets.library's method should do, e.g. using the system language: https://github.com/jens-maus/libcodesets/blob/f53c6d135332bbb6170c07300386bb7028bde522/src/init.c#L130

salass00 commented 1 year ago

While testing the AROS build I noticed that charset conversion was missing from the ExAll directory scanning code (I left it for later because it was a little more work to add). This should be fixed now in the cconv_enabled branch.

BSzili commented 1 year ago

I just tried the conversion under 3.x and it seems to work, at least with the default Latin-1 mapping. Unfortunately the character set detection doesn't work yet for Hungary/Magyarország, as loc_CountryCode apparently contains international vehicle registration codes (the ones used on license plates): https://en.wikipedia.org/wiki/International_vehicle_registration_code#Current_codes I verified this to be the case in AROS as well: https://github.com/aros-development-team/AROS/tree/master/workbench/locale/countries

The loc_LanguageName based detection also failed, but I haven't yet found the reason for this.

Edit: I think the issue is that loc_LanguageName contains the file name with the ".language" extension. I'll send a PR to fix both issues.

salass00 commented 1 year ago

Commit https://github.com/salass00/filesysbox/commit/1d50cbd3987a533bc3a4919e12dea055bc118f58 should already fix the issue with ".language" extension in loc_LanguageName.
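As an illustration of the kind of normalization involved, the sketch below strips a trailing ".language" from a name before it is compared against a list of known languages. It is not taken from the commit above; the helper name strip_language_ext, the strlcpy-like size handling and the case-sensitive suffix match are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Copies name into dst without a trailing ".language" suffix, truncating
 * to size - 1 characters and always NUL-terminating when size > 0. */
static void strip_language_ext(char *dst, const char *name, size_t size)
{
    static const char ext[] = ".language";
    const size_t extlen = sizeof(ext) - 1;
    size_t len = strlen(name);

    if (size == 0)
        return;

    /* drop the suffix only if the name actually ends with it */
    if (len >= extlen && strcmp(name + len - extlen, ext) == 0)
        len -= extlen;

    if (len >= size)
        len = size - 1;

    memcpy(dst, name, len);
    dst[len] = '\0';
}
```

With this helper, "magyar.language" is reduced to "magyar", and a name without the suffix is copied unchanged.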

BSzili commented 1 year ago

Nice. I fixed a few other issues, now the charset is detected fine for both the country and language, at least for Hungary.

BSzili commented 1 year ago

I'm closing the issue as the character conversion looks good to me.
