irmen / prog8

high level programming language and compiler targeting 6502 machines such as the C-64 and CommanderX16
https://prog8.readthedocs.io/

Support for new encodings #129

Closed adiee5 closed 7 months ago

adiee5 commented 8 months ago

Recently, new encodings were introduced in the Commander X16. They will be accessible in the r47 release through the cx16.screen_set_charset() kernal function. Two of them are ISO-based: iso-8859-5 for Cyrillic script and iso-8859-16 for Eastern/Balkan Latin orthographies (it supports Croatian, Polish, Hungarian, etc.).

Iso-16 is really just iso-15 with some characters replaced by ones that didn't exist in iso-15. Therefore, it should probably be handled by the iso: encoding; you probably just need to add the new Unicode values to the lookup tables you use for utf-8 -> iso conversions. The Cyrillic one might not be as clean, because Cyrillic iso-5 has the '§' symbol, which is also present in iso-15 but at a different code point. So I guess you could either make § always convert to the iso-15 code point (which will produce garbage in Cyrillic), make prog8 infer which code point to use from the other letters, or make a separate encoding. It's up to you to decide which approach to take for this one.
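To make the '§' clash concrete, here is a minimal sketch in Python. It is an illustration only, not how prog8 performs the conversion: it uses Python's standard-library codecs `iso8859_5`, `iso8859_15` and `iso8859_16`, and the helper name `code_point_in` is made up for this example.

```python
# Illustration only: compare where the section sign lands in each ISO charset,
# using Python's standard-library codecs (not prog8's own conversion).
SECTION = "\u00a7"  # the section sign

def code_point_in(charset: str, ch: str) -> int:
    """Return the single byte that `ch` encodes to in `charset`."""
    encoded = ch.encode(charset)
    assert len(encoded) == 1
    return encoded[0]

iso15_section = code_point_in("iso8859_15", SECTION)
iso5_section = code_point_in("iso8859_5", SECTION)

# The character the iso-15 byte would display as while the Cyrillic charset is active.
garbage_in_cyrillic = bytes([iso15_section]).decode("iso8859_5")

# Byte values in the upper half where iso-15 and iso-16 assign different characters.
iso15_vs_iso16 = [
    b for b in range(0xA0, 0x100)
    if bytes([b]).decode("iso8859_15") != bytes([b]).decode("iso8859_16")
]

print(f"section sign in iso-15: ${iso15_section:02X}, in iso-5: ${iso5_section:02X}")
print(f"iso-15 byte read as iso-5: U+{ord(garbage_in_cyrillic):04X}")
print(f"bytes differing between iso-15 and iso-16: {len(iso15_vs_iso16)}")
```

The list of differing bytes is the part that a merged iso: encoding would have to disambiguate, which is the trade-off discussed above.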

The last encoding that was added is cp437. This was the default encoding on the original IBM PC. It was added to the X16 for retro networking and compatibility with software written for the IBM PC. As I'm writing this, no kernal functionality has been implemented for it yet, and because of how cp437 is laid out, it's probably dangerous to actually use this encoding (or at least using characters in the ranges $01-$1F and $80-$9F is dangerous; I think you know exactly why). x16-rom refers to it internally as ANSI. I don't think this name is really representative (especially since ANSI is usually used to refer to the windows-125x encodings), so I guess you could call it ibm: or something, or cp437:.
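As a rough illustration of the ranges named above, the following Python sketch flags characters whose cp437 byte falls in $01-$1F or $80-$9F. It uses Python's standard `cp437` codec; the names `RISKY_RANGES` and `risky_cp437_bytes` are hypothetical and not part of prog8 or the X16 ROM.

```python
# Illustration only: flag characters whose cp437 byte falls in the ranges
# the comment above calls dangerous ($01-$1F and $80-$9F).
RISKY_RANGES = (range(0x01, 0x20), range(0x80, 0xA0))

def risky_cp437_bytes(text: str) -> list:
    """Return (character, byte) pairs that encode into a risky range."""
    flagged = []
    for ch in text:
        byte = ch.encode("cp437")[0]
        if any(byte in r for r in RISKY_RANGES):
            flagged.append((ch, byte))
    return flagged

flagged = risky_cp437_bytes("Ça va? Über!")
for ch, byte in flagged:
    print(f"U+{ord(ch):04X} -> ${byte:02X}")
```

A check like this could serve as a warning for string literals, assuming one wanted to catch such characters before they reach the screen.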

irmen commented 8 months ago

The iso encoding really is ISO-8859-15; it uses Kotlin's (or Java's, if you will) character encoding, so there's no translation table in the compiler for it.

If we would like to support the newer encodings, each of them will need its own unique name. (And I guess the other 2 iso encodings you mentioned are also available as built-in charsets in Kotlin.)

adiee5 commented 8 months ago

> The iso encoding really is ISO-8859-15; it uses Kotlin's (or Java's, if you will) character encoding, so there's no translation table in the compiler for it.

Oh, ok, makes sense

> (And I guess the other 2 iso encodings you mentioned are also available as built-in charsets in Kotlin.)

Yeah, all three should be, as the first two are ISO standards and the third was widely used too.

But if you rely on Kotlin's converters, how do you handle alternative glyphs? If I remember correctly, prog8 can map multiple characters to the same 8-bit code (or maybe that's the case only in petscii? 🤔)

adiee5 commented 8 months ago

Anyway, if we're going with names, then the Cyrillic one should be named isocyr, as it says exactly what it is. For the Balkan/Eastern Latin one, the best choice is probably iso16, as it's unambiguous (I still think it should be merged with iso, because there's barely any difference, but you can add it as a separate encoding if you prefer it that way).

For the third, I don't know. The official cx16 documentation calls it ANSI, but that's really ambiguous and doesn't even make sense. I mean, I remember the Windows encodings being referred to as ANSI. We may call it ibm or msdos, I guess, or even use its actual name, cp437.

irmen commented 7 months ago

So how do we tell the screen editor that it should use the new encodings?

I mean, for petscii there is CTRL-N to switch to petscii lowercase; for ISO there is CTRL-O to switch to ISO.

Without this, just printing text to the screen (and reading it back) will not yield the correct results, because the character codes still go through the wrong translation table.

adiee5 commented 7 months ago

We have to be in ISO mode, then we execute cx16.screen_set_charset(az,0), where az corresponds to a font id: 7 loads an IBM font, 8 the Cyrillic ISO one, 10 the Balkan/Eastern Latin ISO one. The new ISO charsets will just work with the existing kernal and prog8 APIs (except file system I/O), as they behave exactly the same way as the ISO 15 codepage and therefore don't require additional handling. CP437 will kinda work, but not reliably, as the kernal currently isn't really aware of CP437 (it's not really aware of the other new encodings either, but as I said, they don't require additional handling in the kernal except for file system I/O).

additionally, there is also an extapi function called iso_cursor_char, which ideally should be used after changing the charset to cp437, so that the cursor doesn't look like the letter f. ISO encodings obviously don't have that problem.

irmen commented 7 months ago

I've just pushed the change that adds these new encodings: cp437, iso5 and iso16, including a new example program, cx16/charsets.p8, that uses them.
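For readers following along, here is a rough Python sketch of what such an encoding prefix conceptually does to a string literal. It is not the compiler's implementation: Python's standard codecs stand in for the Kotlin charsets, the names `ENCODINGS` and `encode_literal` are invented for this example, and the charset ids are the ones quoted earlier in this thread (7, 8, 10).

```python
# Illustration only: what an encoding prefix conceptually does to a string
# literal. Python codecs stand in for the compiler's Kotlin charsets.
# Charset ids are the ones quoted earlier in this thread (7, 8, 10).
ENCODINGS = {
    # prefix: (python codec, id passed to cx16.screen_set_charset)
    "cp437": ("cp437", 7),
    "iso5": ("iso8859_5", 8),
    "iso16": ("iso8859_16", 10),
}

def encode_literal(prefix: str, text: str) -> tuple:
    """Return (encoded bytes, charset id) for a prefixed string literal."""
    codec, charset_id = ENCODINGS[prefix]
    return text.encode(codec), charset_id

data, charset_id = encode_literal("iso5", "Привет")
print(charset_id, data.hex(" "))
```

The point of pairing each prefix with a charset id is that the encoded bytes only display correctly once the matching font has been selected on the X16.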

adiee5 commented 7 months ago

to be fair, iso5 could be called isocyr as that's still unambiguous, but anyways, it's still nice

irmen commented 7 months ago

we can still vote on the prefix names! :)