Add optional UTF-8 Display/File character support.

taviso commented 2 years ago

Lotus 1-2-3 predates UTF-8 and uses LMBCS internally, which is sort of a precursor to Unicode.

I see no reason we couldn't add a UTF-8 option for the file/display charset, for better i18n support. It already supports character set translation; we just have to teach it how, and figure out the CBD (character bundle) format. I already know the BDLREC format from my lotusdrv project: it's basically a TLV (tag, length, value) encoding system.
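
For anyone unfamiliar with TLV, here is a minimal sketch of walking a buffer of tag/length/value records. The field widths (2-byte little-endian tag and length) and the name `parse_tlv` are assumptions made for illustration only; the actual BDLREC layout is not described in this thread and may well differ.

```python
import struct


def parse_tlv(buf: bytes):
    """Yield (tag, value) pairs from a buffer of TLV records.

    Assumed layout (hypothetical, NOT the documented BDLREC format):
    2-byte little-endian tag, 2-byte little-endian length, then `length` bytes.
    """
    offset = 0
    while offset < len(buf):
        if offset + 4 > len(buf):
            raise ValueError("truncated record header")
        tag, length = struct.unpack_from("<HH", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError("truncated record value")
        yield tag, buf[offset:offset + length]
        offset += length


# Two made-up records: tag 1 with a 3-byte value, tag 2 with an empty value.
sample = struct.pack("<HH", 1, 3) + b"abc" + struct.pack("<HH", 2, 0)
records = list(parse_tlv(sample))
```

The point is only that each record announces its own length, so a reader can skip tags it does not understand.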

sjuswede commented 2 years ago

I sometimes work with Chinese and Japanese text, and if this could get working, I'd be extremely happy.

Right now not even Swedish characters like åäö work for me in rxvt-unicode or XTerm when I try to enter them. When I import a CSV containing them (in UTF-8), they predictably get stripped out.

taviso commented 2 years ago

Yeah, not great right now, I can't even use £ lol. I looked at the code a bit today, I think I can make a few improvements easily, some might be harder though!

Internally, Lotus uses LMBCS, which is actually pretty impressive foresight considering Unicode hadn't been invented yet and everyone else was using code pages. This is good, because internally it can tell the difference between å, ä, and a.

You can see it knows about å, and calls it a ring:

https://archive.org/details/lotus-1-2-3-release-3.1-reference/Lotus%201-2-3%20Release%203.1%20-%20Reference/page/n637/mode/2up

It stores these characters correctly but doesn't know how to display them, so right now it uses a "fallback" ASCII character translation table (å => a, £ => L, and so on). That actually seems pretty easy to solve: I'll just add an LMBCS => UTF-8 table, then pass the result to waddch() instead.
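
To make the difference concrete, here is a rough sketch of the two table lookups side by side. The byte values are an assumption (taken from code page 850 for illustration); the real LMBCS tables, and the names `ASCII_FALLBACK`, `LMBCS_TO_UTF8` and `render`, are hypothetical and not from the 123elf source. How the result reaches the terminal through curses is left out.

```python
# Assumption: single-byte values borrowed from code page 850 for illustration.
# The real LMBCS tables in Lotus 1-2-3 may use different values.
ASCII_FALLBACK = {0x86: "a", 0x9C: "L"}          # å => a, £ => L
LMBCS_TO_UTF8 = {0x86: "å", 0x84: "ä", 0x94: "ö", 0x9C: "£"}


def render(cell: bytes, table: dict) -> str:
    """Translate stored bytes to display text using the given table."""
    out = []
    for b in cell:
        if b < 0x80:
            out.append(chr(b))             # plain ASCII passes through
        else:
            out.append(table.get(b, "?"))  # unknown bytes become '?'
    return "".join(out)


cell = bytes([0x86, 0x9C, ord("1")])
fallback_text = render(cell, ASCII_FALLBACK)
utf8_bytes = render(cell, LMBCS_TO_UTF8).encode("utf-8")
```

The stored bytes are the same in both cases; only the table used at display time changes.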

I'll give it a shot this weekend.

taviso commented 2 years ago

I think display and keyboard input might be easy, but the question is what to do with /File Import: always assume UTF-8? I guess we could have an environment variable like $LOTUS_IMPORT_CHARSET or whatever.

sjuswede commented 2 years ago

An environment variable would of course be great for legacy files. I would default to UTF-8, since that is standard in Linux today. It's a lot of work to set a normal distro to use anything else. But there are a lot of legacy files out there, and many systems which still spit out very strange formats. Don't ask me how I know.
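
Roughly what I have in mind, as a sketch only: read the proposed `$LOTUS_IMPORT_CHARSET` variable, fall back to UTF-8 when it is unset or empty, and decode the imported bytes with it. The function names are made up for illustration, and the `latin-1` legacy file is just an example value.

```python
import os


def import_charset(environ=os.environ) -> str:
    """Return the charset for /File Import, defaulting to UTF-8."""
    # LOTUS_IMPORT_CHARSET is the variable name proposed above; unset or
    # empty means "assume UTF-8".
    return environ.get("LOTUS_IMPORT_CHARSET") or "utf-8"


def decode_import(raw: bytes, environ=os.environ) -> str:
    """Decode imported bytes, replacing undecodable sequences."""
    return raw.decode(import_charset(environ), errors="replace")


# Default case: a UTF-8 file with no variable set.
utf8_text = decode_import("åäö".encode("utf-8"), environ={})

# Legacy case: a Latin-1 file with the variable set explicitly.
legacy_text = decode_import(
    "åäö".encode("latin-1"),
    environ={"LOTUS_IMPORT_CHARSET": "latin-1"},
)
```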

taviso commented 2 years ago

Okay, I think I've got a plan. I have an easy temporary improvement, and a plan for a harder complete solution.

I can change the keymap code to translate UTF-8 on input to all the supported LMBCS characters. There are no collisions (I checked), so this will be super easy; I can do this in a day or two.

This is easy but not a complete solution (there's no CJK, for a start), but it is better than nothing: most of the Latin Extended characters are covered (so I'll get £, you'll get all the Swedish characters, things like éßçñ are all there). There is no €, but it has ¤, and it seems pretty safe to just steal that for € for now? I don't know.

The complete solution will be adding LMBCS<->UTF-8 charset support, but this is a much bigger job.
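
A small sketch of the input side: invert a character table, refuse to build it if two codes map to the same character (the collision check), and alias € to ¤. The table is a tiny hypothetical subset with byte values assumed from code page 850; `build_input_map` and `translate_key` are illustrative names, not the real keymap code.

```python
# Hypothetical subset; byte values assumed from code page 850 for illustration.
LMBCS_TO_CHAR = {0x86: "å", 0x84: "ä", 0x94: "ö", 0x9C: "£", 0xCF: "¤"}


def build_input_map(table: dict) -> dict:
    """Invert the table, failing loudly if two codes share a character."""
    reverse = {}
    for code, ch in table.items():
        if ch in reverse:
            raise ValueError(f"collision on {ch!r}")
        reverse[ch] = code
    return reverse


INPUT_MAP = build_input_map(LMBCS_TO_CHAR)
# There is no euro sign in the table, so borrow the generic currency sign.
INPUT_MAP["€"] = INPUT_MAP["¤"]


def translate_key(utf8_bytes: bytes):
    """Map one UTF-8 encoded character to its code, or None if unsupported."""
    ch = utf8_bytes.decode("utf-8")
    return INPUT_MAP.get(ch)
```

Anything outside the table (CJK, for example) comes back as `None`, which is exactly the gap the complete solution has to close.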

krackout commented 1 year ago

> The complete solution will be adding LMBCS<->UTF-8 charset support, but this is a much bigger job.

@taviso If I can help in any way, I'm willing, especially regarding Greek. It may be a waste of time to involve me in the programming, but I could help more easily with conversion tables, I suppose.

taviso commented 1 year ago

Thank you! I'm slowly working on this, it will work eventually! 😆

taviso / 123elf

Add optional UTF-8 Display/File character support. #73