thomasokken / free42

Free42 : An HP-42S Calculator Simulator
https://thomasokken.com/free42/
GNU General Public License v2.0

Pasting numbers in Windows #48

Closed (salvis closed this issue 2 years ago)

salvis commented 2 years ago

We've talked about the decimal separator and digit grouping before, and I accept your stance on keeping Free42 the same as the HP original.

In my locale, numbers are formatted as 1'234'567.89, and that's what my Windows settings reflect. However, because HP supports only two combinations of decimal separator and digit grouping (dot with comma, or comma with dot), I chose the more prominent comma as my decimal point some 45 years ago. I work in FIX mode most of the time, and the comma serves as a constant reminder to me that the punctuation is backwards. I've learned to live with that.

However, when I paste "1'234'567.89" into Free42, what do I get? — I get "1,00"! This is, er, less than ideal.

Actually, it's a real pain having to paste my number somewhere, remove the quotes, copy and paste it again, and what do I get? — "123.456.789,00"! I have to change my decimal point into a comma, too.

I understand that this is working exactly as you designed it, but you're not constrained by HP's prior art here, and you could do much better. Here's my proposal:

  1. When you get a string from the clipboard, check the environment for the digit grouping character and remove all of these.
  2. Parse the number as you do now, but interpret the first comma, dot, or any other characters listed on https://en.wikipedia.org/wiki/Decimal_separator as the decimal separator.
  3. When copying a number to the clipboard, format it according to the number format defined in the environment.
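
Steps 1 and 2 of the proposal could be sketched roughly as follows. This is only an illustration of the idea, not Free42 code: the function name `parse_pasted_number` and its parameters are hypothetical, the grouping character is passed in rather than read from the environment, and only comma and dot are recognized as decimal separators.

```python
from decimal import Decimal


def parse_pasted_number(text, grouping_char):
    """Hypothetical helper: strip the grouping character, then treat the
    first remaining comma or dot as the decimal separator."""
    # Step 1: remove every occurrence of the locale's grouping character.
    cleaned = text.strip().replace(grouping_char, "")
    # Step 2: normalize the first comma or dot to a dot.
    for i, ch in enumerate(cleaned):
        if ch in ".,":
            cleaned = cleaned[:i] + "." + cleaned[i + 1:]
            break
    # Decimal raises InvalidOperation if stray separators remain.
    return Decimal(cleaned)


print(parse_pasted_number("1'234'567.89", "'"))
print(parse_pasted_number("1.234.567,89", "."))
```

Note that this sketch depends entirely on being told the correct grouping character; if the dot is the grouping character and the text uses it as a decimal point, the result is silently wrong.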

I believe this would make a lot of people happy, and I can't think of any reason why anyone would prefer the current design.

thomasokken commented 2 years ago

I did consider using the system's locale for parsing the clipboard contents while pasting, and formatting numbers while copying.

I rejected that option because that isn't always what's needed. For example, I spend a lot of my time programming, and in most programming languages, the decimal is the period, regardless of locale, and so, with my locale set to the Netherlands, I would get the wrong decimal when copying and pasting between Free42 and my source code.

The current arrangement has the advantage of being easy to understand: what you see is what you get, always. Any deviation from that behavior would have to be a user-selectable option.

As I understand it, most problems go away as long as the setting of flag 28 matches the system's locale, and Free42 tries to help with that by setting flag 28 accordingly on cold start.

The only remaining issue would appear to be the thousands separators while pasting. That could be dealt with in a compatible manner by checking if the decimal defined by the locale matches the decimal defined by flag 28, and if so, accept the locale's thousands separators instead of "period in RDX, mode and comma in RDX. mode."
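
A minimal sketch of that compatibility rule, assuming the calculator's current decimal character (as selected by flag 28) and the locale's separators are already available as single characters. The function name `accepted_grouping_char` is hypothetical and does not come from the Free42 source.

```python
def accepted_grouping_char(calc_decimal, locale_decimal, locale_grouping):
    """Hypothetical helper: pick the thousands separator to accept on paste.

    calc_decimal is "." in RDX. mode and "," in RDX, mode.
    """
    if calc_decimal == locale_decimal:
        # Calculator and locale agree on the decimal, so the locale's
        # grouping character can be accepted without ambiguity.
        return locale_grouping
    # Otherwise fall back to the existing behavior:
    # period in RDX, mode and comma in RDX. mode.
    return "," if calc_decimal == "." else "."


print(accepted_grouping_char(".", ".", "'"))
print(accepted_grouping_char(",", ".", "'"))
```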

thomasokken commented 2 years ago

It looks like the conversions that take place in the logic for copying and pasting scalars and matrices are reasonably well separated from the ones used while copying and pasting programs and lists, so I could add an option to do locale-aware conversions for only those types, without breaking the other formats.

Handling ' shouldn't be an issue once I generalize the number scanning and conversion logic. Handling "thin space" or even regular space as thousands separators might be a bridge too far, though. I'll have to check the logic that deals with all the different complex number formats to see how well it can handle those variations...

salvis commented 2 years ago

That would be fantastic, thank you very much for considering it!

Even though I've been a software developer for all my life, I've hardly ever had a need for pasting calculated numbers, and even rarer floating point numbers, into source code. That's different for you, of course, I realize that, but you're not exactly in mainstream software development. :-)

Or you could get really fancy and have Shift-Ctrl-C for copying a user-defined number format to the clipboard.

thomasokken commented 2 years ago

https://github.com/thomasokken/free42/compare/4cba40582af61dee591fba3028ae6d73efcfd6c7...b4ae7c246933596c7f697addb75662bd06e5b604

salvis commented 2 years ago

Wow, thank you so much!

thomasokken commented 2 years ago

No problem!

I realized, after creating the release, that there is a potential problem with the new logic in the Windows and Linux versions, namely, they take only the first byte of the decimal and thousands separators, as returned by the locale APIs. This is fine when those characters are simple ASCII period, comma, space, or single quote, but it probably won't do the right thing if the thousands separator is the Unicode thin space, U+2009. The Win32 API documentation doesn't mention how this is handled, or if it is even an allowed value at all.

The macOS, iOS, and Android versions should be able to handle the thin space or other Unicode characters, since they use Unicode-compatible APIs; that is, the locale information is passed in variables of type NSString or java.lang.String, which provide Unicode support out of the box.

The new logic in Free42 and Plus42 will accept all kinds of spaces while parsing numbers (see the ascii2hp() function for details), and will always generate U+2009 when formatting numbers if the thousands separator is a space. But it will be interesting to see whether all of this works in Windows and Linux environments where the locale has U+2009 as the thousands separator.
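
To illustrate the two points above (the first-byte problem, and accepting several spaces while emitting only U+2009), here is a rough sketch. The set of accepted space characters is an assumption chosen for the example; the set actually handled by ascii2hp() is not shown in this thread, and all names below are hypothetical.

```python
THIN_SPACE = "\u2009"
# Assumed set of space-like grouping characters, for illustration only.
SPACE_SEPARATORS = {" ", "\u00a0", "\u2009", "\u202f"}


def first_byte_survives(ch):
    """True if keeping only the first UTF-8 byte still yields ch."""
    first = ch.encode("utf-8")[:1]
    try:
        return first.decode("utf-8") == ch
    except UnicodeDecodeError:
        return False


def strip_space_grouping(text):
    """Drop any accepted space character before parsing the digits."""
    return "".join(ch for ch in text if ch not in SPACE_SEPARATORS)


def group_with_thin_space(digits):
    """Group digits in threes from the right, joined by U+2009."""
    groups = []
    while len(digits) > 3:
        groups.insert(0, digits[-3:])
        digits = digits[:-3]
    groups.insert(0, digits)
    return THIN_SPACE.join(groups)


print(first_byte_survives(","), first_byte_survives(THIN_SPACE))
print(strip_space_grouping("1\u00a0234 567"))
```

A comma or period survives truncation to one byte because it is a single byte in UTF-8, whereas U+2009 encodes as three bytes and its first byte alone is not valid UTF-8.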

thomasokken commented 2 years ago

Ah, my bad, the Win32 API docs mention that the encoding depends on the function you call, GetLocaleInfoA() or GetLocaleInfoW(). Free42 and Plus42 should be calling GetLocaleInfoW() and then performing the UTF-16 to UTF-8 conversion, whereas right now they're calling GetLocaleInfoA() and performing no conversion. This means that a Windows-1252 encoded string is fed to ascii2hp(). When space is selected as the thousands separator, GetLocaleInfoA() returns 0xa0, and ascii2hp() drops that because it's not valid UTF-8. Oops.

I'll fix that. Fortunately the fix is pretty simple.

Now to figure out what the situation is in Linux...
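
The 0xa0 problem described above can be reproduced outside of Windows. The sketch below uses Python's codecs rather than the Win32 API, so it only demonstrates the encoding mismatch, not the actual GetLocaleInfoA() or GetLocaleInfoW() calls; the function names are made up for the example.

```python
def decodes_as_utf8(raw):
    """True if the byte string is valid UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def cp1252_to_utf8(raw):
    """Re-encode a Windows-1252 byte string as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


separator = b"\xa0"  # no-break space in Windows-1252

print(decodes_as_utf8(separator))   # a lone 0xa0 is not valid UTF-8
print(cp1252_to_utf8(separator))    # the same character as UTF-8 bytes
```

In Windows-1252 the byte 0xa0 is the no-break space U+00A0, but in UTF-8 a lone 0xa0 is a continuation byte with nothing to continue, which is consistent with it being dropped by a parser that expects UTF-8.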

salvis commented 2 years ago

Yes, but why do you want to convert to UTF-8? Especially since you want to access 4 code points by indexing?

In UTF-8 any code point within the ASCII range will take one byte, and outside the ASCII range two, three, or more bytes. That is a royal pain to work with. Why don't you stick with UTF-16? In UTF-16 you're dealing with wchar_t, which is always 2 bytes.

Also, in Win32, when getting from the clipboard, if you want to support Unicode, you specifically have to use GetClipboardData(CF_UNICODETEXT), which again gives you wchar_t. And for posting, SetClipboardData(CF_UNICODETEXT, hMem), encoded as wchar_t.
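
For reference, the encoded sizes under discussion can be checked with a small sketch. It uses Python's codecs, not wchar_t, and the helper name `encoded_lengths` is hypothetical. The separators mentioned in this thread all fit in one 16-bit UTF-16 unit, although code points outside the Basic Multilingual Plane need two.

```python
def encoded_lengths(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for name, ch in [
    ("comma", ","),
    ("no-break space", "\u00a0"),
    ("thin space", "\u2009"),
    ("non-BMP example", "\U0001d7d8"),
]:
    print(name, encoded_lengths(ch))
```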

thomasokken commented 2 years ago

I don't want to use Unicode internally, but I do need to deal with the possibility that the separator character could be set to a non-ASCII character in the locale. The 3.0.12 code doesn't deal with that in the Windows and Linux versions.

It's a bit complicated to explain, but take a look at the code changes I just made to deal with the issue, they're pretty straightforward:

https://github.com/thomasokken/free42/compare/fc4cabcf052f7c37eb009163b47fb1740b4eea37...e09dd90935b45ee6b9b6f675007ad2abd4d1ff7d

thomasokken commented 2 years ago

The core uses the HP-42S character set; the core/shell interface uses UTF-8, and the core converts between the two using the hp2ascii() and ascii2hp() functions. (The names of those functions made more sense years ago, before I started adding Unicode support.) And in the Windows version, there is an additional conversion step, between UTF-8 and UTF-16, because the Windows APIs use UTF-16.

It's actually a bit more complicated than that, because core_copy() and core_paste() need to be able to deal with separator characters that have no equivalents in the HP-42S character set (tab, CR, Unicode spaces), which makes the encoding and decoding in those functions a bit messy.

salvis commented 2 years ago

I haven't done any C programming for 40 years, but I think you could save yourself a lot of head scratching by using native APIs. Win32 has https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getnumberformatex and some hints here: https://stackoverflow.com/questions/68957018/whats-the-proper-way-to-use-numberfmt-with-getnumberformatex https://www.codeproject.com/Questions/1173841/How-to-internationalisation-i-n-an-number-in-vcplu

I'm sure Linux also has something along these lines. Rolling your own is a challenge, because it's hard to enumerate all the variants, and it's even harder to tell whether they look OK without having a native OS.

> And in the Windows version, there is an additional conversion step, between UTF-8 and UTF-16, because the Windows APIs use UTF-16.

Well, all the parts are in UTF-16 as well as the end result. So, why not do it all in UTF-16 instead of converting all the parts to UTF-8 and suffering all that pain?

For pasting into Free42, try GetClipboardData() with CF_TEXT rather than CF_UNICODETEXT. That leaves the heavy lifting to Windows and you'll get nice SBCS characters. https://stackoverflow.com/questions/14762456/getclipboarddatacf-text

thomasokken commented 2 years ago

Using Windows APIs for number formatting and parsing isn't useful, for two reasons: First, this isn't a Windows app, it's a cross-platform app that currently runs on five different operating systems. Unless all five offer equivalent functionality, it tends to be easier to put the logic in the Free42 core. And second, the number formatting and parsing functions provided by operating systems and libraries don't support bid128.

thomasokken commented 2 years ago

Same goes for using UTF-16 internally. Sure, it would simplify the Windows version a bit, but all the logic I would be able to remove there, I would have to add back four times, in the other versions.

thomasokken commented 2 years ago

Passing CF_TEXT to GetClipboardData() instead of CF_UNICODETEXT doesn't work because that gives you text in Windows-1252 encoding, and there are several characters in the HP-42S character set that don't exist in Windows-1252. You really need full Unicode support for copy and paste.
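
A quick way to see the limitation is to test which characters Windows-1252 can represent at all. The sample characters below are assumptions picked for illustration (symbols of the kind found on the HP-42S); the calculator's full character table is not reproduced here, and `fits_cp1252` is a hypothetical helper.

```python
def fits_cp1252(ch):
    """True if the character can be encoded in Windows-1252."""
    try:
        ch.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


# Assumed sample of calculator-style symbols, for illustration only.
for ch in ["\u00f7", "\u03c0", "\u2192", "\u2264"]:  # ÷ π → ≤
    print(ch, fits_cp1252(ch))
```

Any character for which this returns False would be lost or substituted on a CF_TEXT round trip, which is why CF_UNICODETEXT is needed.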

thomasokken commented 2 years ago

The good news is that supporting different separators and different digit group sizes is actually very easy. The modifications I had to make to my number parsing and formatting code were very minor. Most of the work went into adding the check boxes to the preferences dialogs, and writing the code that reads the locale information and rearranges it in a uniform manner so the core can use it. I think I have everything covered now, except for using Arabic digits in Arabic locales and things like that. The decimal characters . and , and the grouping characters . , ' space are all taken care of, and even the weird Indian group sizes.
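
As an illustration of why variable group sizes are a small change, here is a minimal sketch of a grouping routine parameterized by separator and group sizes. It is not the Free42 implementation; the function name and parameters are hypothetical, and it handles only the integer digits of a number.

```python
def group_digits(digits, separator, first_group=3, other_groups=3):
    """Insert separators into a digit string, counting from the right.

    first_group is the size of the rightmost group; other_groups is the
    size of every group to its left (3 and 2 for Indian-style grouping).
    """
    if len(digits) <= first_group:
        return digits
    head, tail = digits[:-first_group], digits[-first_group:]
    groups = [tail]
    while len(head) > other_groups:
        groups.insert(0, head[-other_groups:])
        head = head[:-other_groups]
    groups.insert(0, head)
    return separator.join(groups)


print(group_digits("1234567", "'"))
print(group_digits("1234567", ",", first_group=3, other_groups=2))
```

The same loop covers uniform groups of three and the Indian 3-then-2 pattern; only the two size parameters differ.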