Improved UTF-8 support - Githubissues

benruijl commented 6 months ago

To allow for easier communication with other programs, FORM should support UTF-8 variable and expression names. Currently, it does (likely by accident) if one wraps the variable names in []:

S [xሴ];

L [Fɱ] = [xሴ]^2+2; * test π^2

Print +s;

.end

This program outputs:

FORM 4.3.1 (Apr 11 2023, v4.3.1) 64-bits         Run: Mon May 27 08:32:48 2024
    S [xሴ];

    L [Fɱ] = [xሴ]^2+2; * test π^2

    Print +s;

    .end

Time =       0.00 sec    Generated terms =          2
           [Fɱ]         Terms in output =          2
                         Bytes used      =         52

   [Fɱ] =
       + 2
       + [xሴ]^2
      ;

  0.00 sec out of 0.00 sec

As you can see, the statistics message is not properly aligned, because the number of bytes does not equal the number of UTF-8 characters. The relevant code, in message.c around line 704, is:

else if ( *s == 's' ) {
    u = va_arg(ap,char *);
    i = 0;
    while ( *u ) { i++; u++; }
    if ( i > x ) i = x;
    while ( x > i ) { *t++ = ' '; x--; }
    t += x;
    while ( --i >= 0 ) { *--t = *--u; }
    t += x;
}

where x is the desired width. One part of the fix is to make the counter i skip UTF-8 continuation bytes (while keeping track of the full byte count in a separate variable or as a pointer difference):

while ( *u ) { if ( (*u & 0xC0) != 0x80 ) i++; u++; }

but extra care has to be taken: when the input name is longer than x, the truncation must happen at the start of a Unicode character (a position satisfying (*u & 0xC0) != 0x80). I haven't had time to fix that, which is why I am posting the issue here.
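
A minimal sketch of how the two parts could fit together, written as a standalone helper rather than as a patch to message.c. The names utf8_is_start, utf8_count and put_right_utf8 are made up for illustration and do not exist in FORM. Like the original loop, it keeps the tail of the name when it has to truncate, but it only cuts at a byte that starts a character:

```c
#include <stddef.h>
#include <string.h>

/* Nonzero if byte c starts a UTF-8 character, i.e. it is not a
   continuation byte of the form 10xxxxxx. */
static int utf8_is_start(unsigned char c)
{
    return (c & 0xC0) != 0x80;
}

/* Number of characters (code points) in a NUL-terminated UTF-8 string. */
static int utf8_count(const char *s)
{
    int n = 0;
    for ( ; *s; s++ ) {
        if ( utf8_is_start((unsigned char)*s) ) n++;
    }
    return n;
}

/*
 * Hypothetical helper, not FORM's actual code: write s right-aligned in a
 * field of `width` characters starting at t and return the new write
 * position. The output is not NUL-terminated. The caller must provide room
 * for up to width spaces plus strlen(s) bytes.
 */
static char *put_right_utf8(char *t, const char *s, int width)
{
    int n = utf8_count(s);
    size_t len;

    if ( width < 0 ) width = 0;

    /* Too short: pad on the left, one space per missing character. */
    while ( n < width ) { *t++ = ' '; n++; }

    /* Too long: drop whole leading characters, each together with its
       continuation bytes, so the cut never lands inside a character. */
    while ( n > width ) {
        s++;
        while ( *s && !utf8_is_start((unsigned char)*s) ) s++;
        n--;
    }

    /* Copy the remaining bytes unchanged. */
    len = strlen(s);
    memcpy(t, s, len);
    return t + len;
}
```

With this, put_right_utf8(buf, "[Fɱ]", 6) writes two spaces followed by the 5 bytes of the name, and truncating "[xሴ]" to width 2 leaves "ሴ]". Note that this counts code points, which is still only an approximation of the display width: combining characters and double-width characters would need extra handling.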

In the comments you can report other places where UTF-8 support has to be added.

vermaseren commented 6 months ago

What do the various compilers do with this?

benruijl commented 6 months ago

The compilers are agnostic to this, as for UTF-8 you can still use the regular char type. You just need to change some functions that involve string lengths.

Your terminal should render UTF-8 properly (you can try it with the example, but your own text editor may not properly render it).

vermaseren commented 6 months ago

I still do not see the need. Many things in the compiler part of FORM go by means of character table search to determine what type of character we have. My editor does not support it, and I need that editor for analysing very big files. Much bigger than vi and emacs can handle. And what good does it serve? That you can put your commentary in Chinese? Maybe the commentary actually has no problems with it.

You can do with Symbolica what you want, but I do not think the Russians and the Chinese are going to get you a lot of money.

Just my opinion of course, and a bit outspoken at that… See you later in the week.

Jos

tueda commented 6 months ago

I guess math & physics people may prefer using Greek letters for variables, just like when they write equations by hand. This is possible in languages such as Python, Julia and Mathematica.

vermaseren commented 6 months ago

Yes, but the habit is to write that as phi, rather than the actual \phi, which is not on most keyboards anyway. And the dictionaries in Form let you print it almost any way you want.

tueda commented 6 months ago

Usually, they are typed using a software keyboard... And the dictionaries may fail to handle Unicode characters, I guess?

vermaseren commented 6 months ago

About the dictionaries, that would be a very limited change, because the characters in the dictionaries have basically no meaning for Form. But the following seems more serious: all characters in Form have a type assigned to them, as in FG.cTable. If you had to assign types to most of the Unicode characters, you would be busy for quite some time. I see it as an unnecessary feature that is in principle totally irrelevant and only costs a lot of manpower to implement properly.

benruijl commented 6 months ago

The FG.cTable is skipped for all names that start with [, so these issues are avoided. I think one of the few changes is the string length for the alignment.
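
To illustrate why bracketed names can get away without any per-character classification, here is a hypothetical scanner. It is not FORM's tokenizer, and bracketed_name_length is a made-up name. It relies only on a property of UTF-8: every byte of a multi-byte character is 0x80 or higher, so the ASCII byte ']' can never occur inside a character and a plain byte-wise search for it is safe:

```c
#include <stddef.h>

/*
 * Hypothetical scanner, not FORM's actual code: if s starts with '[',
 * return the length in bytes of the bracketed name including both
 * brackets. Return 0 if s does not start with '[' or if the closing ']'
 * is missing. Nested brackets are not handled in this sketch.
 */
static size_t bracketed_name_length(const char *s)
{
    const char *p = s;

    if ( *p != '[' ) return 0;
    p++;
    /* Bytes between the brackets are treated as opaque: no table lookup. */
    while ( *p && *p != ']' ) p++;
    if ( *p != ']' ) return 0;
    return (size_t)(p - s) + 1;
}
```

For the input "[xሴ]^2" this returns 6, the byte length of "[xሴ]", even though the name is only 4 characters long.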

vermaseren commented 6 months ago

If you are only talking about names between [ ], then of course things might be possible, although I do not really see the fun of it. I am more worried about other things that could be affected, like using a variable whose name is a Greek character. A real Greek character. Or whatever.

benruijl commented 6 months ago

I am thinking of a cross-tool unified format that can be used to import and export symbol definitions and expressions between Form/Mathematica/Symbolica. For this format it would be great if unicode symbols can be understood by all tools. So for example, if you use π in Mathematica, it can be converted to [π] in Form.
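
A rough sketch of what the Form side of such a conversion might look like. The function to_form_name is hypothetical and not part of Form, Mathematica or Symbolica. It assumes that plain ASCII names can be passed through unchanged (ignoring ASCII names that Form would not accept as they are) and wraps anything containing a non-ASCII byte in []:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical exporter helper: copy `name` to `out`, wrapping it in []
 * if it contains any non-ASCII byte. Returns 0 on success, or -1 if
 * `out` (of size outsize bytes) is too small for the result plus NUL.
 */
static int to_form_name(const char *name, char *out, size_t outsize)
{
    size_t len = strlen(name);
    size_t i;
    int needs_brackets = 0;

    /* Every byte of a multi-byte UTF-8 character has its high bit set. */
    for ( i = 0; i < len; i++ ) {
        if ( (unsigned char)name[i] >= 0x80 ) { needs_brackets = 1; break; }
    }
    if ( len + (needs_brackets ? 2 : 0) + 1 > outsize ) return -1;

    if ( needs_brackets ) *out++ = '[';
    memcpy(out, name, len);
    out += len;
    if ( needs_brackets ) *out++ = ']';
    *out = '\0';
    return 0;
}
```

Calling to_form_name("π", buf, sizeof buf) fills buf with "[π]", while to_form_name("x", buf, sizeof buf) leaves the name as "x".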

vermaseren commented 6 months ago

If that is the only feature you need, I guess somebody could have a look at it. But I do not have the time and motivation for it. My personal problem with this is that my editor cannot handle it either. Of course things are much easier if they are built in from the start, as you may have done with Symbolica, but Form is originally from the 80’s when these problems were still way past the horizon. In addition Form is not quite the interactive workbench that Mathematica wants to be. I also never saw any reason to put in ‘pretty printing’. That just forces you to put in a lot of work, and for the target group of Form it makes little difference. Btw, Form 5.0 will have pi_.

vermaseren / form

Improved UTF-8 support #528