Hangs when pasting a string containing U+00A0(No-break space)

m-matsubara commented 4 months ago

Fixed a bug where U+00A0 (No-break space) was handled incorrectly, causing a hang when pasting U+00A0. In addition, every place that used #160 now uses WideChar(#$00A0). ~~SynEdit.pas:8066 also corrected the wrong conditional expression. (or → and)~~

pyscripter commented 4 months ago

Committed the fix to DoShiftTabKey.

I cannot reproduce the crash when pasting a string containing U+00A0. Could you please open an issue with detailed instructions for reproducing it?

I don't see how replacing #160 with WideChar(#$00A0) helps. Although that form is used in some places in the source code, the typecast is redundant. I would probably settle for just #$A0, but then we should enforce it consistently throughout, and I don't think it is worth the trouble.

From the Delphi docs:

A control string is a sequence of one or more **control characters**, each of which consists of the # symbol followed by an unsigned integer constant from 0 to 65,535 (decimal) or from $0 to $FFFF (hexadecimal) in UTF-16 encoding, and denotes the character corresponding to a specified code value. Each integer is represented internally by 2 bytes in the string.

pyscripter commented 4 months ago

I still cannot see the reason for the change. For example,

while (P^ >= #1) and ((P^ <= #32) or (P^ = #160))

is more consistent than

while (P^ >= #1) and ((P^ <= #32) or (P^ = #$00A0))

And in any case how is this related to the bug in the title?

m-matsubara commented 4 months ago

Thank you.

To reproduce, simply paste the text containing U+00A0. (I wanted to include a text sample, but U+00A0 in the comment seemed to be converted to U+0020.)

The hang occurs because the case statement at line 6108 in function TCustomSynEdit.TextWidth(P: PChar; Len: Integer): Integer; does not match U+00A0, resulting in an infinite loop.

The type cast to WideChar is certainly unnecessary; I wrote it that way to match function TCustomSynEdit.IsWordBreakChar(AChar: WideChar): Boolean;. I have now removed the type cast. Note that #$A0 does not solve the problem; it seems that you need to write #$00A0.
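A minimal sketch of why #$A0 and #$00A0 can behave differently: the Delphi documentation for {$HIGHCHARUNICODE} states that, with the directive OFF, decimal literals and two-digit hexadecimal literals are parsed as AnsiChar, while four-digit hexadecimal literals are parsed as WideChar. The program below prints the UTF-16 code unit each spelling produces. The helper name `FirstCodeUnit` is made up for illustration, and the first two results depend on the system ANSI code page, so no output is shown for them.

```pascal
program HexLiteralWidth;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical helper: returns the first UTF-16 code unit of S.
function FirstCodeUnit(const S: string): Integer;
begin
  Result := Ord(S[1]);
end;

var
  S: string;

begin
{$HIGHCHARUNICODE OFF}
  S := #160;    // decimal literal: parsed as AnsiChar when OFF
  WriteLn('#160   -> ', IntToHex(FirstCodeUnit(S), 4));
  S := #$A0;    // two-digit hex literal: parsed as AnsiChar when OFF
  WriteLn('#$A0   -> ', IntToHex(FirstCodeUnit(S), 4));
  S := #$00A0;  // four-digit hex literal: parsed as WideChar
  WriteLn('#$00A0 -> ', IntToHex(FirstCodeUnit(S), 4));
end.
```

Only the last line is independent of the code page; it should report 00A0 on any system.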

pyscripter commented 4 months ago

> To reproduce, simply paste the text containing U+00A0.

I cannot reproduce this.

After copy paste:

Non breaking spaces are highlighted. Spaces are not.

m-matsubara commented 4 months ago

In the following code, if the character at P^ is U+00A0, the branch on line 6108 is not taken and execution falls through to the break on line 6110.

If you modify #160 to #$00A0, the case on line 6108 will be handled.

6104:    while P < PEnd do
6105:    begin
6106:      case P^ of
6107:         #9: Inc(Result, fTabWidth * fCharWidth - Result mod (fTabWidth * fCharWidth));
6108:         #32..#126, #160: Inc(Result, FCharWidth);
6109:       else
6110:         break;
6111:       end;
6112:    end;
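As a rough, standalone sketch of the proposed change (not the actual SynEdit code): the function below mirrors the quoted loop but spells the no-break space as #$00A0, which is parsed as a WideChar whatever {$HIGHCHARUNICODE} is set to. The name `MeasureWidth`, the `TabWidth` and `CharWidth` parameters (standing in for the fTabWidth and fCharWidth fields), and the `Inc(P)` step are assumptions added so the example compiles and terminates on its own.

```pascal
program TextWidthSketch;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

// Hypothetical stand-in for the loop quoted above.
function MeasureWidth(P: PChar; Len, TabWidth, CharWidth: Integer): Integer;
var
  PEnd: PChar;
begin
  Result := 0;
  PEnd := P + Len;
  while P < PEnd do
  begin
    case P^ of
      // Advance to the next tab stop.
      #9: Inc(Result, TabWidth * CharWidth - Result mod (TabWidth * CharWidth));
      // #$00A0 is a four-digit hex literal, so it always means U+00A0.
      #32..#126, #$00A0: Inc(Result, CharWidth);
    else
      Break; // stop at the first character that is not handled here
    end;
    Inc(P); // assumed: the quoted excerpt does not show how P advances
  end;
end;

var
  S: string;

begin
  S := 'a' + #$00A0 + 'b';
  // Three fixed-width characters at 10 units each.
  WriteLn(MeasureWidth(PChar(S), Length(S), 8, 10)); // prints 30
end.
```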

pyscripter commented 4 months ago

> If you modify #160 to #$00A0, the case on line 6108 will be handled.

Is this a compiler bug or what?? Do you get this in both 32 bits and 64 bits? Which Delphi version are you using? As I said above I cannot reproduce it here (only tried Win64).

m-matsubara commented 4 months ago

Sorry.

The following issue may be the cause: RSS-391, "String with non-ASCII characters directly attached to a #xx or #$xx literal corrupts the final string".

https://blogs.embarcadero.com/rad-studio-12-1-athens-patch-1-available/

I'll investigate. Please wait a moment.

m-matsubara commented 4 months ago

I tried it with Delphi 12 and Delphi 12.1 patch 1, but neither worked properly.
(Target platform is Windows 64bit)

pyscripter commented 4 months ago

Very strange. I am also using Delphi 12 with patch 1. (The patch fixes an unrelated issue).
Win32 or Win64? Any compiler options that may affect this?

pyscripter commented 4 months ago

Can you run the following console app?

program CharTest;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  P : PChar;
  S: string;

begin
  S := #160;
  P := PChar(S);
  case P^ of
    #32: WriteLn('Space');
    #160: WriteLn('NB Space');
  end;

  ReadLn;
end.

What do you get?

MShark67 commented 4 months ago

I haven't had enough coffee yet, but could this have anything to do with {$HIGHCHARUNICODE ON/OFF}?

pyscripter commented 4 months ago

@MShark67 You are my hero! This was driving me crazy.

> I haven't had enough coffee yet, but could this have anything to do with {$HIGHCHARUNICODE ON/OFF}?

This indeed might explain it. The docs say the default value is OFF. Is there a compiler option that affects this?

So in the Japanese ANSI code page, #160 corresponds to a different Unicode character (@m-matsubara, could you please confirm this?), while in my ANSI code page and in @MShark67's, Ord(WideChar(#160)) = 160, so it works only by luck.

pyscripter commented 4 months ago

I would suggest the following.

- All character literals greater than 127 will be coded as #$xxxx.
- The unnecessary WideChar typecast as in WideChar(#$00B4) will be removed.
- This PR will be closed and will be replaced by a more comprehensive one covering all SynEdit units.

@m-matsubara @MShark67 Any volunteers for doing this?

m-matsubara commented 4 months ago

Is #160 treated as U+F8F0? It seems that when #160 is processed with MultiByteToWideChar, it becomes U+F8F0.

program Project1;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  P : PChar;
  S: string;

begin
{$HIGHCHARUNICODE OFF}
  S := #160;
  P := PChar(S);
  case P^ of
    #32: WriteLn('Space ' + IntToHex(ord(P^)));
    #160: WriteLn('NB Space ' + IntToHex(ord(P^)));
    else WriteLn('else ' + IntToHex(ord(P^)));
  end;

{$HIGHCHARUNICODE ON}
  S := #160;
  P := PChar(S);
  case P^ of
    #32: WriteLn('Space ' + IntToHex(ord(P^)));
    #160: WriteLn('NB Space ' + IntToHex(ord(P^)));
    else WriteLn('else ' + IntToHex(ord(P^)));
  end;

{$HIGHCHARUNICODE OFF}
  S := #$00A0;
  P := PChar(S);
  case P^ of
    #32: WriteLn('Space ' + IntToHex(ord(P^)));
    #160: WriteLn('NB Space ' + IntToHex(ord(P^)));
    else WriteLn('else ' + IntToHex(ord(P^)));
  end;

{$HIGHCHARUNICODE ON}
  S := #$00A0;
  P := PChar(S);
  case P^ of
    #32: WriteLn('Space ' + IntToHex(ord(P^)));
    #160: WriteLn('NB Space ' + IntToHex(ord(P^)));
    else WriteLn('else ' + IntToHex(ord(P^)));
  end;

  ReadLn;
end.
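To check the MultiByteToWideChar observation directly, a small Windows-only sketch can convert the single byte $A0 under two code pages. The helper name `ByteToWide` is invented for this example, and it assumes code pages 1252 and 932 are both available on the machine. Under 1252 the byte maps to U+00A0; for 932 the thread reports U+F8F0.

```pascal
program AnsiByteToWide;

{$APPTYPE CONSOLE}

uses
  Winapi.Windows, System.SysUtils;

// Hypothetical helper: converts one byte to a UTF-16 code unit
// under the given Windows code page.
function ByteToWide(B: Byte; CodePage: Cardinal): Integer;
var
  Src: AnsiChar;
  Dst: WideChar;
begin
  Src := AnsiChar(B);
  Dst := #0;
  // Returns the number of UTF-16 code units written, or 0 on failure.
  if MultiByteToWideChar(CodePage, 0, @Src, 1, @Dst, 1) = 0 then
    RaiseLastOSError;
  Result := Ord(Dst);
end;

begin
  WriteLn('CP1252: ', IntToHex(ByteToWide($A0, 1252), 4));
  WriteLn('CP932:  ', IntToHex(ByteToWide($A0, 932), 4));
end.
```

If the two lines differ, a #160 literal compiled with {$HIGHCHARUNICODE OFF} cannot be relied on to mean U+00A0 on every developer's machine.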

pyscripter commented 4 months ago

> It seems that when #160 is processed with MultiByteToWideChar, it becomes U+F8F0.

This explains everything.

m-matsubara commented 4 months ago

Is #160 an ASCII (or ANSI) character, and #$00A0 a Unicode character?

pyscripter commented 4 months ago

> #160 is ascii (or ansi) string ? #$00A0 is Unicode string ?

It will be clear if you read the docs.

m-matsubara commented 4 months ago

Thanks. Understood. Then #$00A0 seems appropriate.

pyscripter commented 4 months ago

Created new issue #95. Volunteers to provide a PR are invited.

pyscripter / SynEdit

Hangs when pasting a string containing U+00A0(No-break space) #93