mthom / scryer-prolog

A modern Prolog implementation written mostly in Rust.
BSD 3-Clause "New" or "Revised" License

'愛' yields invalid_single_quoted_character #459

Closed: triska closed this issue 2 years ago

triska commented 4 years ago

Currently, I get:

?- X = '愛'.
caught: error(syntax_error(invalid_single_quoted_character),read_term/2)

Expected: X = '愛' or X = 愛.

matt2xu commented 4 years ago

It's kind of the same problem as the characters in comments. The standard allows only a limited set of characters (graphic, alphanumeric, solo, space) or meta/control/octal/hexadecimal escape sequences. The lexer is implemented to follow these rules, and the 愛 character is indeed not part of this set.

I propose to "relax" the implementation, and invert the logic: if we don't have a quote or escape sequence, just accept any character. What do you think?

triska commented 4 years ago

@UWN could you please comment to ascertain that this is conforming? Thank you!

UWN commented 4 years ago

In 6.5 Processor character set, extended characters are implementation defined. So there is a chance to accept more characters. But going through it is a veritable minefield. SWI has kind of tried it, with questionable success: it works for most cases, but do not look at corner cases.

To accept any character is highly problematic. And accept as what, exactly? There are so many odd character classes. There are so many layout characters and the like. SICStus is a bit too radical in classifying all code points above 255 as lower case, even the upper-case Œ, which prevents one from using the variable Œuvre.

Some simple extension that includes the characters that once existed in the predecessor of Latin-1 plus all non-whitespace characters in quoted lists seems to be good enough for the moment.

triska commented 4 years ago

I would like to attract more Japanese users to Scryer Prolog, and for them atoms with Japanese characters are very important.

Is there a way to at least support the case where atoms are enclosed in single quotes? Would the suggested approach work for this?

mthom commented 4 years ago

Japanese characters tend to be in the range of (I believe) 6-byte graphemes. Currently, the unit of parsing is Rust's 4-byte char type, which corresponds to Unicode scalar values. The unicode-reader library that prolog_parser now uses might accommodate graphemes (I think it does). Porting the parser and the rest of Scryer over to it is a fair bit of work, but should be doable.

UWN commented 4 years ago

Lest I forget, my MOCSH collection (= multi octet character set handling).

matt2xu commented 4 years ago

Japanese characters tend to be in the range of (I believe) 6-byte graphemes.

I was under the same impression that a UTF-8 character could use up to 6 bytes, but that no longer appears to be the case, as UTF-8 has been limited to the range 0000-10FFFF (same as UTF-16, apparently) since 2003: https://tools.ietf.org/html/rfc3629

Currently, the unit of parsing is Rust's 4-byte char type, which corresponds to Unicode scalar values.

Even if we needed to use 6-byte (encoded) sequences, we would still use 4-byte (decoded) chars, so there would not be a problem.

Is there a way to at least support the case where atoms are enclosed in single quotes? Would the suggested approach work for this?

I think we need to distinguish between variable names and quoted chars/strings. In my experience, programming languages usually allow anything in character literals and string literals, whereas identifiers are more limited. I don't really understand why Prolog would limit what characters can go in a single-quoted character. While allowing Unicode characters in identifiers can be tricky, as @UWN shows, I don't see the downside of allowing any characters between quotes, hence my proposal :)

See for instance how other programming languages handle character and string literals.

What do you think?

triska commented 4 years ago

It certainly looks very sensible! However, the consequences would be quite profound, and the details to solve it are very involved:

One important guarantee that is ensured by the Prolog ISO standard is that you can print the source code of a program (I mean physically print it, on a piece of paper), and see everything that is necessary to type it from the paper to obtain the exact same program.

When we allow additional characters, such as Unicode spaces (think of U+00A0 NO-BREAK SPACE), in quoted strings and atoms, then this is no longer guaranteed.

Consequently, to preserve this property, we must somehow sensibly restrict what we allow in strings and quoted atoms. It would be ideal to find out whether any other programming language has already solved this issue so that this specific concern is also addressed.

matt2xu commented 4 years ago

One important guarantee that is ensured by the Prolog ISO standard is that you can print the source code of a program (I mean physically print it, on a piece of paper), and see everything that is necessary to type it from the paper to obtain the exact same program.

Does the standard say that? How is that important?

When we allow additional characters, such as Unicode spaces (think of U+00A0 NO-BREAK SPACE), in quoted strings and atoms, then this is no longer guaranteed.

If this is a concern, you can still use the existing "escape sequence" mechanism.

Note that Rust has many methods readily available to check whether a given character is alphabetic, uppercase, etc., mirroring the definitions in Unicode. You could have a Prolog text with variable names like Œuvre or Δ and it would work, and still be conforming (though not strictly conforming). It would be a nice feature :)

triska commented 4 years ago

Paper is a very important medium, for example for archival purposes, and this is a guarantee that is well worth preserving.

Hexadecimal escape sequences are a very good suggestion! They are defined in the standard as:


hexadecimal escape sequence (* 6.4.2.1 *)
   = backslash char (* 6.5.5 *),
     symbolic hexadecimal char (* 6.4.2.1 *),
     hexadecimal digit char (* 6.5.2 *),
     { hexadecimal digit char (* 6.5.2 *) } ,
     backslash char (* 6.5.5 *) ;

symbolic hexadecimal char (* 6.4.2.1 *)
   = "x" ;

Currently, we get for example:

?- Cs = "\x2124\".
caught: error(syntax_error(cannot_parse_big_int),read_term/2:0)

The character set is implementation defined:

6.5 Processor character set

The processor character set PCS is an implementation
defined character set. The members of PCS shall include
each character defined by char (6.5).

PCS may include additional members, known as extended
characters. It shall be implementation defined for each
extended character whether it is a graphic char, or an
alphanumeric char, or a solo char, or a layout char, or a
meta char.

Regarding strictly conforming:

  e) Offer a strictly conforming mode which shall reject
  the use of an implementation specific feature in Prolog
  text or while executing a goal.

where "implementation specific" means:

3.92 implementation specific: Undefined by this part of
ISO/IEC 13211 but supported by a conforming processor.

As far as I can tell, it would be strictly conforming to support extended characters (they are implementation defined). The key issue is how to best incorporate them. I think providing the mentioned escape sequences for extended hexadecimal sequences would definitely be a very nice initial addition!

matt2xu commented 4 years ago

Currently, we get for example:

?- Cs = "\x2124\".
caught: error(syntax_error(cannot_parse_big_int),read_term/2:0)

This should count as a bug IMO. I saw that you opened issue #470 for that; I think this is something I can fix!

As far as I can tell, it would be strictly conforming to support extended characters (they are implementation defined). The key issue is how to best incorporate them. I think providing the mentioned escape sequences for extended hexadecimal sequences would definitely be a very nice initial addition!

I think hexadecimal sequences would definitely be interesting to support. Regarding extended characters, a conservative approach could be to support characters that are alphabetic, with the proper case class (so capital letter chars would be chars that are uppercase). This would satisfy the "printable" property, would be backward-compatible, would be relatively easy to implement, and would not raise too many questions (e.g., what about digits, control characters and other non-visible characters); a small illustration follows. What do you think?
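
Hypothetical behaviour under this classification (assuming extended alphabetic characters keep their Unicode case class):

?- X = œuvre.
   X = œuvre.
?- Œuvre = f(x).
   Œuvre = f(x).

(œ would be a small letter char, so œuvre reads as an atom; Œ would be a capital letter char, so Œuvre reads as a variable.)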

triska commented 4 years ago

Thank you a lot for looking into the issue of hexadecimal sequences; that would already be a very nice addition!

Regarding further extensions, I think it would be a huge step forward to find a precedent in other programming languages for any considered extension, and to study its consequences.

If you are interested in standardisation work on this feature, please contact the French member body AFNOR for delegation to ISO SC22 WG17. Working out such a conforming extension would be a tremendous contribution to the Prolog standard, and would make for a very nice project in the working group in which other members would certainly also be interested!

UWN commented 4 years ago

You could have a Prolog text with variable names like Œuvre or Δ and it would work, and still be conforming (though not strictly conforming).

If a system defines its PCS (processor character set) to include extended characters like the above, then it is still strictly conforming, provided it uses the existing character classification scheme for those extended characters. However, there is currently no character class that is allowed in quoted tokens and in comments but nowhere else. That would currently be an extension per 5.5.1.

Does the standard say that? How is that important?

It says it implicitly, by disallowing e.g. layout other than space in quoted tokens. How is that important? In the olden tymes, paper and punched cards were important. Think of a text like " \n". but written with a real newline instead of the escape: in text mode this is indistinguishable from any other sequence of spaces. Today, it's screenshots, copies of terminals and the like. Note that in Prolog we often just use writeq/1 to "serialize" some data, and later we read that back in. Similar things happen for diagnostic purposes. Syntax is not just for reading, but also for writing, which makes many issues much more grave.
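
As a small illustration (the exact output may vary between systems), writeq/1 must emit quoted tokens in a form that read/1 can read back, which is why a newline inside an atom is written using its escape:

?- writeq('a\nb'), write('.'), nl.
'a\nb'.
   true.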

matt2xu commented 4 years ago

Thank you a lot for looking into the issue of hexadecimal sequences; that would already be a very nice addition!

Done!

If you are interested in standardisation work on this feature, please contact the French member body AFNOR for delegation to ISO SC22 WG17. Working out such a conforming extension would be a tremendous contribution to the Prolog standard, and would make for a very nice project in the working group in which other members would certainly also be interested!

I just contacted AFNOR; I have no idea where this will lead me :smile: I think we can close this issue now, as I don't see other actions we could take. Do you agree?

UWN commented 4 years ago

It would be very fine to welcome the founding nation back to WG17!

If closing is appropriate, can we mark this thread such that it will be easily retrievable?

triska commented 4 years ago

I have contacted a Japanese specialist and asked for his input and recommendation.

Please let us keep this issue open for a while as a sign that we would welcome more feedback on this issue. I hope that this motivates interested contributors to help with these questions.

Thank you for contacting AFNOR, this is a great step towards normative work on this topic!

david-sitsky commented 2 years ago

I was hit by this issue today and was wondering: has there been any progress? I am curious to hear feedback from the Japanese specialist or about any progress re: standards.

This is perhaps a separate (but related) issue to report, but I can't even use strings which contain Japanese characters. I can certainly understand restrictions in identifiers, and perhaps atoms (although that seems unnecessarily limiting), but preventing strings from containing any kind of Unicode character is a non-starter for any serious application:

?- X = "愛".
caught: error(syntax_error(missing_quote),read_term/3:0)

Curiously, using an escape sequence helps, although the list of characters returned is empty; perhaps this is related to what is being reported here? Without this capability, Unicode string processing in Scryer is not truly possible.

?- X = "\x611b".
   X = [].
UWN commented 2 years ago

For one, X = "\x611b". is invalid syntax, see #1354.

UWN commented 2 years ago

A closing \ (backslash) is needed:

?- X = "\x611b\", X = [C].
   X = "\x611b\", C = '\x611b\'.

So the character seems to be representable.

david-sitsky commented 2 years ago

Ah, I see. I was following the escape examples written earlier, but it is good to see this is representable as characters.

Should the issue below (using Unicode characters directly in a string, without escaping) be handled as a separate bug then, or is it related to this issue?

?- X = "愛".
caught: error(syntax_error(missing_quote),read_term/3:0)
triska commented 2 years ago

This is the same issue: A double-quoted list as in your example is a list of one-char atoms:

6.3.7 Terms - double quoted list notation

A double quoted list is either an atom (6.3.1.3) or a list
(6.3.5).

If the Prolog flag double_quotes has a value chars, a
double quoted list token dql containing L double
quoted characters is a list i with L elements, where
the N-th element of the list is the one-char atom whose
name is the N-th double quoted character of dql.

One-char atoms that appear in double-quoted lists should also be accepted as standalone atoms, and vice versa.
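
For illustration, with the double_quotes flag set to chars (Scryer's default), a double-quoted list unifies element-wise with its one-char atoms:

?- set_prolog_flag(double_quotes, chars), Ls = "abc", Ls = [L|_].
   Ls = "abc", L = a.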

triska commented 2 years ago

I now get:

?- X = '愛'.
   X = '愛'.
?- Ls = "私はあなた", Ls = [L|Rs].
   Ls = "私はあなた", L = '私', Rs = "はあなた".

So, this now works perfectly, and Scryer Prolog seems now very well suited for conveniently processing strings in many languages. Thank you a lot!

haskie-lambda commented 2 years ago

I know this has been closed for a while now, but I am experiencing something very similar with emojis:

Using json_chars, I'm parsing API output that contains emoji characters such as ✅. The resulting JSON is serialized using writeq/2 or write_term/3 and written to a file. This file is read later on using consult or read/2. However, when the JSON contains Unicode emojis such as ✅, I am getting the error

error(syntax_error(missing_quote),read/2:0).

This sounds suspiciously like the issue you had with the Japanese characters. That was changed/fixed for those characters, but the underlying problem remains. Even if this is not about to change due to the effort it takes, how would I get around this issue efficiently? Shouldn't read be able to understand and parse anything that write* emits?

UWN commented 2 years ago

(Edit: this was a bad example)

?- open(fiche,write,S),writeq(S,p("\2705\",'\2705\')),put_char(S,.),close(S).
   S = '$dropped_value'.
?- [fiche].
   true.
?- p([X],X).
   X = ׅ.
?- halt.
ulrich@TU-Wien:/tmp$ od -c fiche
0000000   p   (   " 327 205   "   , 327 205   )   .
0000013

Same result?

haskie-lambda commented 2 years ago

So write and read are not supposed to be mutual inverses?

Nevertheless, is there a sensible way to serialize the parsed JSON so that it can be deserialized again, with Unicode emojis intact?

UWN commented 2 years ago

Maybe you want to give a more reproducible example of your problem.

haskie-lambda commented 2 years ago

I have a file tickets.json:

{ "ticket": "❌ not done yet" }

using

?- use_module(library(pio)).
?- use_module(library(serialization/json)).
?- use_module(library(charsio), [get_n_chars/3]).

?- phrase_from_file(json_chars(JSON),'tickets.json'),
    open('tickets.pl',write,File),
    writeq(File,JSON),
    write(File,'.'),
    close(File).

I write the parsed version of the ticket to tickets.pl which now looks like this:

pairs([string("ticket")-string("❌ not done yet")]).

Now I would like to query this structure, so I load it from the file:

?- open('tickets.pl',read,File), read(File,ParsedJSON), close(File).

But instead of being able to view ParsedJSON, I get

error(syntax_error(missing_quote),read/2:0).

because of the emoji character. Replacing the ❌ in tickets.json with a normal x results in everything working out:

?- open('tickets.pl',read,File), read(File,ParsedJSON), close(File).
   File = '$dropped_value', ParsedJSON = pairs([string("ticket")-string("x not done yet")]).
UWN commented 2 years ago

Now I realize that I was just too sloppy; the test should have been:

?- open(fiche,write,S),writeq(S,p("\x2705\",'\x2705\')),put_char(S,.),close(S).
   S = '$dropped_value'.
?- [fiche].
   error(syntax_error(missing_quote),read_term/3:0), unexpected.
   false.
?- L="\x2705\".
   L = "✅".
?- L = "✅".
   error(syntax_error(missing_quote),read_term/3:0), unexpected.

So this is still a problem.

triska commented 2 years ago

Related: #1515.

triska commented 2 years ago

Related: #1549.

triska commented 1 year ago

With the latest git version of Scryer, the example from https://github.com/mthom/scryer-prolog/issues/459#issuecomment-1327697204 works as expected:

?- L = "✅".
   L = "✅".

The examples from https://github.com/mthom/scryer-prolog/issues/459#issuecomment-1327293682 also work correctly now.