UTF-16 characters in strings results in syntax error

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Create a new proto definition file as follows:

message Foo {
    optional string bar = 2 [default="\302\265"]; //UTF-16 micro (metric prefix)
}

2. Observe that the UTF-16 in the default value is considered an error by the 
XText Editor.

3. Observe that protoc outputs UTF-16 formatted strings when you convert a 
descriptor back to a file. i.e. DebugString

What is the expected output? What do you see instead?

That's debatable - I'm not sure anyone ever intended protoc to use UTF-16 
strings, but they currently work.  I'm not sure how difficult it would be to 
modify the definition of a String in XText - maybe this issue should be posted 
to them for the base grammar support to add UTF-16 strings.

What version of the product are you using? On what operating system?

1.0.0
Eclipse 3.7 PDE
Windows 7 x64

Please provide any additional information below.

Probably a low priority issue as I doubt many others are using UTF-16 strings.

Original issue reported on code.google.com by compuwar...@gmail.com on 4 Aug 2011 at 3:07

GoogleCodeExporter commented 9 years ago

Original comment by alr...@google.com on 4 Aug 2011 at 8:07

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

In the grammar the terminal STRING is overridden, but there is validation built 
into XText / EMF / Antlr that ignores the actual grammar for String.  As such, 
the terminal STRING cannot be modified to accomodate escaped unicode characters 
- an alternate terminal must be created.

Also, we need to keep the override on the STRING terminal to ensure that it 
matches BEFORE the UNICODE_STRING terminal because things such as importURI 
require that the terminal be a STRING to work correctly...

My Solution is to modify Protobuf.xtext as follows:

StringRef:
  string=UNICODE_STRING | string=STRING;

terminal STRING:
  SL_STRING (SL_STRING)*;

terminal SL_STRING:
  '"' ('\\' ('b' | 't' | 'n' | 'f' | 'r' | 'u' | '"' | "'" | '\\') | !('\\' | '"'))* '"' (WS)*;

terminal UNICODE_STRING:
  UNICODE_SL_STRING (UNICODE_SL_STRING)*;

terminal UNICODE_SL_STRING:
  '"' ('\\' ('b' | 't' | 'n' | 'f' | 'r' | 'u' | '"' | "'" | '\\' | UNICODE_OCTAL) | !('\\' | '"'))* '"' (WS)*;

terminal UNICODE_OCTAL:
    ('0'..'3')('0'..'7')('0'..'7');

Which will allow octal unicode characters to be included in a string.

(on a side note... can anyone enlighten me on the purpose of STRING vs 
SL_STRING?)

The UNICODE_STRING terminal must then have syntax highlighting added so that 
it's treated just like any other string...

So in the com.google.eclipse.protobuf.ui project the class
com.google.eclipse.protobuf.ui.editor.syntaxcoloring.ProtobufSemanticHighlightin
gCalculator

The following methods need updates...
  private void highlightOptions(Field field, IHighlightedPositionAcceptor acceptor) {
    for (FieldOption option : field.getFieldOptions()) {
      String highlightId = fieldOptions.isDefaultValueOption(option) ? KEYWORD_ID : DEFAULT_ID;
      highlightName(option, acceptor, highlightId);
      ValueRef ref = option.getValue();
      if (ref instanceof LiteralRef) {
        highlightFirstFeature(option, FIELD_OPTION__VALUE, acceptor, ENUM_LITERAL_ID);
        return;
      }
      if (ref instanceof NumberRef) {
        highlightFirstFeature(option, FIELD_OPTION__VALUE, acceptor, NUMBER_ID);
        return;
      }
      if (ref instanceof StringRef) {
        highlightFirstFeature(option, FIELD_OPTION__VALUE, acceptor, STRING_ID);
      }
    }
  }

  private void highlight(Option option, IHighlightedPositionAcceptor acceptor) {
    highlightName(option, acceptor, DEFAULT_ID);
    ValueRef ref = option.getValue();
    if (ref instanceof LiteralRef) {
      highlightFirstFeature(option, OPTION__VALUE, acceptor, ENUM_LITERAL_ID);
      return;
    }
    if (ref instanceof NumberRef) {
      highlightFirstFeature(option, OPTION__VALUE, acceptor, NUMBER_ID);
      return;
    }
    if (ref instanceof StringRef) {
        highlightFirstFeature(option, OPTION__VALUE, acceptor, STRING_ID);
    }
  }

I've attached the modified files and a screenshot of the working octal unicode 
strings...

Not sure of any other implications from not using the built-in STRING terminal 
rule but I searched around for code references and couldn't find any other than 
the lack of highlighting.

PS:  I don't know how to run all the tests in the test projects - so I can't 
confirm they still all pass with this change.

Original comment by compuwar...@gmail.com on 4 Aug 2011 at 9:20

Attachments:

GoogleCodeExporter commented 9 years ago

I've been looking at the unit test protos using protobuf-dt and noticed that 
this likely overlaps with a "failed" test case right now in unittest.proto (and 
others).

ByteStrings use escaped bytes but the 'hard coded' EMF STRING type prevents 
those from working quite right as well.

unittest.proto line 467

message TestExtremeDefaultValues {
  optional bytes escaped_bytes = 1 [default = "\0\001\a\b\f\n\r\t\v\\\'\"\xfe"];

results in a syntax error as the string escaping does not work as expected by 
protoc

Original comment by compuwar...@gmail.com on 25 Aug 2011 at 3:50

GoogleCodeExporter commented 9 years ago

The prior suggested solution can be updated to account for the information in 
the unit test cases as such:

And UNICODE_SL_STRING needs to be updated (maybe this should be PROTO_SL_STRING 
or something better named?)

terminal UNICODE_SL_STRING:
  '"' ('\\' ('a' | 'b' | 't' | 'n' | 'f' | 'r' | 'u' | 'v' | 'x' (NUMBER | 'a'..'f' | 'A'..'F') (NUMBER | 'a'..'f' | 'A'..'F') | '"' | "'" | '\\' | UNICODE_OCTAL) | !('\\' | '"'))* '"' (WS)*;

which now includes the escape characters used in the unit test 'a', 'v', and 
hexadecimal character notation 'x##' where # is 0-F

Also, the previously suggested solution is flawed in that terminal call 
UNICODE_OCTAL as it can match terminals outside the scope of a string.

the correct replacement is to fold that into the UNICODE_SL_STRING terminal as 
such:

terminal UNICODE_SL_STRING:
  '"' ('\\' ('a' | 'b' | 't' | 'n' | 'f' | 'r' | 'u' | 'v' | 'x' (NUMBER | 'a'..'f' | 'A'..'F') (NUMBER | 'a'..'f' | 'A'..'F') | '"' | "'" | '\\' | ('0'..'3')('0'..'7')('0'..'7')) | !('\\' | '"'))* '"' (WS)*;

Original comment by compuwar...@gmail.com on 25 Aug 2011 at 4:22

GoogleCodeExporter commented 9 years ago

r6a51f6bb3f25

Original comment by alr...@google.com on 9 Sep 2011 at 8:02

Changed state: Fixed
Added labels: Milestone-Release-1.0.1

mycoolmc / protobuf-dt

UTF-16 characters in strings results in syntax error #91