protocolbuffers / protobuf

Protocol Buffers - Google's data interchange format
http://protobuf.dev
Other
65.71k stars 15.5k forks source link

Parser erroneously allows whitespace in package declaration. #2128

Open JakeWharton opened 8 years ago

JakeWharton commented 8 years ago

We had a case where protoc allowed a package declaration that had newlines in it but other tooling based on the spec failed to parse the file. I believe this should have failed to parse inside protoc as well.

Relevant parts of the spec:

package = "package" fullIdent ";"
fullIdent = ident { "." ident }
ident = letter { letter | decimalDigit | "_" }
letter = "A" … "Z" | "a" … "z"
decimalDigit = "0" … "9"

Example showing protoc allowing newlines:

$ echo "package example.

foo

;

message Foo {}" > test.proto

$ protoc --version
libprotoc 3.0.0

$ protoc --java_out=. test.proto
[libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: test.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.)

$ tree
.
├── example
│   └── foo
│       └── Test.java
└── test.proto
haberman commented 7 years ago

Yes this seems like a bug.

xfxyjwf commented 6 years ago

@anandolee It's the syntax spec that needs a fix here. A package name spreading across multiple lines is needed when the name is very long. For example:

message Foo {
  optional a.very.very.long.name.that.can.not.fit.into.one.line.without.exceeding
      .80.char.limit.Bar value = 1;
}
haberman commented 6 years ago

@xfxyjwf Is the syntax spec checked in anywhere? What actually needs to be fixed here?

xfxyjwf commented 6 years ago

The syntax spec here: https://developers.google.com/protocol-buffers/docs/reference/proto2-spec

The spec needs to specify where spaces are allowed and where not.

fowles commented 2 years ago

I suspect this is still a bug...

jhump commented 2 years ago

There is definitely code in the wild that expects qualified names (dot-separated sequence of identifiers, such as in type names, extension names in message literals and custom options) to allow whitespace, including newlines, in between name components. (Admittedly, this might be less likely in package declarations.) But either way, this would not be a backwards compatible change.

There is definitely an issue with the clarity in the spec (or lack thereof). Which productions are lexical elements (where whitespace is generally disallowed between adjacent rule references)? Which are syntax elements (where whitespaces are not only allowed but sometimes required between adjacent rule references)? The whole dichotomy of tokenization vs. parsing is not mentioned at all, and there is nothing in the spec about whitespace or comments.

jhump commented 2 years ago

FWIW, most languages that use dots as a "selector" operator for building qualified names do support whitespaces around the dots, even if most users don't ever write code that way. C++, Java, Go, etc...

Admittedly, Go does not allow you to define a package name with a dot -- it must be a simple identifier (and the fully-qualified name is its import path). But still, the following are both valid:

// test.go
package main
import "fmt"
import "runtime"
func main() {
 fmt . Print ( runtime. GOMAXPROCS ( -1))
}
// test.java
package com . bufbuild . parser;

import java . util .Arrays;

class Test {
  public static void main(String[] args) {
    System . out . println ( Arrays . asList(args) );
  }
}

Having said all that, there is one place where the extra whitespace is more likely to be considered a bug: type URLs in the text format for Any messages.

// this is valid
syntax = "proto3";

package foo.bar;

import "google/protobuf/any.proto";
import "google/protobuf/descriptor.proto";

extend google.protobuf.MessageOptions {
  google.protobuf.Any xtra = 10101;
}

message Foo {
  string s = 1;
  uint64 i = 2;

  option (xtra) = {
    [ type . googleapis . com / foo . bar . Foo ] < s: "foo",  i: 123 >
  };
}
github-actions[bot] commented 4 months ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago.

Logofile commented 4 months ago

AFAIK, this hasn't been fixed and should remain open.

That said, I'm not sure that this is solely a documentation issue. In the original description, @JakeWharton points out that "protoc allowed a package declaration that had newlines in it but other tooling based on the spec failed to parse the file." That's seems to point to a functionality issue along with the documentation issues.

If we do want to update the spec to reflect the ability to have whitespace characters between the package name elements, I'll need some direction on how to show that. I'm not EBNF proficient.

github-actions[bot] commented 1 month ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please add a comment.

This issue is labeled inactive because the last activity was over 90 days ago. This issue will be closed and archived after 14 additional days without activity.

JakeWharton commented 1 month ago

That's not how bugs work. Bad bot.