Generated module names contain spurious (er, superfluous) underscores

nolanamy commented 11 years ago

Using a package name like package greeting_api; should generate code in a module like module GreetingApi. Instead, the underscore remains and we get module Greeting_Api.

Source: https://gist.github.com/nolanamy/5290516 Generated: https://gist.github.com/nolanamy/5290512

Is a continue needed at https://github.com/localshred/protobuf/blob/master/ext/ruby_generator/RubyGenerator.h#L128?

localshred commented 11 years ago

Ironically, issue #56 is almost the exact opposite of this request. We used to ignore underscores in a namespace/package, but now we support them.

You are correct on the continue line, it was the only thing changed to fix #56.

I guess one way we could approach this is to remove underscores in a package name and leave them in a message name. This kind of feels weird to me, to have a double standard on "constantizing" a token.

What do you think?

nolanamy commented 11 years ago

Interesting! I agree, ideally, there's only one constantizing standard. In particular, message names must be constantized the same when they're declared and when they're used (see #82). Though it's hard to say what's the right thing to standardize on... Here are some options:

Only convert . to ::. This will keep things consistent (without requiring constantization of class names), but non-capitalized message names will cause syntax errors when the generated file is used.
- The Style Guide specifies CamelCase (with an initial capital) for .proto message names, so in theory there should be no need to CamelCasify them.
Capitalize the first letters of segments, but leave underscores in place. This will require constantizing class names as well (to resolve #82).
Capitalize first letters and remove underscores. Again, the same scheme will have to be applied to class names to keep it consistent.
Capitalize first letters and only remove underscores when followed by a lowercase letter. More complication than it's worth, I'd bet.

nolanamy commented 11 years ago

For now, I'll just use package GreetingApi to get the effect I was looking for. I'm using the same protobuf file in Java, so this might cause problems there, but it looks like option java_package = ... mostly overrides the package declaration...

It turns out Java uses the file name in much the same way that Ruby uses the package declaration; I was using the package declaration (and making it the same as the file name) to get the generated Ruby code to have more or less the same structure as Java.

localshred commented 11 years ago

tl;dr We should use ActiveSupport#camelize style camelization as our standard and fail when multiple separate messages compile to the same class name.

I agree we need to come up with a standard. In the ruby community the standard from my perspective would be the ActiveSupport camelize and underscore methods available on String instances. Yes it is from Rails, but there really isn't a more standard library most ruby developers are familiar with. I did a quick and dirty irb session to map out source strings and their camelized equivalents and got this table:

+-----------------+------------------+
| source          | camelized        |
+-----------------+------------------+
| simple_mail     | SimpleMail       |
| simple_Mail     | SimpleMail       |
| Simple_Mail     | SimpleMail       |
| simpleMail      | SimpleMail       |
| SimpleMail      | SimpleMail       |
| SimpleMAIL      | SimpleMAIL       |
| SIMPLE_MAIL     | SIMPLEMail       |
| SIMPLEMAIL      | SIMPLEMAIL       |
| simple_mail.foo | SimpleMail::Foo  |
+-----------------+------------------+

Simply adding the continue back on underscore chars, we would support all but the SIMPLEMail example. Weirdly the regex they are using in activesupport behaves that way.

Now, at this point you probably see the problem we have just uncovered. It existed all along, but working on this I realized the issue. You'll notice that the first 5 entries in the table all have different source tokens, but the ruby equivalent camelized class name is the same.

This presents us with a conundrum of sorts. We can either a) Fail when a camelized class name has already been taken by a class in that namespace; or b) Ensure first characters of message names are upcase to conform to ruby syntax for class constants and leave the rest of the token alone.

On the one hand, the compiler/generator obviously allows each of the source cases listed above since in c++ those tokens are entirely valid within that namespace, none of them step on each others toes.

On the other hand, by enforcing the ruby way of class naming with CamelCase (say by raising a compiler error when more than one message compiles with the same ruby class name), we potentially wreck a developers ability to name things as they would like. This reason is actually why in #56 I was ok with allowing class names to contain variables.

I feel like common sense will have to rule over flexibility here. Yes the cpp compiler allows you to define a simple_mail message and a Simple_Mail message, but you should never do that, it would be too confusing. Thus having the compiler enforce this message overlap issue by raising an error is acceptable to me.

Sorry for the brain dump, but you obviously got me thinking. Do you have any thoughts or is this proposal acceptable to you?

liveh2o commented 11 years ago

My two cents: I think it's very reasonable that for the purposes of compiling to Ruby, case in package names shouldn't matter. Most people will pick one of the styles you listed and stick with it, so I think failing when a class name has already been taken is a sane approach.

nolanamy commented 11 years ago

I'm tempted vote for leaving the token entirely alone and raising a compiler error if the first character is lowercase.

requires less work on our part (no need to search for name collisions)
simpler for the developer
- error just tells them that ruby doesn't support lowercase class/message names
- ruby tokens are identical to the .proto tokens (no mysterious name changes)
closest to what we're currently doing (requiring fewer changes to developers' ruby code)
- when we change how we massage tokens, we'll break developers' existing ruby code (that's how I ended up submitting this issue!)
any proto source that works with the current generator and doesn't raise ruby syntax errors will still work fine
- with camelize, some protos that work now will stop working when poorly-named messages collide

The danger in this approach is that existing .proto files (with lowercase names) that work with Java/C++ will require changes to work with the ruby protobuf generator; keep in mind, though, that they've never worked in ruby before.

The only way to avoid having to change proto files to work with ruby is to massage proto tokens one way or another, at the very least capitalizing the first letter. There are two main problems with changing tokens in the generator:

Breaking proto code (due to token collisions)
- proto files that used to compile will stop working if their names collide
- the more we massage tokens, the more likely these collisions are (see @localshred 's camelize table above)
Breaking ruby code (due to token changes)
- code that works with the current generated code will break if tokens start getting changed (using camelize would break @dekz 's ruby code re. #56)
- like collisions, the more we massage, the more likely we'll break someone's code (just capitalizing the first letter won't break the #56 code)

So, in my opinion, the less we change tokens, the better. If we do it at all, we should only capitalize the first letter. But I think leaving tokens entirely alone is the best solution. Minimal broken ruby code, and the only proto files that don't work never worked anyway.

localshred commented 11 years ago

Thanks @nolanamy and @liveh2o for your suggestions, it's nice to have a few minds thinking through the implications.

I feel like the solution proposed by @nolanamy is the right way to go, that is, no token modification whatsoever. This obviously can lead to un-runnable ruby code if message/enum/service names do not have a capital first letter. I think in this case I'll simply raise an error. The compiler won't change any tokens in your generated ruby code, but it won't let you shoot yourself in the face either by generating bad code. It will guide you to runnable ruby definitions.

There is another artifact of token changing in the current compiler: we underscore any rpc method names. So if you defined rpc SomeMethod (Foo) returns (Bar);, the method would end up being compiled to some_method. Indeed, the server/service code in the gem also enforces that the method name is underscored when a request is received. Going with the policy of zero token manipulation, I would remove this change as well (in both the compiler and runtime server code).

And finally, @nolanamy you are correct that whatever change we make will break code. As such, my preference is not to release this with the 2.7 branch, but wait to make this change when we can release version 3 (which is coming). That way we can appropriately message gem users and an expectation of possible breakage will be understood.

So, to recap:

No token manipulation, period.
Compiler raises an error if message, service, or enum names begin with lower case.
Pushed with version 3.

Everyone cool with this?

nolanamy commented 11 years ago

Sounds pretty good to me.

The only thing that irks me is that RPC method names are supposed to be CamelCased in proto code but ought to be snake_cased in ruby code. For everything else - Messages, Fields, and Enums - the protobuf style guide specifies a ruby-friendly style:

Proto type	Protobuf style	ruby style	ruby type
Message	CamelCase	CamelCase	class
Field	snake_case	snake_case	variable
Enum types	CamelCase	CamelCase	class
Enum values	SCREAMING_SNAKE	SCREAMING_SNAKE	constant
RPC method	CamelCase	snake_case	method

liveh2o commented 11 years ago

:+1:

localshred commented 11 years ago

Yes @nolanamy I agree about the annoyance that their style guide is in conflict with ruby's, but I'd rather opt for not modifying tokens at all. Let people define things how they will. If they are rubyists they will probably define the methods as snake case anyways.

nolanamy commented 11 years ago

:ok_hand:

localshred commented 11 years ago

Resolved under > 2.8.x.

ruby-protobuf / protobuf

Generated module names contain spurious (er, superfluous) underscores #81