Very excited about this, looking forward to it @dcodeIO!
I hope there is a high-performance toJSON method for protos. Currently, we do toRaw and fix up the Longs; toRaw is almost all our profiles show. If performance is improved with the rewrite, that will be an added plus!
I haven't yet touched toJSON on message instances at all and I also hate the current one. Looking forward to your ideas to get it right this time!
In case you haven't read it yet: https://www.reaktor.com/blog/javascript-performance-fundamentals-make-bluebird-fast/ Since this is a rewrite, it would be great to get some of these fundamentals in there.
You can now have a look for yourself: https://github.com/dcodeIO/protobuf.js It's still pretty unfinished, but it works for the most simple test cases. If you see something stupid, feel free to let me know. If you want something implemented in a specific way, the same.
It looks much more readable. I haven't dug through it in detail yet, but I generally read through most of the implementations of types and the encoding/decoding portion. The reflection part doesn't look that much different with $type. What I want to check is the generated files, which I haven't yet had a chance to look at. The goal should be that any code that avoids using reflection runs very fast. Maybe some performance tests will help.
There is no code generation, yet. What is there, currently, is a way to properly extend the runtime message Prototype
, which by default creates reflection-based encode/decode methods on the constructor.
The idea here is that it should be possible to replace even that with a custom implementation, specifically built for the type of message you are using, through code generation. How this will look is still tbd.
Also, there are multiple possible ways of doing this: you could just wrap your custom class around reflection, either manually or via generated code, or generate code that uses protobuf.Writer etc. directly, which is what you are referring to. You are no longer forced to do it in a specific way.
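For illustration, code generated directly against the new Writer might look roughly like the sketch below; the exact method names and the tag encoding are assumptions on my part, since that interface is still in flux:

var protobuf = require("protobufjs");

// Hypothetical generated encoder for a message with a single field:
//   string name = 1;
// The tag combines the field number and wire type: (1 << 3 | 2) for a string.
function encodeAwesomeMessage(message, writer) {
    writer = writer || new protobuf.Writer();
    if (message.name !== undefined)
        writer.uint32((1 << 3 | 2) >>> 0).string(message.name);
    return writer.finish();
}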
All of this sounds good. One more wish-list item I want to throw into the ring is to create the generated classes with an immutable + builder interface, so that there are no setters on the decoded objects and a builder pattern that always copies. You might want other users of the library to weigh in on this. This wish is based on some production bugs I have seen in the past (silly mistakes) that could have been loud failures.
I suspect (without proof, based on second-hand information) that JavaScript engines can detect such immutable objects and convert them to C structs. Could be a worthwhile performance improvement.
I see. At the moment, there are no setters and no methods on runtime message instances. These just extend Prototype and have some convenience methods on the constructor. I wanted that as the default when working with reflection, and it's up to the user to decorate their message instances. See: https://github.com/dcodeIO/protobuf.js/blob/master/src/prototype.js
It does, however, use references and does not copy by default in its current state. See the constructor. You are not forced to call that constructor, though, and can initialize message instances another way.
Can we do this in TypeScript, pretty please, with proper modularity?
The library itself is plain JS, but first class typescript support (.d.ts for the library and codegen for .ts) is definitely something I'd like to have.
If you're rewriting it, it's far easier to do it in TypeScript and publish it as a compiled library with auto-generated .d.ts.
You might have missed this: https://github.com/dcodeIO/protobuf.js/tree/master
I understand your desire for a full TypeScript implementation, and I thought about this too, but because of the amount of low-level stuff going on I decided against it.
Noticed that @englercj already made a d.ts generator on top of jsdoc. First bits here: https://github.com/dcodeIO/protobuf.js/tree/master/types Still some issues with multi-dimensional arrays and generic promises, but it looks good for a start.
Edit: Added a simple post-process step for now that makes it a valid .d.ts.
I know your pain from trying to do things like arbitrary class constructors in TypeScript in the past. In the end the code is the same; my suggestion of TypeScript was mostly about development ease, but it looks like you're doing just fine with raw JS.
I like the .d.ts, looks good. Looking forward to testing out the new code soon.
One possible consideration here is first class gRPC service support. Having the proper hooks to support use cases like https://github.com/grpc-ecosystem/grpc-gateway and/or https://github.com/tmc/grpc-websocket-proxy would be nice.
With regard to code generation, is your intent to generate typed clients given a defined proto service? This seems like it would be quite useful. I have some basic work in generating js client code here: https://github.com/tmc/grpcutils (basic elm and flow type generation).
@tmc also note http://GitHub.com/paralin/grpc-bus
If we can do anything to simplify how they use the library, yeah, then we should probably do it. But we should make sure that we don't add functionality to core that actually should be a library on its own.
Regarding code generation: What I am currently trying to do is to design a minimum common interface that could then be used to generate entire projects of JS or TypeScript classes. This can be done in a couple of different ways of course, like either just wrapping reflection or actually emitting message-specific encoders and decoders on top of the new reader and writer interface.
Today, I have both good and bad news: I've just added getters and setters to the runtime message prototype, as this seems to be the only working solution for handling oneof fields properly. That's because any field that is part of a oneof has side effects on the other fields of that oneof. The bad news is that, to make this work, I had to introduce a hidden property $values on the prototype that initially holds just the default values but is then used to store and retrieve field values. Explanation: when there are just getters and setters on the prototype, there is no way to override them on the instance with a simple value; it's always the setter being called, no matter what. Hence, there is no way around a hidden property without using ES6 proxies. But there is also good news: this allows us to introduce things that weren't possible before, like checking whether a field has been manually set or not. See for yourself: https://github.com/dcodeIO/protobuf.js/blob/master/src/prototype.js
https://github.com/dcodeIO/protobuf.js/blob/master/src/index.js
Example: https://github.com/dcodeIO/protobuf.js/blob/master/scripts/sandbox.js#L49
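To illustrate the mechanics described above, here is a rough sketch of a getter/setter pair over the hidden $values store; the helper and the field names are made up for illustration, only $values is taken from the description:

// Illustrative only: define a getter/setter pair on the prototype that reads
// from and writes to the hidden $values store, clearing the other members of
// the same oneof when a value is set.
function defineOneofMember(prototype, field, oneofFields) {
    Object.defineProperty(prototype, field, {
        get: function() {
            return this.$values[field];
        },
        set: function(value) {
            var values = this.$values;
            oneofFields.forEach(function(other) {
                if (other !== field)
                    values[other] = undefined; // side effect on the other oneof fields
            });
            values[field] = value;
        }
    });
}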
I would vote for generating message-specific encoders and decoders on top of the new reader and writer interface (if there is proof that the performance is there). So before going too deep, maybe a prototype and benchmarks would help. From a code-size point of view, the post-gzip size of generated code versus code that wraps reflection may not be that different.
I too am of the opinion that this lib should focus on code-gen and not so much on reflection. Ideally it would work just like protoc (or a plugin for protoc), in that I get generated code that does all the things necessary. Even better would be support for the different optimize_for modes:
SPEED (default): The protocol buffer compiler will generate code for serializing, parsing, and performing other common operations on your message types. This code is extremely highly optimized.
CODE_SIZE: The protocol buffer compiler will generate minimal classes and will rely on shared, reflection-based code to implement serialization, parsing, and various other operations. The generated code will thus be much smaller than with SPEED, but operations will be slower. Classes will still implement exactly the same public API as they do in SPEED mode. This mode is most useful in apps that contain a very large number of .proto files and do not need all of them to be blindingly fast.
I generally want SPEED, not CODE_SIZE, but that isn't an option currently.
My vision for SPEED was, until now, to emit code that relies solely on the Reader/Writer interface. Code like that could exclude pretty much everything else. However, I'm not sure that is low-level enough anymore, as, without digging deeper, I don't see a relevant speed boost compared to reflection, which is well optimized now. I imagine, though, that it is quite hard to get code generation for all sorts of possible cases right without a Reader/Writer class that is properly optimizable. For example, you don't want to feature-check all the time within your generated code. What the Reader/Writer interface currently does is set itself up according to the environment exactly once, eliminating any conditionals, and from then on just roll.
Something like:
var codegen = require("protobufjs-codegen"),
Reader = codegen.Reader,
Writer = codegen.Writer;
function AwesomeMessage(properties) {
// Generated initializer specific to this message
}
AwesomeMessage.encode = function(instance) {
// Generated encoder specific to this message
return buffer;
};
AwesomeMessage.decode = function(buffer) {
// Generated decoder specific to this message
return instance;
};
But as I said, that should hardly be faster than a reflection-based encoder/decoder once the JIT is warmed up. It eliminates a couple of function calls, which is relevant in a native environment, but within JS there are other, more important bottlenecks that cannot be circumvented most of the time. If there is a relevant speed boost to be had, then it comes from skipping even the reader/writer interface and intentionally not supporting specific environments.
As a reference, this is how a type encodes/decodes currently: https://github.com/dcodeIO/protobuf.js/blob/master/src/type.js#L325 https://github.com/dcodeIO/protobuf.js/blob/master/src/type.js#L361
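To picture the "set itself up exactly once" point from above, here is a purely illustrative pattern (not the actual reader/writer code):

// The environment check runs once at module load time; the hot path then
// calls the selected implementation without any per-call conditionals.
var writeBytes = typeof Buffer !== "undefined" && typeof Buffer.prototype.copy === "function"
    ? function writeBytes_buffer(target, pos, value) { value.copy(target, pos); }     // node Buffers
    : function writeBytes_typedarray(target, pos, value) { target.set(value, pos); }; // plain typed arrays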
I am currently getting an issue with the following:
syntax = "proto3";
import "google/protobuf/descriptor.proto";
//-------------------------------------------------------------------------------
extend google.protobuf.MessageOptions
{
// The string name of the message to extend
string extends = 21000;
// When set the message is only available on the server
// (probably used for server-to-server communication)
boolean is_internal = 21001;
}
The error is:
Error: illegal name '=' (line 10)
Are extensions not ready yet? I use custom options for messages, services, and methods. I also tried to use options without first defining them in an extension and got a different error:
Error: illegal token 'extends' (')' expected, line 35)
Any idea when options/extensions will make it in?
There has been an issue with parsing extend fields with no rule. This should be fixed now.
Side note: Added code generation directly into reflection. Nearly as fast as native JSON for a simple test case identical with that of the smart guy with the idea.
I now also added a graceful fallback if the environment throws on code generation for whatever reason. This adds some size to the library, but it's surely worth it. Still todo: MapField#encode/decode
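Such a fallback could look roughly like the sketch below; generateDecoderSource and decode_ are hypothetical stand-ins, not the library's actual functions:

// Illustrative only: try to compile a message-specific decoder, and fall back
// to the generic reflection-based path if code generation is not allowed
// (e.g. under a restrictive Content Security Policy).
function generateDecoderSource(type) {
    // hypothetical stand-in that would return the decoder body as a string
    return "return message;";
}

function makeDecoder(type) {
    try {
        return new Function("reader", "message", generateDecoderSource(type)); // may throw
    } catch (e) {
        return function(reader, message) {
            return type.decode_(reader, message); // generic reflection-based decode
        };
    }
}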
Now, the reasons that it's still slower than JSON probably are that there is still some checking going on and that we have the reader/writer abstraction. Ideas?
Still getting an error, though a different one now.
TypeError: object must be one of Enum, Type, Service, Namespace
This is thrown from here:
NamespacePrototype.add = function add(object) {
if (!object || nestedTypes.indexOf(object.constructor) < 0)
throw util._TypeError("object", nestedError);
object
is a Field
and the namespace is Root
. I get the error with the same proto as above.
Ah, currently it doesn't accept fields declared within namespaces, which is valid for extension fields I assume. Will fix that soon.
Edit: Should be working now.
Here is the profile for the bench.js run as NODE_ENV=production node --prof scripts/bench.js 500000 true. It just means the top functions are already optimized (lazily). All I can think of is that we may be able to play with the code to optimize allocation/concatenation.
12 656 4.3% 4.3% LazyCompile: *Uint8ArrayConstructByArrayBuffer native typedarray.js:144:42
13 424 2.8% 2.8% LazyCompile: *decode_internal /Users/robinanil/work/protobuf.js/src/type.js:431:49
14 384 2.5% 2.5% LazyCompile: *Buffer.concat buffer.js:288:25
15 348 2.3% 2.3% LazyCompile: *fork /Users/robinanil/work/protobuf.js/src/writer.js:327:37
16 285 1.9% 1.9% LazyCompile: *Uint8Array native typedarray.js:241:31
17 278 1.8% 1.8% Stub: CEntryStub
18 269 1.8% 1.8% Builtin: JSConstructStubGeneric
19 228 1.5% 1.5% Builtin: Call_ReceiverIsNotNullOrUndefined
20 226 1.5% 1.5% Builtin: CallFunction_ReceiverIsNotNullOrUndefined
21 225 1.5% 1.5% LazyCompile: *Buffer.allocUnsafe buffer.js:135:30
22 206 1.4% 1.4% Stub: FastNewStrictArgumentsStub
23 200 1.3% 1.3% Stub: FastNewObjectStub
24 200 1.3% 1.3% Builtin: ArgumentsAdaptorTrampoline
25 192 1.3% 1.3% Builtin: ReflectConstruct
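One general way to attack the allocation/concatenation cost, sketched here only to illustrate the idea (this is not the library's actual writer code):

// Sketch: instead of producing many intermediate buffers and concatenating
// them at the end, record write operations with their byte lengths, allocate
// the destination once, and apply the operations into it.
function writeAll(ops) { // ops: array of { len: number, fn: function(buf, pos) }
    var total = 0;
    ops.forEach(function(op) { total += op.len; });
    var buf = Buffer.allocUnsafe(total); // single allocation, no Buffer.concat
    var pos = 0;
    ops.forEach(function(op) {
        op.fn(buf, pos);
        pos += op.len;
    });
    return buf;
}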
Profiled a bit and optimized writer forking. Also added codegen for Type#decode. Run the benchmark again, guys.
Now working on single-function decoders. That should speed things up even more and could be reused for codegen of custom classes.
function _Package$decode($resolvedTypes, $toHash, reader, message, limit) {
"use strict";
while (reader.pos < limit) {
var tag = reader.tag();
switch (tag.id) {
case 1:
message.$values["name"] = reader.string();
break;
case 2:
message.$values["version"] = reader.string();
break;
case 3:
message.$values["description"] = reader.string();
break;
case 4:
message.$values["author"] = reader.string();
break;
case 5:
message.$values["license"] = reader.string();
break;
case 6:
var type = $resolvedTypes[5];
message.$values["repository"] = type.decodeDelimited_(reader, type._constructor ? new type._constructor() : Object.create(type.prototype));
break;
case 7:
message.$values["bugs"] = reader.string();
break;
case 8:
message.$values["main"] = reader.string();
break;
case 9:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.string();
else
throw Error('illegal wire format for .Package.bin');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["bin"] = map;
break;
case 10:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.string();
else
throw Error('illegal wire format for .Package.scripts');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["scripts"] = map;
break;
case 11:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.string();
else
throw Error('illegal wire format for .Package.dependencies');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["dependencies"] = map;
break;
case 12:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.string();
else
throw Error('illegal wire format for .Package.optionalDependencies');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["optionalDependencies"] = map;
break;
case 13:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.string();
else
throw Error('illegal wire format for .Package.devDependencies');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["devDependencies"] = map;
break;
case 14:
var length = reader.uint32(), map = {};
if (length) {
length += reader.pos;
var keys = [], values = [], ki = 0, vi = 0;
while (reader.pos < length) {
tag = reader.tag();
if (tag.id === 1)
keys[ki++] = reader.string();
else if (tag.id === 2)
values[vi++] = reader.bool();
else
throw Error('illegal wire format for .Package.browser');
}
var key;
for (ki = 0; ki < vi; ++ki)
map[typeof (key = keys[ki]) === 'object' ? $toHash(key) : key] = values[ki];
}
message.$values["browser"] = map;
break;
case 15:
message.$values["types"] = reader.string();
break;
default:
reader.skipType(tag.wireType);
break;
}
}
return message;
}
Woot! The benchmarks show about a 2x speedup for decode and about a 1.25x speedup for encode over native JSON. That's glorious! Once the library is more feature-complete with extensions and oneofs, I have some large, deeply nested binary-serialized data from our production setup which I would love to benchmark.
Theoretically, oneofs and extensions are already in, but there are not enough test cases yet to make sure it's working.
In current head (45d3e73b03d51ac01a798556f79035cf98f2fd34) I am getting this error using extensions:
Error: duplicate name 'some_option' in Root
Using this protobuf:
syntax = "proto3";
import "google/protobuf/descriptor.proto";
//-------------------------------------------------------------------------------
extend google.protobuf.MessageOptions
{
boolean some_option = 21000;
}
extend google.protobuf.MethodOptions
{
boolean some_option = 21000;
}
extend google.protobuf.ServiceOptions
{
boolean some_option = 21000;
}
//-------------------------------------------------------------------------------
message Something {
option (some_option) = true;
}
service MyService {
option (some_option) = true;
rpc update (Something) returns (Something) {
option (some_option) = false;
}
}
Extension fields seem to be going on Root, but I do have an extension field with the same name on different options.
As far as I know every extension field is referenced by its unique name within the namespace it is declared in. Are you sure that this is valid?
Interesting, seems like protoc also throws for the duplicate extensions. I could've sworn that I've done this before, though (I definitely did with v5). Thanks!
Well now that I changed that, it is giving me an error importing google/protobuf/descriptor.proto
. If I comment out that line, the extensions proto compiles, but I get errors when I use a custom option.
This compiles:
syntax = "proto3";
// import "google/protobuf/descriptor.proto";
//-------------------------------------------------------------------------------
extend google.protobuf.MessageOptions
{
bool some_option1 = 21000;
}
extend google.protobuf.MethodOptions
{
bool some_option2 = 21000;
}
extend google.protobuf.ServiceOptions
{
bool some_option3 = 21000;
}
But this does not:
syntax = "proto3";
import "google/protobuf/descriptor.proto";
//-------------------------------------------------------------------------------
extend google.protobuf.MessageOptions
{
bool some_option1 = 21000;
}
extend google.protobuf.MethodOptions
{
bool some_option2 = 21000;
}
extend google.protobuf.ServiceOptions
{
bool some_option3 = 21000;
}
With error:
Error: ENOENT: no such file or directory, open 'C:\Users\Chad**\test.proto\google\protobuf\descriptor.proto' at Error (native)
Additionally, this does not compile:
syntax = "proto3";
// import "google/protobuf/descriptor.proto";
//-------------------------------------------------------------------------------
extend google.protobuf.MessageOptions
{
bool some_option1 = 21000;
}
extend google.protobuf.MethodOptions
{
bool some_option2 = 21000;
}
extend google.protobuf.ServiceOptions
{
bool some_option3 = 21000;
}
//-------------------------------------------------------------------------------
message Something {
option (some_option1) = true;
}
service MyService {
option (some_option3) = true;
rpc update (Something) returns (Something) {
option (some_option2) = false;
}
}
With error:
Error: illegal token 'some_option1' (')' expected, line 25) at Error (native)
Note that the final proto, with the google import enabled works with protoc.
The google import still requires the descriptor file to be present. The internal types include google.protobuf.Any etc., but not the descriptor. Looks to me like Google is also moving away from the descriptor, as there are now types, wrappers etc. present outside of it, too.
Attempted to fix the parse error with the last commit.
Well, protoc fails without import "google/protobuf/descriptor.proto";
and succeeds with it. Using protoc v3.0.0. So I would expect that I would need it here too.
With latest head I still get an error:
TypeError: object must be one of Enum, Type, Service, Field, Namespace at TypeError (native)
The object is of type Method
. Looks like it is adding the method to the namespace instead of the service?
Yep, should also be fixed now. Regarding descriptor.proto: Just saying that you still need this externally as it is not included (besides Any etc., which is included).
Awesome, it is parsing now. I'll let you know if anything else comes up, thanks!
Looking good guys. I already do a significant Proto -> JS + TypeScript codegen step in all of my stuff, like here: https://github.com/paralin/grpc-bus/blob/master/scripts/gen_proto.bash
I think I will try to adopt v6 in my development projects and see if I can get some kind of TypeScript interface generation working well. I shouldn't have to do much assuming I can still reflect over everything like before.
Alright guys, I finally decided not to go for any hidden properties on message instances. Instead, every message instance now also is a normal object with non-enumerable getters and setters for virtual oneof fields and plain default values on the prototype - plus the usual non-enumerable $type.
This also means that you have to take care when setting fields that are part of a oneof, because if you set multiple, the encoder won't know which one is intended. In such a case, it's now recommended to set the virtual oneof field to the present field's name, which, as a side effect, clears the other fields.
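For example, assuming a hypothetical message type declaring oneof kind { int32 a = 1; string b = 2; } (all of these names are made up for illustration):

var msg = new MyMessage(); // MyMessage stands in for a hypothetical runtime/generated class
msg.a = 1;
msg.b = "two";  // both oneof members are now set - the encoder can't tell which one is intended
msg.kind = "b"; // setting the virtual oneof field clears "a" as a side effect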
Added a better benchmark test case and it turned out as expected. More precisely, it now shows slower encoding but faster decoding than native JSON. If you have any ideas on how to speed up encoding a tiny bit more, let me know!
One more wish-list item: please add a toJSONObject() method that directly converts the proto to a JSON object without needing to encode to JSON and then JSON.parse it.
That works!
+ function bench_protobuf_asjson() {
+ function TestMessage(properties) {
+ protobuf.Prototype.call(this, properties);
+ }
+ protobuf.inherits(TestMessage, Test);
+ var testPb = new TestMessage(data);
+
+ var start = Date.now();
+ for (var i = 0; i < times; ++i) {
+ var json = testPb.asJSON({ long: Number, enum: String });
+ }
+ summarize("encode protobuf." + "js asJson(Number, String)", start, '');
+ var start = Date.now();
+ for (var i = 0; i < times; ++i) {
+ var json = testPb.asJSON({ long: Number, enum: Number });
+ }
+ summarize("encode protobuf." + "js asJson(Number, Number)", start, '');
+ console.log();
+ var start = Date.now();
+ for (var i = 0; i < times; ++i) {
+ var json = testPb.asJSON({ long: String, enum: Number });
+ }
+ summarize("encode protobuf." + "js asJson(String, Number)", start, '');
+ console.log();
+ var start = Date.now();
+ for (var i = 0; i < times; ++i) {
+ var json = testPb.asJSON({ long: String, enum: String });
+ }
+ summarize("encode protobuf." + "js asJson(String, String)", start, '');
+ console.log();
+ }
+
encoding/decoding 300000 iterations ...
encode protobuf.js : 1119ms 58200000 bytes
decode protobuf.js : 367ms 58200000 bytes
encode protobuf.js r/w : 1100ms 58200000 bytes
decode protobuf.js r/w : 271ms 58200000 bytes
encode protobuf.js asJson(Number, String) : 208ms bytes
encode protobuf.js asJson(Number, Number) : 197ms bytes
encode protobuf.js asJson(String, Number) : 192ms bytes
encode protobuf.js asJson(String, String) : 197ms bytes
--- warmed up ---
encode protobuf.js : 1019ms 58200000 bytes
decode protobuf.js : 297ms 58200000 bytes
encode protobuf.js r/w : 983ms 58200000 bytes
decode protobuf.js r/w : 268ms 58200000 bytes
encode protobuf.js asJson(Number, String) : 199ms bytes
encode protobuf.js asJson(Number, Number) : 196ms bytes
encode protobuf.js asJson(String, Number) : 198ms bytes
encode protobuf.js asJson(String, String) : 200ms bytes
Note that asJSON
takes String
and Number
(the global functions/types) and not strings as a parameter.
What exactly are you trying to accomplish there again?
Ugh. I was trying 4 different variations (number longs, string longs, number enums, and string enums); will update the benchmark.
Updated the benchmark (above); performance is similar. It just shows that decode + asJSON is faster than JSON.parse, which is a no-brainer for anyone who wishes to keep their code working with JSON while using protobuf for transport.
asJSON basically just makes sure that you get an object that is properly JSON.stringifyable. If you do not intend to stringify it, you can just access the properties on the message without calling asJSON at all. Every message also is an object with all the fields as properties.
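For example, given a decoded message instance (here called message) and the same kind of options object as in the benchmark above:

// Longs become strings and enum values become their string names; the result
// is a plain object that JSON.stringify can handle directly.
var obj = message.asJSON({ long: String, enum: String });
var text = JSON.stringify(obj);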
Yes, that works for many parts of our app that are purely in JS.
The backstory of the trouble we have is:
So to work around these things, we had to keep the redux state in pure JSON and convert the proto before it is set on the redux store.
Please also take this into account if you consider contributing.
Key points:
Element#verifyValue
and similar stuff
Why?
Will create a new branch once I have it working to some extent. Looking forward to reading your ideas, especially on code generation.