microsoft / bond

Bond is a cross-platform framework for working with schematized data. It supports cross-language de/serialization and powerful generic mechanisms for efficiently manipulating data. Bond is broadly used at Microsoft in high scale services.
MIT License

Runtime format validation of a serialized object (using tagged protocol) #973

Open sofusmortensen opened 5 years ago

sofusmortensen commented 5 years ago

I would like to be able to validate the format of a serialized object at runtime using the RuntimeSchema instead of compile-time information. I am thinking it would be a dry run of the deserializer, but using only the schema.

As far as I can tell it can't be done at present, and it even seems the deserializer is ignorant of the supplied schema (at least that's the case when using the tagged format where it does sort-of make sense).

Any good ideas for how to approach this?

sapek commented 5 years ago

Runtime schema is not ignored. You can write a transform for your validation and apply it to the payload using the runtime schema. In the transform you will get Field, OmittedField and UnknownField calls for fields that are, respectively, present in both payload and schema, present in schema but not in payload, and present in payload but not in schema.
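To make the three kinds of calls concrete, here is a minimal, self-contained C++ sketch. It does not use Bond's API: `ToySchema`, `ToyPayload`, `ValidatingTransform` and `Apply` are hypothetical stand-ins that only borrow the callback names mentioned above, and the real Bond signatures differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins, NOT Bond types: a schema maps field ids to field names,
// and a payload maps field ids to already-decoded values.
using ToySchema  = std::map<std::uint16_t, std::string>;
using ToyPayload = std::map<std::uint16_t, std::string>;

// A validating transform: it builds nothing, it only records what it sees.
struct ValidatingTransform {
    std::size_t matched = 0;              // fields in both schema and payload
    std::vector<std::string> omitted;     // names in schema but not in payload
    std::vector<std::uint16_t> unknown;   // ids in payload but not in schema

    void Field(std::uint16_t, const std::string&, const std::string&) { ++matched; }
    void OmittedField(std::uint16_t, const std::string& name) { omitted.push_back(name); }
    void UnknownField(std::uint16_t id, const std::string&) { unknown.push_back(id); }

    bool Valid() const { return omitted.empty() && unknown.empty(); }
};

// Hypothetical driver that issues the three kinds of calls.
template <typename Transform>
void Apply(const ToySchema& schema, const ToyPayload& payload, Transform& transform) {
    for (const auto& entry : schema) {
        const auto found = payload.find(entry.first);
        if (found != payload.end())
            transform.Field(entry.first, entry.second, found->second);
        else
            transform.OmittedField(entry.first, entry.second);
    }
    for (const auto& entry : payload) {
        if (schema.find(entry.first) == schema.end())
            transform.UnknownField(entry.first, entry.second);
    }
}

// Usage: field 2 is missing from the payload and field 7 is not in the schema.
inline void ExampleClassification() {
    const ToySchema schema{{0, "id"}, {1, "name"}, {2, "email"}};
    const ToyPayload payload{{0, "42"}, {1, "Ada"}, {7, "extra"}};
    ValidatingTransform transform;
    Apply(schema, payload, transform);
    assert(transform.matched == 2);
    assert(transform.omitted.size() == 1 && transform.omitted[0] == "email");
    assert(transform.unknown.size() == 1 && transform.unknown[0] == 7);
    assert(!transform.Valid());
}
```

In this sketch `Valid()` treats every omitted or unknown field as an error, which is a deliberately strict choice for illustration. A real validator would likely tolerate omitted optional fields and decide separately how to handle unknown ones.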

Having said that, depending on what you want to achieve with validation, there may be a simpler solution. E.g. by using required fields you can get the deserializer to automatically validate that those fields are present in the payload.

sofusmortensen commented 5 years ago

@sapek, thanks.

I am not quite sure that helps in my case. I assume transform corresponds to Transcoder in the C# version? I must say I haven't looked at the C++ API at all yet.

In my use case, I want to write some server-side code that can validate the form of a serialized message using only the RuntimeSchema, i.e. the type is unknown at compile time, and the RuntimeSchema will be persisted in SQL Server or similar.

I believe you can do this in the Java library for Protocol Buffers using DynamicMessage, which additionally gives dynamic access to the actual content.

sapek commented 5 years ago

C# has the corresponding DeserializerTransform and SerializerTransform; however, they are currently marked internal. Additionally, transforms in C# operate at the expression tree level, which makes them very powerful/flexible, but also non-trivial to use.

One simple option for your scenario might be to transcode the message to JSON and perform your validation on the JSON.

sofusmortensen commented 5 years ago

Yeah - I see now I could transcode to JSON, and work my way through there manually using the RuntimeSchema. I'd expect that to be painfully slow.

Hmm. DeserializerTransform looks very interesting. I assume the philosophy is to build up the expression tree to do the work and compile it just once. I am concerned/confused here in Generate about the need for both an IParser (derived from the RuntimeSchema) and the Type. In my case I would have the Schema but no Type.

sapek commented 5 years ago

In your case you would probably write something like SerializerTransform. The concept is similar to what I've described for the C++ API: you'd generate expressions for fields, omitted fields and unknown fields.

sofusmortensen commented 5 years ago

Cool - I am all up for doing that.

But why use SerializerTransform and not DeserializerTransform as the basis? I would assume the deserialization process is more kindred.

sapek commented 5 years ago

If you squint, all operations in Bond are transformations of one materialization of a schema into another materialization of a (possibly different) schema. There are two high-level concepts that all such operations are composed of:

A parser decomposes a materialization of a schema, producing a "stream of tokens", and a transform consumes that stream. All built-in transforms produce another materialization of a schema, but nothing in principle prevents a transform from reducing the stream into, let's say, a bool value.
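The split between a parser and a transform can be sketched in a few lines of self-contained C++. Everything here is an assumption made for illustration: the wire format, `ToyType`, `Token`, `Parse`, `CountingTransform` and `TypeCheckTransform` are hypothetical and are not Bond's encoding or API. The point is only that one parser can feed different transforms, including one that reduces the stream to a bool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Toy wire format (an assumption for this sketch, not Bond's encoding):
// a sequence of [field id][type tag][length][length bytes of value].
enum class ToyType : std::uint8_t { Int = 1, String = 2 };

struct Token {
    std::uint8_t id;
    ToyType type;
    std::vector<std::uint8_t> value;
};

// "Parser": decomposes the bytes into tokens and feeds them to any transform.
// Returns false if the payload is truncated.
template <typename Transform>
bool Parse(const std::vector<std::uint8_t>& bytes, Transform& transform) {
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        if (bytes.size() - pos < 3) return false;      // incomplete header
        const std::uint8_t id = bytes[pos];
        const ToyType type = static_cast<ToyType>(bytes[pos + 1]);
        const std::size_t len = bytes[pos + 2];
        pos += 3;
        if (bytes.size() - pos < len) return false;    // incomplete value
        const std::uint8_t* first = bytes.data() + pos;
        transform.OnToken(Token{id, type, std::vector<std::uint8_t>(first, first + len)});
        pos += len;
    }
    return true;
}

// Transform 1: reduces the token stream to a count.
struct CountingTransform {
    std::size_t count = 0;
    void OnToken(const Token&) { ++count; }
};

// Transform 2: reduces the token stream to a single bool by checking each
// token's type tag against the type expected for its field id.
struct TypeCheckTransform {
    std::map<std::uint8_t, ToyType> expected;
    bool ok = true;
    void OnToken(const Token& token) {
        const auto found = expected.find(token.id);
        if (found == expected.end() || found->second != token.type) ok = false;
    }
};

// Usage: the same parser drives two unrelated transforms.
inline void ExampleStream() {
    // Field 1: Int, one byte (42). Field 2: String, two bytes ("hi").
    const std::vector<std::uint8_t> bytes{1, 1, 1, 42, 2, 2, 2, 0x68, 0x69};

    CountingTransform counter;
    assert(Parse(bytes, counter));
    assert(counter.count == 2);

    TypeCheckTransform check;
    check.expected[1] = ToyType::Int;
    check.expected[2] = ToyType::String;
    assert(Parse(bytes, check));
    assert(check.ok);
}
```

Because `Parse` is a template over the transform, adding a new reduction (a validator, a counter, a pretty-printer) does not require touching the parsing code.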

Some examples of concrete manifestations of these concepts that are included in Bond:

There are also parsers for untagged protocols, JSON and XML that can be used in similar combinations.

If you look at DeserializerTransform you will notice that the majority of the code deals with instantiating the object that is being deserialized and its fields. None of that applies in your case. I think you would use the TaggedParser with a runtime schema and write a relatively simple transform to handle omitted and unknown fields appropriately. SerializerTransform is the simplest built-in transform, which is why I suggested using it as an example.

sofusmortensen commented 5 years ago

Thanks @sapek

I have just managed to get an early version working. Took quite some time wrapping my head around the whole expression trees / transformer stuff, and there are still a few open ends and much to test, but I am totally confident now I can make it work.

I did it all outside Bond, which was pretty hard because there are a few internal helper classes (like Transform and Reflection) that I needed access to and had to cherry-pick from instead. The code is 300 lines or so.

I am going to tidy this up for my own project, but would like to evolve it into a PR for Bond as well. Is there interest in this?

sapek commented 5 years ago

If you could refactor your code on top of Bond's transform infrastructure, then it might be a great example to pull into Bond. There is another issue, #970, asking about making some of the transform APIs public...

sofusmortensen commented 5 years ago

I believe runtime validation would be a valuable addition to Bond itself. But making more of the transform API public would slim down my code considerably.

BTW, similar to Protocol Buffers' DynamicMessage, I think it would be pretty useful to be able to deserialize to Dynamic.ExpandoObject (given a schema).

sapek commented 5 years ago

There are several reasons why I think schema validation would be better as an example:

Deserializing into Dynamic.ExpandoObject, however, would be a fantastic addition to C# Bond itself.

sofusmortensen commented 5 years ago

@sapek ahh of course - I am forgetting about the other client languages. Completely reasonable.

I know what to do then.

sofusmortensen commented 5 years ago

@sapek I have opened a PR with an example doing runtime schema validation, which also involves making a lot of internal classes public.

https://github.com/microsoft/bond/pull/974