user definable mapping: FQNC's --> integer, for serialized stream size-reduction

GoogleCodeExporter commented 8 years ago

Request for additional functionality that allows to define a mapping between 
FQNC's and int's on a per schema basis.

One then could give this mapping along to protostuff for reducing the size of 
the serialized stream.

See this thread for details:
http://groups.google.com/group/protostuff/browse_thread/thread/472b52c56556c207?
hl=en_US

Original issue reported on code.google.com by hpgisler...@gmail.com on 2 Dec 2011 at 7:16

GoogleCodeExporter commented 8 years ago

Hi, 

Please find attached the initial version of the patch, as described in the 
following thread:
https://groups.google.com/forum/?fromgroups#!topic/protostuff/1W2_v3wp4F8

It implements the encoding of FQNCs using numeric ids. This is done for objects 
of types unknown in advance, but also for all maps, collections, arrays and 
enums.

There are two new classes:  IdentifierStrategy and NumIdIdentifierStrategy that 
encapsulate the logic for reading and writing classnames.

The IdentifierStrategy class implements the old behavior. 
The NumIdIdentifierStrategy class implements the new behavior and uses numric 
ids.

A test case can be found in ObjectSchemaWithIdsTest.

To register mappings, one needs to call
 RuntimeSchema.setClassId(classname, id)

The mapping will be only applied, if the following is called:
  IdentifierStrategy.setStrategy(new NumIdIdentifierStrategy());

TODOs:
- Better way to set/initialize the IdentifierStrategy (see the mentioned thread 
for some ideas)
- Put IdentifierStrategy as a field into the Schema to avoid map lookups during 
serialization?
- More efficient encoding of identifiers in case they are not registered by any 
mapping (now we waste 32bits for keeping the 0 value, indicating that an 
identifier follows; and we use and additional field number, because I have not 
found a way to output a string to Output without a field number)
- More efficient encoding of numeric ids (something like variable length 
VarInt32 encoding?)
- Backwards compatibility? Ability to read both old and new encodings from the 
same Input?

Original comment by romixlev on 13 Feb 2012 at 4:27

Attachments:

protostuff.patch

GoogleCodeExporter commented 8 years ago

An updated version of the patch.

It moves all maps and mapping related function into IdentifierStrategy class 
and its sub-classes, which is much cleaner.

It also introduces a new AutoNumIdIdentifierStrategy class, which does not 
require a dedicated registration in advance and does it automatically, when a 
new identifier occurs for the first time.

Original comment by romixlev on 14 Feb 2012 at 10:06

Attachments:

protostuff-v2.patch

GoogleCodeExporter commented 8 years ago

Thanks for the patch!

A few remarks:
1.  Left-curly brackets not *consistently* on a new line)
2.  IdentifierStrategy as abstract class, with DefaultIdentifierStrategy being 
the default impl (writes FQCNs and does not include the concurrenthashmap).
3.  The AutoNumIdIdentifierStrategy is better off using an array list rather 
than a map<integer,string>

default = no maps, num = map<int,string>, autonum = list<string>
We can also introduce "strict", w/c will throw an exception if it encounters an 
unregistered message.  This will avoid bugs, security issues.

"More efficient encoding of numeric ids (something like variable length 
VarInt32 encoding?)" - underneath it already is varint encoded.

Additional TODOs:
- Avoid having to do map lookups (id) during serialization (applies to 
enums/collections/maps/pojos).
- make it possible to register/configure custom collections/maps (with ids as 
well)

I think this is a good head start.

Original comment by david.yu...@gmail.com on 14 Feb 2012 at 3:02

GoogleCodeExporter commented 8 years ago

> 1.  Left-curly brackets not *consistently* on a new line)
Fixed.

> 2.  IdentifierStrategy as abstract class, with DefaultIdentifierStrategy 
being the 
> default impl (writes FQCNs and does not include the concurrenthashmap).

Fixed.

> 3.  The AutoNumIdIdentifierStrategy is better off using an array list rather 
than 
> a map<integer,string>

Fixed.

> default = no maps, num = map<int,string>, autonum = list<string>
Done. I think, we could also use list<String> for num, in principle. In the 
worst case (big id is assigned), we just grow the ArrayList.

> We can also introduce "strict", w/c will throw an exception if it encounters 
an 
> unregistered message.  This will avoid bugs, security issues.

> Additional TODOs:
> - Avoid having to do map lookups (id) during serialization (applies to 
> enums/collections/maps/pojos).

I'm not sure how to do this technically and semantically. We need to put info 
about the id into schemas, I guess. But the problem is: schemas do not even 
know the name of the classes they relate to. Then Enums don't have schemas, as 
you mentioned. And last, but not least, I don't like the tight coupling between 
schemas and identifiers and having only one global identifier numbering. The 
reason is: imagine that we have 2(or more) apps using protostuff and written by 
different people at different time. It could very well be that they use 
inconsistent numeric ids, i.e. everyone introduced his own numbering. Or even 
in case of automatic numbering, it is different due to different class-loading 
order. And now imagine that those apps need to exchange information using 
protostuff encoding. We can either force them to use the same numbering or we 
can try to support different numberings for the schemas. E.g. we use one 
numbering for incoming messages from a given partner and another for outgoing. 
To solve this and also to be able to propagate the IdentifierStrategy, I think 
it could be a good idea to associate it with something that is always 
propagated anyway, e.g. with Inputs or Outputs. Which means that we'll get 
stream specific numbering, which makes a lot of sense. Of course, of no 
explicit IdentifierStrategy is associated with a stream it can take a global 
one as a default.
What do you think? If you like the idea of IdentifierStrategy associated with 
streams, what would be the way to achieve it?

> - make it possible to register/configure custom collections/maps (with ids as 
well)
Ids can be registered for anything as they are not connected to schemas in any 
form currently. Ability to register/configure custom collection/maps is useful, 
but it is a bit beyond of scope of this issue, IMHO, and will require many more 
changes.

> - Put IdentifierStrategy as a field into the Schema to avoid map lookups 
during 
> serialization?

See my considerations above.

> - More efficient encoding of identifiers in case they are not registered by 
any 
> mapping (now we waste 32bits for keeping the 0 value, indicating that an 
> identifier follows; 

> and we use and additional field number, because I have not found a way to 
output a 
> string to Output without a field number)

Any idea how to achieve this (String without field number in front of it?) for 
the NumIdIdentifierStrategy?

Original comment by romixlev on 15 Feb 2012 at 3:33

Attachments:

protostuff-v3.patch

GoogleCodeExporter commented 8 years ago

> - More efficient encoding of identifiers in case they are not registered by 
any 
> mapping (now we waste 32bits for keeping the 0 value, indicating that an 
> identifier follows; 
Not really 32bits as it is varint encoded.  

The approach is to use a different field number to write the int id (rather 
than same as FQCN, 127).

Basically on ser:
If registered, writeInt32(126, id, false);
If not, writeString(127, fqcn, false);

On deser, if 126, registered .. if 127, fqcn, else throw exception.

This way, no extra field is written (wasted).

Original comment by david.yu...@gmail.com on 16 Feb 2012 at 3:55

GoogleCodeExporter commented 8 years ago

> The approach is to use a different field number to write the int id (rather 
than 
> same as FQCN, 127).

> Basically on ser:
> If registered, writeInt32(126, id, false);
> If not, writeString(127, fqcn, false);

> On deser, if 126, registered .. if 127, fqcn, else throw exception.

>This way, no extra field is written (wasted).

I see your point. But it is a bit more complicated. We use IdentifierStrategy 
not only for ID_POJO(127), but also for ID_MAP, ID_COLLECTION, ID_ENUM, etc. 
Therefore, we need to introduce new tags for them as well, if we follow your 
approach. And if we later introduce any new ID_somthing, we'll need to 
introduce an additional code for its numid encoded field number.... I'm not 
100% convinced it is a best possible approach. It couples id encoding with 
fields, their nums and their semantics...
Do you still say we should do it this way or are there any alternatives?

P.S. Any comments on other questions? Especially on relationship between 
strategies and schemas or streams?

Original comment by romixlev on 16 Feb 2012 at 5:34

GoogleCodeExporter commented 8 years ago

Regarding the relationship between strategies and schemas, I think we can put 
an extra id field (int) on the RuntimeSchema to avoid extra lookups.  Still I 
prefer this be implemented together with enums/collections/maps as a unified 
way and not just for pojos.

"And if we later introduce any new ID_somthing, we'll need to introduce an 
additional code for its numid encoded field number.... I'm not 100% convinced 
it is a best possible approach. It couples id encoding with fields, their nums 
and their semantics...
Do you still say we should do it this way or are there any alternatives?"

We'll only implement that for pojos (127='_' fqcn, 126='$' id) since that is 
the most common.
To ensure that there won't be a waste in enums/collections/maps, one can employ 
a StrictNumStrategy (can be a good reminder for a developer to register).

Original comment by david.yu...@gmail.com on 16 Feb 2012 at 6:13

GoogleCodeExporter commented 8 years ago

'$' does not seem to be valid xml name.
Btw, don't you prefer to use StrictNumStrategy? (It can be for security 
purposes as well.)
Others who don't use it are basically saying they can live with FQCNs.

Its basically all or nothing.

Original comment by david.yu...@gmail.com on 16 Feb 2012 at 6:28

GoogleCodeExporter commented 8 years ago

> We'll only implement that for pojos (127='_' fqcn, 126='$' id) since that is 
the 
> most common.
> To ensure that there won't be a waste in enums/collections/maps, one can 
employ a 
> StrictNumStrategy (can be a good reminder for a developer to register). 

OK. I can do it this way.

> Btw, don't you prefer to use StrictNumStrategy? (It can be for security 
purposes 
> as well.) Others who don't use it are basically saying they can live with 
FQCNs.

I like the StrictNumStrategy, which I'll add today ;-) 
My only problem with it is when serilization and deserialization is controlled 
by different parties and uses potentially different numberings, if at all (and 
this is very common for networking applications, where different independently 
developed nodes talk to each other). In those cases StrictNumStrategy does not 
help that much. In many such cases, you actually want the serializing party to 
tell the deserializing party how to deserialize. This was a bit the motivation 
for associating IdentifierStrategy with streams, not with schemas. But for 
those cases, where it is not an issue, I agree that StrictNumStrategy is a good 
default choice.

> '$' does not seem to be valid xml name.
Is it a requirement that field name is a valid xml name? If so, what do you 
propose?

> Regarding the relationship between strategies and schemas, I think we can put 
an 
> extra id field (int) on the RuntimeSchema to avoid extra lookups.  

Yes that could be done. How do you propose to initialize it? When a mapping is 
being registered? I.e. if we register a new mapping and a schema exists 
already, we set this field. Or if schema is created after registration, we 
check during its creation which numid is assigned to the corresponding fqcn? 

BTW, is it guaranteed that there is at most one fqcn that refers to any given 
RuntimeSchema or could they be shared somehow?

> Still I prefer this be implemented together with enums/collections/maps as a 
> unified way and not just for pojos.

I'd like to see it as well. But I don't yet see how to do it in a nice simple 
way.

Original comment by romixlev on 16 Feb 2012 at 7:07

GoogleCodeExporter commented 8 years ago

- Added a strict policy
- Updated the JUnit test
- Implemented this (slightly different): "The approach is to use a different 
field number to write the int id (rather than same as FQCN, 127)"

Original comment by romixlev on 16 Feb 2012 at 1:48

Attachments:

protostuff-v4.patch

GoogleCodeExporter commented 8 years ago

Hi David,

Any comments? How should we proceed?

Original comment by romixlev on 20 Feb 2012 at 9:54

GoogleCodeExporter commented 8 years ago

"> Still I prefer this be implemented together with enums/collections/maps as a 
> unified way and not just for pojos.

I'd like to see it as well. But I don't yet see how to do it in a nice simple 
way."

An interface "HasId" could be added to protostuff-api, which could be 
implemented by the components.  Setting the id could be done after creation.

For enums, EnumIO would implement HasId.
The strategy will need ArrayList<EnumIO> for the id index (deser optimization).

For pojos, HasSchema would implement HasId.
The strategy will need ArrayList<HasSchema> for the id index (deser 
optimization).

For collections/maps, the CollectionSchema.MessageFactory would implement HasId.
As well as MapSchema.MessageFactory.
The strategy will need ArrayList (each) for the id index (deser optimization).

I believe that will provide the best performance as it eliminates the extra 
lookups on ser (string-to-id) and deser (id-to-string).  

Any scenarios that I might have overlooked?

Original comment by david.yu...@gmail.com on 20 Feb 2012 at 2:00

GoogleCodeExporter commented 8 years ago

Hi David,

Thanks for the feedback.

> Any scenarios that I might have overlooked?

Well, I outlined the reasons for not associating ids to schemas directly. But 
obviously, you have a different opinion ;-) 

I work for a company which does a lot with distributed systems coming from 
different vendors. Same encoding/numbering cannot be always ensured. Each 
client may be implemented/operated by a different 3rd party, which may have its 
own numbering of FQCN ids. Some of their classes are not even known to us in 
advance, as they can use our systems to exchange any kind of information, 
including their own classes (think about event buses, data grids and similar 
frameworks). We would usually receive their serialized messages over a network, 
do something with it and eventually forward to a different client (which may 
use a different numbering schema). We have no easy way to enforce a unique 
non-conflicting numbering. Therefore, I suggested that numbering should be 
associated to streams, because this way you associate it with a specific stream 
coming from a specific client and one specific client is hopefully using an 
internally consistent way of serialization for its own messages.

How would you cover such a use-case with your proposed approach? Or is it out 
of scope for protostuff?

Original comment by romixlev on 20 Feb 2012 at 3:00

GoogleCodeExporter commented 8 years ago

How would you cover such a use-case with your proposed approach? Or is it out 
of scope for protostuff?

There is a way but I dunno if it will perform.
We can use a ThreadLocal and the ObjectSchema will use that to check if there 
was a strategy to use for ser and deser.

It simply cannot be done on the stream itself since it has no methods like 
readIdFromInput/etc/etc.

Original comment by david.yu...@gmail.com on 21 Feb 2012 at 3:49

GoogleCodeExporter commented 8 years ago

> There is a way but I dunno if it will perform.
> We can use a ThreadLocal and the ObjectSchema will use that to check if there 
was 
> a strategy to use for ser and deser.

I don't think this will scale...

> It simply cannot be done on the stream itself since it has no methods like 
> readIdFromInput/etc/etc.

I think I have a possible solution.

Let's auto-assign a numeric id to each schema registered locally (this is not 
the id used during serialization!). Let's call id schemaId. We can easily 
set/get it using getSchemaId/setSchemaId using the same approach you described 
before.

And then, in each IdentifierStrategy we have a mapping from SchemaId to 
IdentifierId. The later one is used for serialization/deserialization. Since we 
map int to int, we can use e.g. ArrayList and be very efficient.
We can also have a reverse mapping from IdentifierId to SchemaId.

Then, we add two methods to each stream (Input or Output): 
setIdentifierStrategy/getIdentifierStrategy. They would associate a custom 
strategy (i.e. custom numbering) with a given stream. By default, if no 
strategy is provided, a stream would use the global one.

Now, if we need to serialize a given object O belonging to schema S using a 
stream IO we do:
- get schemaId from schema (simple getter in Schema object)
- get IdentifierStrategy to be used ( using IO.getIdentifierStrategy)
- call IdentifierStrategy.getIdentifierIdForSchemaId(schemaId) - indexed access 
to an ArrayList
- write this identifierId into the stream

And on deserialization we do:
- get IdentifierStrategy to be used ( using IO.getIdentifierStrategy)
- read identifierId from a stream using this strategy
- call IdentifierStrategy.getSchemaIdFromIdentifierId(identifierId) or 
IdentifierStrategy.getSchemaFromIdentifierId(identifierId) depending on what we 
need.

So, in both cases, we never do any expensive lookups using maps. And this 
approach combines your proposal with a flexibility of being able to use 
different encoding with different streams. This is just an idea. I haven't 
implemented it yet.

What do you think about it? Something that I overlooked? Would it work?

Original comment by romixlev on 21 Feb 2012 at 4:31

GoogleCodeExporter commented 8 years ago

"Then, we add two methods to each stream (Input or Output): 
setIdentifierStrategy/getIdentifierStrategy. They would associate a custom 
strategy (i.e. custom numbering) with a given stream. By default, if no 
strategy is provided, a stream would use the global one."

Not a good idea.  The input/output should not be concerned/associated with ids.
They should be as stateless as possible, which makes it easier to add new 
formats.  Plus, vanilla protostuff/protobuf does not even need ids for messages.
Any other ideas?

Original comment by david.yu...@gmail.com on 21 Feb 2012 at 11:39

GoogleCodeExporter commented 8 years ago

> Not a good idea.  The input/output should not be concerned/associated with 
ids.
> They should be as stateless as possible, which makes it easier to add new 
> formats.  
> Plus, vanilla protostuff/protobuf does not even need ids for messages.
> Any other ideas?

1) Introduce an external mapping using IO stream objects(or may be its internal 
stream id, that we could introduce? 
Is it OK to introduce getId/setId as methods to auto-assign/read unique stream 
ids? This would still keep streams rather stateless) 
as a key and mapping it to contexts, which include IdentifierStrategy. 
Depending on the implementation's effectiveness of such a mapping, the look-ups 
could be rather costly or fast.

2) Introduce an optional "context" parameter to each read/write operation 
invoked on schema (and streams?). 
This context can propagate a lot of information around, including  
IdentifierStrategy reference.
Of course, this method is very intrusive and would require a lot of API changes 
and refactoring due to 
introduction of this parameter.

The mentioned ThreadLocal is an option for context propagation, but not a very 
scalable one, IMHO. 
Especially if you have thread pools and thousands of Runnables, each using its 
own IdentifierStrategy (and context).
Then you'll spend a lot of time just setting/reading those ThreadLocal settings.

Do you see any other scalable way to associate different strategies with with 
streams?

Original comment by romixlev on 22 Feb 2012 at 8:39

GoogleCodeExporter commented 8 years ago

One more thinko in addition to mentioned above:

Imagine that we do not change the interfaces for Input/Output. 
Instead we define a new interface HasId. Some of the implementations of 
Input/Output/Schema may implement it and provide their id. Some others - may 
not. 
Every time we need to figure out a strategy specific for a given stream, we 
check if it is an instance of HasId. If so, we take its id and do mapping from 
id to context related to this stream (it can be done efficiently, as it is an 
integer). If this is not a HasId impl, we simply use the global context, i.e. 
the global IdentifierStrategy in our case.

This way, we have minimal changes to interfaces and only slightly enhance 
implementations of existing protostuff IO streams.

Original comment by romixlev on 22 Feb 2012 at 2:55

GoogleCodeExporter commented 8 years ago

I'm not too fond of adding these to Input/Output (its only job should be 
stateless read/write).

The one I suggested earlier would allow for unified impl for ids as well as 
maintain performance.  

You have a special usecase in which it might be difficult to integrate 
unobtrusively.

The goal is performance + unobtrusive.

Original comment by david.yu...@gmail.com on 26 Feb 2012 at 5:21

GoogleCodeExporter commented 8 years ago

Hi David,

> I'm not too fond of adding these to Input/Output (its only job should be 
stateless 
> read/write).

My last proposals WOULD NOT add ANY state to streams. It would add only unique 
immutable stream identifiers (they have them in fact already as internal JVM 
object ids, but these ones are not very handy for mapping using ArrayList). So, 
streams remain as stateless, as they were before. But anyway, let's skip it for 
now, as you don't like it ...

> The one I suggested earlier would allow for unified impl for ids as well as 
> maintain performance. 

I'm trying now to implement your proposal, but face some issues. Probably I 
don't quite understand some bits of it.

> An interface "HasId" could be added to protostuff-api, which could be 
implemented 
> by the components.  Setting the id could be done after creation.

Done.

> For enums, EnumIO would implement HasId.
Done.

> The strategy will need ArrayList<EnumIO> for the id index (deser 
optimization).
Added this mapping.

> For pojos, HasSchema would implement HasId.
Done.

> The strategy will need ArrayList<HasSchema> for the id index (deser 
optimization).
Added this mapping.

> For collections/maps, the CollectionSchema.MessageFactory would implement 
HasId.
> As well as MapSchema.MessageFactory.
Done.

> The strategy will need ArrayList (each) for the id index (deser optimization).
Added this mapping.

BTW, implementing HasId at X places using exactly the same code-snippet looks 
pretty ugly, especially for CollectionSchema.MessageFactory and 
MapSchema.MessageFactory, because each enumeration value needs to implement it.

> I believe that will provide the best performance as it eliminates the extra 
> lookups on ser (string-to-id) and deser (id-to-string). 

I see how to use it for serialization. If there is a HasId schema/EnumIO/etc 
passed as a param, we just use HasId.getId() and obtain the numeric id without 
any expensive look-up based on the fqcn. (BTW, we only pass in the HasId in 
case of POJOs mostly. In all other cases, we pass only the FQCN. If we need to 
pass a HasId for Enum or Map, we'd need to do a lookup in the caller method. 
So, what do we gain then?)

But what about de-serialization? Currently, the code in ObjectSchema/etc looks 
like this:
 String identifier = IdentifierStrategy.readIdentifier(); 
 // And then eventually (but not for Arrays) lookup a Schema or EnumIO or
 // CollectionSchema.MessageFactory by this identifier

For Maps and Collections it is even more complex and may decide to look-up 
using EnumIO or MessageFactory depending on the identifier (see 
ObjectSchema.readObjectFrom)

So, just to clarify, do you suggest to replace such snippets by e.g.
HasSchema wapper = IdentifierStrategy.readIdentifierGetWrapper() ?

If this is the plan, what should be the return type of such a method? If we 
want to have only one method to rule them all, we have a problem, as Schema, 
EnumIO and MessageFactory do not have a common base type. So, do we want to 
return Object? Or do we want to have a set of methods for reading a (numeric) 
id and returning a required data structure? For example something like this:
IdentifierStrategy.readIdentifierGetSchema();
IdentifierStrategy.readIdentifierGetEnumIO();
IdentifierStrategy.readIdentifierGetCollectionMessageFactory();
IdentifierStrategy.readIdentifierGetMapMessageFactory();

Overall, I have a bit of the feeling that all the added complexity does not 
justify the potential advantages that are yet to be proven.

Would be nice if you could clarify on these topics, so that I can proceed.

Original comment by romixlev on 27 Feb 2012 at 12:47

GoogleCodeExporter commented 8 years ago

Looking at the patch (ObjectSchema), you'll notice that IdentifierStrategy's 
delegate method (readIdentifier) returns a string.  The change needed is to 
return the HasSchema for pojos, EnumIO for enums, 
CollectionSchema.MessageFactory for collections and MapSchema.MessageFactory 
for maps.

The DefaultStrategy would contain the old impl while the NumStrategy will read 
the int id and return the wrapper from its index (ArrayList).

Original comment by david.yu...@gmail.com on 28 Feb 2012 at 2:36

GoogleCodeExporter commented 8 years ago

Hi David,

1)

> The DefaultStrategy would contain the old impl while the NumStrategy will 
read the 
> int id and return the wrapper from its index (ArrayList).

Yes. This is exactly how I started implementing it yesterday.

> Looking at the patch (ObjectSchema), you'll notice that IdentifierStrategy's 
> delegate method (readIdentifier) returns a string.  The change needed is to 
return 
> the HasSchema for pojos, EnumIO for enums, CollectionSchema.MessageFactory 
for 
> collections and MapSchema.MessageFactory for maps.

Yes. But this is what I already stated in my question, or? 

So, do you want only one method that returns an Object (depending on the kind 
of id), that needs to be casted to HasSchema, EnumIO or one of the 
MessageFactories respectively in the ObjectSchema and other clients of this 
API? Is it a correct understanding?

2) Another question. Here is a snippet from ObjectSchema class

            case ID_COLLECTION:
                final String collectionType = IdentifierStrategy.getInstance()
                    .readIdentifier(input, number, schema);
                if(collectionType.indexOf('.') != -1)
                {
                    // EnumSet
                    final Collection<?> c = EnumIO.get(collectionType, true).newEnumSet();

                    if(input instanceof GraphInput)
                    {
                        // update the actual reference.
                        ((GraphInput)input).updateLast(c, owner);
                    }

                    COLLECTION_SCHEMA.mergeFrom(input, (Collection<Object>)c);
                    return c;
                }

                final Collection<Object> collection = 
                    CollectionSchema.MessageFactories.getFactory(
                            collectionType).newMessage();

                if(input instanceof GraphInput)
                {
                    // update the actual reference.
                    ((GraphInput)input).updateLast(collection, owner);
                }

                COLLECTION_SCHEMA.mergeFrom(input, collection);

                return collection;

            case ID_MAP:
                final String mapType = IdentifierStrategy.getInstance()
                    .readIdentifier(input, number, schema);
                if(mapType.indexOf('.') != -1)
                {
                    // EnumMap
                    final Map<?,Object> m = EnumIO.get(mapType, true).newEnumMap();

                    if(input instanceof GraphInput)
                    {
                        // update the actual reference.
                        ((GraphInput)input).updateLast(m, owner);
                    }

                    MAP_SCHEMA.mergeFrom(input, (Map<Object, Object>)m);
                    return m;
                }

                final Map<Object,Object> map = 
                    MapSchema.MessageFactories.getFactory(mapType).newMessage();

                if(input instanceof GraphInput)
                {
                    // update the actual reference.
                    ((GraphInput)input).updateLast(map, owner);
                }

                MAP_SCHEMA.mergeFrom(input, map);

                return map;

As you can see, in the case of maps and collections, it is not known in advance 
whether we need MessageFactories or EnumIO. So, how should the readIdentifier 
implementation know what to return in this case? 

3) Arrays just write/read the string identifier. But there is no explicit 
wrapper for Array objects. What do we do with it? Or should ArrayWrapper 
implement HasId to follow other kinds of schemas?

Original comment by romixlev on 28 Feb 2012 at 5:06

GoogleCodeExporter commented 8 years ago

1. A method for each type (pojo/enum/collection/map).
E.g
HasSchema<Object> hs = strategy.resolvePojo(...)
EnumIO<?> eio = strategy.resolveEnum(...)

2. strategy.newMap would return Map.  strategy.newCollection would return 
Collection.
You will know if the component is an enum because of id's first two bits.

3. We could implement it later after this.

Original comment by david.yu...@gmail.com on 28 Feb 2012 at 11:16

GoogleCodeExporter commented 8 years ago

> 2. strategy.newMap would return Map.  strategy.newCollection would return 
> Collection.
> You will know if the component is an enum because of id's first two bits.

Was a bit busy with other things, but now I'm almost finished. Still have a 
question here, because I do not quite understand "You will know if the 
component is an enum because of id's first two bits":

When I've read ID_COLLECTION or ID_MAP in the input stream in 
ObjectSchema.readObjectFrom() method, I haven't seen the numeric id that 
follows it yet! So, I don't know in readObjectFrom if I'll need to lookup via 
EnumIO or a proper MessageFactories. And I don't have the first two bits of the 
numeric id yet to decide.

Therefore, I see the following alternatives:
1) First read the identifier as string (because default policy cannot read it 
as integer) and then decide what to do - this is what is done now

2) Introduce in IdentifierStrategy methods resolveMapOrEnum and 
resolveCollectionOrEnum. They would read numeric/string id, do lookup or use 
first 2 bits, then determine if it is enum or map/collection and return a 
corresponding object, which needs to be casted in ObjectSchema into something 
appropriate.

3) Store ID_MAP and ID_COLLECTION in a special way: if the identifier requires 
enumio lookup, use new tags ID_MAP_ENUM and ID_COLLECTION_ENUM respectively in 
writeObjectTo. Then, in readObjectFrom have special switch cases for 
ID_MAP_ENUM and ID_COLLECTION_ENUM.

Or may be I miss something and still misunderstand you?

Thanks,
  Leo

Original comment by romixlev on 2 Mar 2012 at 3:11

GoogleCodeExporter commented 8 years ago

I see you've been busy on the jmeter plugins :-)  We've a lot of groups in 
common hehe.

I haven't looked though the enum handling.  Your third suggestion sounds good.
The important thing is the strategy needs to know whether to return EnumMap or 
EnumSet.
ObjectSchema can simply receive those as Map and Collection without any extra 
handling.

Original comment by david.yu...@gmail.com on 2 Mar 2012 at 8:02

GoogleCodeExporter commented 8 years ago

Hi,

I implemented the the handling of POJOs, paps, collections and enums as 
discussed. We have a HasId interface now and serialization, de-serialization is 
done avoiding any expensive look-ups. It uses only indexed access to ArrayLists.

What could be improved:
1) It uses different ArrayLists for enums, pojos, collections, maps. May be a 
single ArrayList  can be used using the shitfing by 2 bits trick, which would 
allow to encode the kind of entry using a 2bit tag. But this is a minor issue, 
as it does not affect performance. Only the memory consumption will be improved.

2) The fact that ids of types are stored in the HasId implementations makes it 
difficult (if not impossible) to use different IdentifierStrategies with 
different numeric ids for the same type. A possible workaround could be to 
treat the ids stored in HasId implementations as internal unique ids of those 
types and then have another mapping from those ids to a 
IdentifierStrategy-specific numeric ids. It would require one additional 
ArrayList<Integer> (for more details see my proposal in the previous comments)

3) More unit tests would be nice. In particular, I'm not 100% confident about 
ObjectSchema.transferObject and a few other functions.

-Leo

Original comment by romixlev on 6 Mar 2012 at 2:38

Attachments:

protostuff-v5.patch

GoogleCodeExporter commented 8 years ago

1. Sure.  But as you said, its a non issue.

2. Yep.  You can easily have a strategy w/c maps the internal ids (or fqcn) to 
another set of ids.

3. Definitely. This will be a good feature for 1.0.5

So I'm guessing you've measured the performance gains with this approach 
compared to the last one (or the stock one)?  It should be scream.

I'm thinking of making the strategy static final (non-switchable) once 
protostuff-runtime is loaded (RuntimeEnv).  Configuration will be via system 
properties (ala RuntimeEnv).  This will be for concurrent 
correctness/performance.  What do you think?

By using a different classloader, you can load 2 instances of 
protostuff-runtime (via jboss-modules) and have your main application 
communicate with use both (I might be wrong here).

Anyway, good job!  I'll take a closer look at the patch.

Original comment by david.yu...@gmail.com on 6 Mar 2012 at 4:58

GoogleCodeExporter commented 8 years ago

> So I'm guessing you've measured the performance gains with this approach 
compared 
> to the last one (or the stock one)?  It should be scream.

Actually, I haven't measured the performance. Didn't have time and I was more 
interested in smaller size of serialized representation rather than in shorter 
serialization times. But IIRC, even before the last ArrayList implementations 
it was about 2 times faster then the stock version, probably because it had to 
write/read 3-4 times less bytes.

It would be cool, if you could use one of the serializer benchmarks to test it. 
I'm really curious about the improvements it may give.

> I'm thinking of making the strategy static final (non-switchable) once 
protostuff-
> runtime is loaded (RuntimeEnv).  Configuration will be via system properties 
(ala 
> RuntimeEnv).  This will be for concurrent correctness/performance.  What do 
you 
> think?

I'm not sure I like non-switchable approach that much. You remember the 
use-cases I'm interested in ;-) But I think that adding/removing anything for a 
given mapping should be forbidden, once it is in use (i.e. read/write was 
called at least once). Otherwise it may lead to unpredictable results. 

At the same time, I'd still like to have the possibility to assign a new 
default policy. And I still consider the thread-local approach or the like.
I also do not really belive that declaring the strategy as final static will 
make things much faster. But that can be just tested using a benchmark, or? If 
it really brings something (let's say  than 10-15%) than its worse effort. If 
not, it would just reduce the flexibility.

>By using a different classloader, you can load 2 instances of 
protostuff-runtime 
>(via jboss-modules) and have your main application communicate with use both 
(I 
>might be wrong here).

Actually, I do not use prostuff with JBoss. I use it in standalone apps and try 
to use it with Hazelcast, as a replacement for their default serialization.

May be the class-loader approach described by you is possible, but it is too 
complex, IMHO. I think it should be possible to assign new global policy. 

Looking forward to your review of the patch and benchmark figures!

Original comment by romixlev on 6 Mar 2012 at 6:26

GoogleCodeExporter commented 8 years ago

"I think it should be possible to assign new global policy. "

It is important for the static strategy to be set only once (atleast from a 
regular user's POV).  But your strategy can make it switchable (your use-case), 
by wrapping another strategy (not final, hence switchable) and delegate.

Original comment by david.yu...@gmail.com on 6 Mar 2012 at 8:21

GoogleCodeExporter commented 8 years ago

"It would be cool, if you could use one of the serializer benchmarks to test it"

The existing ones do not have a polymorphic dataset.  Will try to make one on 
the junit tests.

Original comment by david.yu...@gmail.com on 6 Mar 2012 at 8:23

GoogleCodeExporter commented 8 years ago

Hi David,

Any news regarding benchmarking or plans for including this feature into the 
trunk?

Original comment by romixlev on 17 Mar 2012 at 11:39

GoogleCodeExporter commented 8 years ago

This will definitely be included.  I'm still trying to find time to sweep the 
pending issues and do the next release. (Hopefully this weekend)

Original comment by david.yu...@gmail.com on 20 Mar 2012 at 10:03

GoogleCodeExporter commented 8 years ago

I've attached the modified patch (Note that 22 and 23 are used for enum_set and 
enum_map).  The test passes for the core formats including json and xml.
Benchmarks are next.
I'll also try to add the non-strict num strategy.

Original comment by david.yu...@gmail.com on 26 Mar 2012 at 12:50

Attachments:

issue-96.diff

GoogleCodeExporter commented 8 years ago

Here's v2 which allows registration of custom collections/maps (w/c will also 
now work without explicit registration when DefaultIdStrategy is used).

This will be committed soon.  Let me know if you have any concerns/issues.

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 11:14

Changed state: Accepted

Attachments:

v2.diff

GoogleCodeExporter commented 8 years ago

I'm wondering if the other id strategies (other than default) should be in 
another module.

RegisteredIdStrategy (strict ... already implemented)
IncrementalIdStrategy (generate ids on the fly ... to be implemented)

These 2 files being jarred will result to ~30kb alone.

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 11:18

GoogleCodeExporter commented 8 years ago

v3.diff

"I work for a company which does a lot with distributed systems coming from 
different vendors. Same encoding/numbering cannot be always ensured. Each 
client may be implemented/operated by a different 3rd party, which may have its 
own numbering of FQCN ids. Some of their classes are not even known to us in 
advance, as they can use our systems to exchange any kind of information, 
including their own classes (think about event buses, data grids and similar 
frameworks). We would usually receive their serialized messages over a network, 
do something with it and eventually forward to a different client (which may 
use a different numbering schema). We have no easy way to enforce a unique 
non-conflicting numbering. Therefore, I suggested that numbering should be 
associated to streams, because this way you associate it with a specific stream 
coming from a specific client and one specific client is hopefully using an 
internally consistent way of serialization for its own messages.

How would you cover such a use-case with your proposed approach? Or is it out 
of scope for protostuff?"

This is now doable (with the v3.diff).  You can have an IdStrategy you can use 
for specific incoming and outgoing streams.

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 1:37

Attachments:

v3.diff

GoogleCodeExporter commented 8 years ago

> Here's v2 which allows registration of custom collections/maps (w/c will also 
now 
> work without explicit registration when DefaultIdStrategy is used).

Cool!!!

Would it also work for very custom colllections/maps that are not derived from 
Collection or Map interfaces? If not, how would one handle them? This probably 
relates to our discussion on the Issue 88. But even if it does not work, it is 
a relatively seldom use-case probably. I'm asking about it for the sake of 
completeness.

> I'm wondering if the other id strategies (other than default) should be in 
another 
> module.
> RegisteredIdStrategy (strict ... already implemented)
> IncrementalIdStrategy (generate ids on the fly ... to be implemented)
> These 2 files being jarred will result to ~30kb alone.

You are right, may be having them in the dedicated module/jar is a good idea. 
If people don't want to use them, they should not carry >= 30KB with them. 
Especially on Android devices and the like, where storage space could be a bit 
limited.

> This is now doable (with the v3.diff).  You can have an IdStrategy you can 
use for 
> specific incoming and outgoing streams.

Very good!

My God! This is a very big patch. You performed quite a huge refactoring to 
propagate the strategy parameter through many, many APIs. I was too lazy to do 
that ;-)

I just looked at the diff, but have not applied it yet. But at the first glance 
I missed any test case for using an IdStrategy for specific incoming and 
outgoing streams. Have I overlooked such a test case or it is not there yet?

Original comment by romixlev on 27 Mar 2012 at 2:16

GoogleCodeExporter commented 8 years ago

Another question:

If I look at AbstractRuntimeObjectSchemaTest.IdStrategyFactory implementation, 
I see that it registers not only custom classes, but also the standard ones 
like CollectionSchema.MessageFactories.ArrayList and 
CollectionSchema.MessageFactories.HashSet.

I'm wondering if it is safer to have a re-usable code that pre-registers all 
built-in and standard classes, just to be on the safe side and ensure that for 
those classes the encoding is always the same. This code can then be called 
from any strategies that protostuff will ship or from custom ones. 

What do you think?

Original comment by romixlev on 27 Mar 2012 at 2:28

GoogleCodeExporter commented 8 years ago

"Would it also work for very custom colllections/maps that are not derived from 
Collection or Map interfaces? If not, how would one handle them? This probably 
relates to our discussion on the Issue 88. But even if it does not work, it is 
a relatively seldom use-case probably. I'm asking about it for the sake of 
completeness."

You do it the usual way.  Create a schema for that collection/map and register 
it like a regular message.  You'll have explicit control on how it is 
serialized.

"I just looked at the diff, but have not applied it yet. But at the first 
glance I missed any test case for using an IdStrategy for specific incoming and 
outgoing streams. Have I overlooked such a test case or it is not there yet? "

See AbstractRuntimeObjectSchemaTest.

Say for your use case.  You basically associate a specific IdStrategy to your 
input/output.
Creating a schema from a specific IdStrategy would basically be:
Schema<Foo> schema = RuntimeSchema.getSchema(Foo.class, idStrategy);

That schema and all its dependencies will be based from that id strategy 
(isolated).

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 2:34

GoogleCodeExporter commented 8 years ago

"I'm wondering if it is safer to have a re-usable code that pre-registers all 
built-in and standard classes, just to be on the safe side and ensure that for 
those classes the encoding is always the same. This code can then be called 
from any strategies that protostuff will ship or from custom ones. "

The RegisteredIdStratedy is strict in a sense that you have full control of 
every single detail.
The reason for which is that ids 1-15 only take 1 byte to serialize (with the 
protobuf tag).  So if a user knows he has more custom collections that are 
frequently used (for example he has 15), he will probably wanna allocate the 
first 15 ids to it.

Nothing beats full control if you ask me.

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 2:38

GoogleCodeExporter commented 8 years ago

> Say for your use case.  You basically associate a specific IdStrategy to your 
> input/output.
> Creating a schema from a specific IdStrategy would basically be:
> Schema<Foo> schema = RuntimeSchema.getSchema(Foo.class, idStrategy);
> That schema and all its dependencies will be based from that id strategy 
(isolated).

OK. I think I got it. 

So, one basically creates idStrategy-bound schemas for the required set of 
classes used for serialization. And then one obtains different schemas for the 
same class by binding it to different idStrategies. One this is done, you just 
use a corresponding idStrategy-specific schemas to read/write in a required 
serialization format. Right?

One question, just to check:
When implementing transferXXX from input to output stream, do you take into 
account that input IdStrategy could be different from the output idStrategy. Or 
is it assumed that they are the same in case of tranfser-methods?

> The RegisteredIdStratedy is strict in a sense that you have full control of 
every 
> single detail.
> The reason for which is that ids 1-15 only take 1 byte to serialize (with the 
> protobuf tag).  So if a user knows he has more custom collections that are 
> frequently used (for example he has 15), he will probably wanna allocate the 
first 
> 15 ids to it.
> Nothing beats full control if you ask me.

I agree with you. 

My point was to provide some sort of a boiler plate code for those, who are not 
such a control junkie ;-) If they do not want to control everything up to the 
last bit, then having a common predefined way to register most popular types 
and classes is just safer and quicker during development. Otherwise they may 
need to do several iterations of edit/compile/run before they include all 
required standard classes used by their app into the registry.

Original comment by romixlev on 27 Mar 2012 at 2:51

GoogleCodeExporter commented 8 years ago

One more comment:

I have the impression that with your current patch one needs to always provide 
a concrete numeric id, when registering a class. 

What about a use-case, where registration of classes does not happen in one 
place (e.g. in one class or method), but is spread among few places, which are 
written by different parties (e.g. by a 3rd party) and therefore do not know 
exactly, which id ranges are already used.

May be there should be a convenience method to assign a next free id to a given 
class when it is registered? 
Or at least a method to get a number of currently registered classes (more 
precisely, the highest used id) for a given strategy. Then one can generate the 
next id itself by using (maxId + 1) when calling the register method.

Original comment by romixlev on 27 Mar 2012 at 3:02

GoogleCodeExporter commented 8 years ago

That is handled by the IncrementalIdStrategy (a hybrid where you can register 
and allocate the first x ids, then the rest can be dynamic).

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 3:33

GoogleCodeExporter commented 8 years ago

Attached v4.diff (includes the incremental id strategy).

Kindly take a look at the impl.  If you have a better idea on how to improve 
its perf, let me know.

If all is well, I'll commit this tomorrow.
The registered and incremental id strategies will likely be on its own module.

Original comment by david.yu...@gmail.com on 27 Mar 2012 at 6:57

Attachments:

v4.diff

GoogleCodeExporter commented 8 years ago

Hi David,

I had a quick look at the v4.diff

Here are my quick comments:

1) RegisterIdStrategy uses ArrayLists for Integer->Class mappings. But 
IncrementalIdStrategy uses ConcurrentHashMaps for the same mapping. Is it on 
purpose? I guess it introduces some overhead.

2) RegisterIdStrategy and IncrementalIdStrategy do usually at least one lookup 
in the Map (Class->Integer) to get the id of the class. The reason for that is 
the fact that you pass Class as a parameter almost everywhere in IdStrategy 
functions. In my version I passed Schema or HasSchema almost everywhere and 
they had getId() method directly. But probably it was doing a similar lookup on 
the caller's side to obtain a Schema/HasSchema/SchemaWrapper object from the 
value.

3) IncrementalIdStrategy allows to reserve some initial ids and start assigning 
ids from a given number. But it does not implement the method "register" and 
throws an exception. The comment for this class says "and you can initially 
register classes". My understanding was that one can use reserved ids for 
initially registered classes and all others get their ids autoassigned. But how 
can such classes be initially registered then?

4) Nice to have (was present in my patch): There is no shipped strategy which 
would be almost like IncrementalIdStrategy, but would allow to send/serialize a 
mapping from a FQCN to the numeric id (i.e. would send both FQCN and numeric 
id) when it is used first time. The idea with such a strategy is that the 
receiver/deserializer does not need to know in advance the mapping used by the 
sender. The receiver can get it from the input stream directly and build a 
local strategy on the fly as it reads input. So, the sender more or less sends 
a dictionary/mapping entry by entry when it discovers a new entry. The use-case 
for such a strategy: you get a serialized message from a 3rd party client and 
you have no idea in advance about encoding it uses. So, either you need to 
handshake a dictionary/idStrategy with that 3rd party client by some other 
means or you use such a strategy like I described and then you can deserialize 
such messages as long as classes with mentioned names are present in your 
class-loader.

Original comment by Marina.L...@gmail.com on 27 Mar 2012 at 8:14

GoogleCodeExporter commented 8 years ago

One more small comment (I repeat it, as you haven't answered)

When implementing transferXXX (e.g. in Idstrategy-based classes) from input to 
output stream, do you take into account that input IdStrategy could be 
different from the output idStrategy. Or is it assumed that they are the same 
in case of tranfser-methods? My feeling is that you assume at many places that 
the same strategy is used for both streams.

Original comment by Marina.L...@gmail.com on 27 Mar 2012 at 8:19

GoogleCodeExporter commented 8 years ago

First of all, Marina == romixlev?
1.  ConcurrentHashMap is needed as it needs to be thread safe and the ids are 
atomically allocated.  RegisteredIdStrategy on the other hand is immutable 
hence you don't pay for the concurrent data structures.

2.  "In my version I passed Schema or HasSchema almost everywhere and they had 
getId() method directly"

Basically you looked up the schema first then you handed it to IdStrategy.
Here, the IdStrategy is the one who looks up first then returns the schema.  So 
now time lost here.  Basically the lookup is delegated from 
ObjectSchema/DerivativeSchema to the IdStrategy.

3.  Ah yes, I'll add public methods to do so.  Note that the register method 
does not take an it (It is only there for backwards compatibility ... ensuring 
the others who use RuntimeSchema.register(clazz) will work out-of-the-box 
because its implemented on DefaultIdStrategy)

4.  See #5.

5.  It must be the same IdStrategy otherwise the id lookups would fail.  You 
can although implement a custom strategy for your needs reads from a 
DefaultIdStrategy (fqcn) and writes an id (ala 
RegisteredIdStrategy/IncrementalIdStrategy)

The idea is on transfer, you:
1. Read the string fqcn
2. Load the class
3. Assign an int id
4. Write the id

As you said, this will work as long as the class is available on the receiving 
end's classloader.

Use the existing strategies as a template and customize to fit your needs 100%.

Original comment by david.yu...@gmail.com on 28 Mar 2012 at 3:10

GoogleCodeExporter commented 8 years ago

> First of all, Marina == romixlev?

Oops. Sorry. Yes, it is me, romixlev. Obviously, I forgot to log out from my 
wife's gmail account at home, when I was submitting a comment ;-)

> 1.  ConcurrentHashMap is needed as it needs to be thread safe and the ids are 
> atomically allocated.  RegisteredIdStrategy on the other hand is immutable 
hence 
> you don't pay for the concurrent data structures.

I see your point. But I think that the price/overhead is too high. IMHO, a 
typical scenario would first (or let's say in the initial phase) register 
explicitly some classes with reserved ids and then auto-register classes using 
the usual IncrementalIdStrategy's approach. Once this is over, there are no 
changes to be expected, especially changes that happen often. Therefore using 
ConcurrentHashMap could be an overkill. May be some sort of locking in the 
initial phase only and later no locks? E.g. introducing some sort of a flag 
that would indicate that the registry is built now and no further changes are 
possible? And methods to set/reset it (sort of freeze/unfreeze)? If this flag 
is set, then registry becomes immutable and there is no need for any locking. 
If flag is not set, then registry is still being built and locks should be used 
when accessing the mappings....
Do you think something can be tweaked along these lines to make performance 
better for the majority of typical scenarios using this strategy?

2) Regarding the lookups moved from ObjectSchema to IdStrategy: yes, you are 
right. This is what I guessed.

5) I see your point. My version also didn't implement it, i.e. using different 
strategies for input and output. Doing so in the most general case would 
require an association between streams and strategies, I guess. Then one could 
say something like:
   inPojoId = in.getStrategy().resolvePojoId...();
   outPojoId = ...; // find in the out.getStrategy() what is the id corresponding to inPojoId class 
   out.getStrategy().writePojo(outPojoId);

 But let's forget about it for now. It is not the most important feature ;-)

Original comment by romixlev on 28 Mar 2012 at 6:52

GoogleCodeExporter commented 8 years ago

With a hard limit, we can use an array list for the ids instead of 
ConcurrentHashMap.
E.g, you specify a max, then the array list will allocate once and will never 
resize (and will be protected by a volatile field as the happens-before 
barrier).
Will attach the final patch and I think its good to go.

Original comment by david.yu...@gmail.com on 28 Mar 2012 at 7:31

GoogleCodeExporter commented 8 years ago

> With a hard limit, we can use an array list for the ids instead of 
> ConcurrentHashMap.
Yes.

> E.g, you specify a max, then the array list will allocate once and will never 
> resize (and will be protected by a volatile field as the happens-before 
barrier).
> Will attach the final patch and I think its good to go.

In ideal case, it would be nice to avoid setting max in advance, as you 
probably don't know the sharp max limit (though you can always set it to a 
reasonably big number, which should be OK for sure). That's why I was thinking 
about using something resizable at the beginning and then, once one says 
"freeze" it becomes immutable and can use a more efficient implementation 
(ArrayList) from now on. But this is probably too complex compared to just 
reserving enough space for max possible number of items.

Original comment by romixlev on 28 Mar 2012 at 8:18

telenths / protostuff

user definable mapping: FQNC's --> integer, for serialized stream size-reduction #96