Closed GoogleCodeExporter closed 8 years ago
Hi,
Please find attached the initial version of the patch, as described in the
following thread:
https://groups.google.com/forum/?fromgroups#!topic/protostuff/1W2_v3wp4F8
It implements the encoding of FQNCs using numeric ids. This is done for objects
of types unknown in advance, but also for all maps, collections, arrays and
enums.
There are two new classes: IdentifierStrategy and NumIdIdentifierStrategy that
encapsulate the logic for reading and writing classnames.
The IdentifierStrategy class implements the old behavior.
The NumIdIdentifierStrategy class implements the new behavior and uses numric
ids.
A test case can be found in ObjectSchemaWithIdsTest.
To register mappings, one needs to call
RuntimeSchema.setClassId(classname, id)
The mapping will be only applied, if the following is called:
IdentifierStrategy.setStrategy(new NumIdIdentifierStrategy());
TODOs:
- Better way to set/initialize the IdentifierStrategy (see the mentioned thread
for some ideas)
- Put IdentifierStrategy as a field into the Schema to avoid map lookups during
serialization?
- More efficient encoding of identifiers in case they are not registered by any
mapping (now we waste 32bits for keeping the 0 value, indicating that an
identifier follows; and we use and additional field number, because I have not
found a way to output a string to Output without a field number)
- More efficient encoding of numeric ids (something like variable length
VarInt32 encoding?)
- Backwards compatibility? Ability to read both old and new encodings from the
same Input?
Original comment by romixlev
on 13 Feb 2012 at 4:27
Attachments:
An updated version of the patch.
It moves all maps and mapping related function into IdentifierStrategy class
and its sub-classes, which is much cleaner.
It also introduces a new AutoNumIdIdentifierStrategy class, which does not
require a dedicated registration in advance and does it automatically, when a
new identifier occurs for the first time.
Original comment by romixlev
on 14 Feb 2012 at 10:06
Attachments:
Thanks for the patch!
A few remarks:
1. Left-curly brackets not *consistently* on a new line)
2. IdentifierStrategy as abstract class, with DefaultIdentifierStrategy being
the default impl (writes FQCNs and does not include the concurrenthashmap).
3. The AutoNumIdIdentifierStrategy is better off using an array list rather
than a map<integer,string>
default = no maps, num = map<int,string>, autonum = list<string>
We can also introduce "strict", w/c will throw an exception if it encounters an
unregistered message. This will avoid bugs, security issues.
"More efficient encoding of numeric ids (something like variable length
VarInt32 encoding?)" - underneath it already is varint encoded.
Additional TODOs:
- Avoid having to do map lookups (id) during serialization (applies to
enums/collections/maps/pojos).
- make it possible to register/configure custom collections/maps (with ids as
well)
I think this is a good head start.
Original comment by david.yu...@gmail.com
on 14 Feb 2012 at 3:02
> 1. Left-curly brackets not *consistently* on a new line)
Fixed.
> 2. IdentifierStrategy as abstract class, with DefaultIdentifierStrategy
being the
> default impl (writes FQCNs and does not include the concurrenthashmap).
Fixed.
> 3. The AutoNumIdIdentifierStrategy is better off using an array list rather
than
> a map<integer,string>
Fixed.
> default = no maps, num = map<int,string>, autonum = list<string>
Done. I think, we could also use list<String> for num, in principle. In the
worst case (big id is assigned), we just grow the ArrayList.
> We can also introduce "strict", w/c will throw an exception if it encounters
an
> unregistered message. This will avoid bugs, security issues.
> Additional TODOs:
> - Avoid having to do map lookups (id) during serialization (applies to
> enums/collections/maps/pojos).
I'm not sure how to do this technically and semantically. We need to put info
about the id into schemas, I guess. But the problem is: schemas do not even
know the name of the classes they relate to. Then Enums don't have schemas, as
you mentioned. And last, but not least, I don't like the tight coupling between
schemas and identifiers and having only one global identifier numbering. The
reason is: imagine that we have 2(or more) apps using protostuff and written by
different people at different time. It could very well be that they use
inconsistent numeric ids, i.e. everyone introduced his own numbering. Or even
in case of automatic numbering, it is different due to different class-loading
order. And now imagine that those apps need to exchange information using
protostuff encoding. We can either force them to use the same numbering or we
can try to support different numberings for the schemas. E.g. we use one
numbering for incoming messages from a given partner and another for outgoing.
To solve this and also to be able to propagate the IdentifierStrategy, I think
it could be a good idea to associate it with something that is always
propagated anyway, e.g. with Inputs or Outputs. Which means that we'll get
stream specific numbering, which makes a lot of sense. Of course, of no
explicit IdentifierStrategy is associated with a stream it can take a global
one as a default.
What do you think? If you like the idea of IdentifierStrategy associated with
streams, what would be the way to achieve it?
> - make it possible to register/configure custom collections/maps (with ids as
well)
Ids can be registered for anything as they are not connected to schemas in any
form currently. Ability to register/configure custom collection/maps is useful,
but it is a bit beyond of scope of this issue, IMHO, and will require many more
changes.
> - Put IdentifierStrategy as a field into the Schema to avoid map lookups
during
> serialization?
See my considerations above.
> - More efficient encoding of identifiers in case they are not registered by
any
> mapping (now we waste 32bits for keeping the 0 value, indicating that an
> identifier follows;
> and we use and additional field number, because I have not found a way to
output a
> string to Output without a field number)
Any idea how to achieve this (String without field number in front of it?) for
the NumIdIdentifierStrategy?
Original comment by romixlev
on 15 Feb 2012 at 3:33
Attachments:
> - More efficient encoding of identifiers in case they are not registered by
any
> mapping (now we waste 32bits for keeping the 0 value, indicating that an
> identifier follows;
Not really 32bits as it is varint encoded.
The approach is to use a different field number to write the int id (rather
than same as FQCN, 127).
Basically on ser:
If registered, writeInt32(126, id, false);
If not, writeString(127, fqcn, false);
On deser, if 126, registered .. if 127, fqcn, else throw exception.
This way, no extra field is written (wasted).
Original comment by david.yu...@gmail.com
on 16 Feb 2012 at 3:55
> The approach is to use a different field number to write the int id (rather
than
> same as FQCN, 127).
> Basically on ser:
> If registered, writeInt32(126, id, false);
> If not, writeString(127, fqcn, false);
> On deser, if 126, registered .. if 127, fqcn, else throw exception.
>This way, no extra field is written (wasted).
I see your point. But it is a bit more complicated. We use IdentifierStrategy
not only for ID_POJO(127), but also for ID_MAP, ID_COLLECTION, ID_ENUM, etc.
Therefore, we need to introduce new tags for them as well, if we follow your
approach. And if we later introduce any new ID_somthing, we'll need to
introduce an additional code for its numid encoded field number.... I'm not
100% convinced it is a best possible approach. It couples id encoding with
fields, their nums and their semantics...
Do you still say we should do it this way or are there any alternatives?
P.S. Any comments on other questions? Especially on relationship between
strategies and schemas or streams?
Original comment by romixlev
on 16 Feb 2012 at 5:34
Regarding the relationship between strategies and schemas, I think we can put
an extra id field (int) on the RuntimeSchema to avoid extra lookups. Still I
prefer this be implemented together with enums/collections/maps as a unified
way and not just for pojos.
"And if we later introduce any new ID_somthing, we'll need to introduce an
additional code for its numid encoded field number.... I'm not 100% convinced
it is a best possible approach. It couples id encoding with fields, their nums
and their semantics...
Do you still say we should do it this way or are there any alternatives?"
We'll only implement that for pojos (127='_' fqcn, 126='$' id) since that is
the most common.
To ensure that there won't be a waste in enums/collections/maps, one can employ
a StrictNumStrategy (can be a good reminder for a developer to register).
Original comment by david.yu...@gmail.com
on 16 Feb 2012 at 6:13
'$' does not seem to be valid xml name.
Btw, don't you prefer to use StrictNumStrategy? (It can be for security
purposes as well.)
Others who don't use it are basically saying they can live with FQCNs.
Its basically all or nothing.
Original comment by david.yu...@gmail.com
on 16 Feb 2012 at 6:28
> We'll only implement that for pojos (127='_' fqcn, 126='$' id) since that is
the
> most common.
> To ensure that there won't be a waste in enums/collections/maps, one can
employ a
> StrictNumStrategy (can be a good reminder for a developer to register).
OK. I can do it this way.
> Btw, don't you prefer to use StrictNumStrategy? (It can be for security
purposes
> as well.) Others who don't use it are basically saying they can live with
FQCNs.
I like the StrictNumStrategy, which I'll add today ;-)
My only problem with it is when serilization and deserialization is controlled
by different parties and uses potentially different numberings, if at all (and
this is very common for networking applications, where different independently
developed nodes talk to each other). In those cases StrictNumStrategy does not
help that much. In many such cases, you actually want the serializing party to
tell the deserializing party how to deserialize. This was a bit the motivation
for associating IdentifierStrategy with streams, not with schemas. But for
those cases, where it is not an issue, I agree that StrictNumStrategy is a good
default choice.
> '$' does not seem to be valid xml name.
Is it a requirement that field name is a valid xml name? If so, what do you
propose?
> Regarding the relationship between strategies and schemas, I think we can put
an
> extra id field (int) on the RuntimeSchema to avoid extra lookups.
Yes that could be done. How do you propose to initialize it? When a mapping is
being registered? I.e. if we register a new mapping and a schema exists
already, we set this field. Or if schema is created after registration, we
check during its creation which numid is assigned to the corresponding fqcn?
BTW, is it guaranteed that there is at most one fqcn that refers to any given
RuntimeSchema or could they be shared somehow?
> Still I prefer this be implemented together with enums/collections/maps as a
> unified way and not just for pojos.
I'd like to see it as well. But I don't yet see how to do it in a nice simple
way.
Original comment by romixlev
on 16 Feb 2012 at 7:07
- Added a strict policy
- Updated the JUnit test
- Implemented this (slightly different): "The approach is to use a different
field number to write the int id (rather than same as FQCN, 127)"
Original comment by romixlev
on 16 Feb 2012 at 1:48
Attachments:
Hi David,
Any comments? How should we proceed?
Original comment by romixlev
on 20 Feb 2012 at 9:54
"> Still I prefer this be implemented together with enums/collections/maps as a
> unified way and not just for pojos.
I'd like to see it as well. But I don't yet see how to do it in a nice simple
way."
An interface "HasId" could be added to protostuff-api, which could be
implemented by the components. Setting the id could be done after creation.
For enums, EnumIO would implement HasId.
The strategy will need ArrayList<EnumIO> for the id index (deser optimization).
For pojos, HasSchema would implement HasId.
The strategy will need ArrayList<HasSchema> for the id index (deser
optimization).
For collections/maps, the CollectionSchema.MessageFactory would implement HasId.
As well as MapSchema.MessageFactory.
The strategy will need ArrayList (each) for the id index (deser optimization).
I believe that will provide the best performance as it eliminates the extra
lookups on ser (string-to-id) and deser (id-to-string).
Any scenarios that I might have overlooked?
Original comment by david.yu...@gmail.com
on 20 Feb 2012 at 2:00
Hi David,
Thanks for the feedback.
> Any scenarios that I might have overlooked?
Well, I outlined the reasons for not associating ids to schemas directly. But
obviously, you have a different opinion ;-)
I work for a company which does a lot with distributed systems coming from
different vendors. Same encoding/numbering cannot be always ensured. Each
client may be implemented/operated by a different 3rd party, which may have its
own numbering of FQCN ids. Some of their classes are not even known to us in
advance, as they can use our systems to exchange any kind of information,
including their own classes (think about event buses, data grids and similar
frameworks). We would usually receive their serialized messages over a network,
do something with it and eventually forward to a different client (which may
use a different numbering schema). We have no easy way to enforce a unique
non-conflicting numbering. Therefore, I suggested that numbering should be
associated to streams, because this way you associate it with a specific stream
coming from a specific client and one specific client is hopefully using an
internally consistent way of serialization for its own messages.
How would you cover such a use-case with your proposed approach? Or is it out
of scope for protostuff?
Original comment by romixlev
on 20 Feb 2012 at 3:00
How would you cover such a use-case with your proposed approach? Or is it out
of scope for protostuff?
There is a way but I dunno if it will perform.
We can use a ThreadLocal and the ObjectSchema will use that to check if there
was a strategy to use for ser and deser.
It simply cannot be done on the stream itself since it has no methods like
readIdFromInput/etc/etc.
Original comment by david.yu...@gmail.com
on 21 Feb 2012 at 3:49
> There is a way but I dunno if it will perform.
> We can use a ThreadLocal and the ObjectSchema will use that to check if there
was
> a strategy to use for ser and deser.
I don't think this will scale...
> It simply cannot be done on the stream itself since it has no methods like
> readIdFromInput/etc/etc.
I think I have a possible solution.
Let's auto-assign a numeric id to each schema registered locally (this is not
the id used during serialization!). Let's call id schemaId. We can easily
set/get it using getSchemaId/setSchemaId using the same approach you described
before.
And then, in each IdentifierStrategy we have a mapping from SchemaId to
IdentifierId. The later one is used for serialization/deserialization. Since we
map int to int, we can use e.g. ArrayList and be very efficient.
We can also have a reverse mapping from IdentifierId to SchemaId.
Then, we add two methods to each stream (Input or Output):
setIdentifierStrategy/getIdentifierStrategy. They would associate a custom
strategy (i.e. custom numbering) with a given stream. By default, if no
strategy is provided, a stream would use the global one.
Now, if we need to serialize a given object O belonging to schema S using a
stream IO we do:
- get schemaId from schema (simple getter in Schema object)
- get IdentifierStrategy to be used ( using IO.getIdentifierStrategy)
- call IdentifierStrategy.getIdentifierIdForSchemaId(schemaId) - indexed access
to an ArrayList
- write this identifierId into the stream
And on deserialization we do:
- get IdentifierStrategy to be used ( using IO.getIdentifierStrategy)
- read identifierId from a stream using this strategy
- call IdentifierStrategy.getSchemaIdFromIdentifierId(identifierId) or
IdentifierStrategy.getSchemaFromIdentifierId(identifierId) depending on what we
need.
So, in both cases, we never do any expensive lookups using maps. And this
approach combines your proposal with a flexibility of being able to use
different encoding with different streams. This is just an idea. I haven't
implemented it yet.
What do you think about it? Something that I overlooked? Would it work?
Original comment by romixlev
on 21 Feb 2012 at 4:31
"Then, we add two methods to each stream (Input or Output):
setIdentifierStrategy/getIdentifierStrategy. They would associate a custom
strategy (i.e. custom numbering) with a given stream. By default, if no
strategy is provided, a stream would use the global one."
Not a good idea. The input/output should not be concerned/associated with ids.
They should be as stateless as possible, which makes it easier to add new
formats. Plus, vanilla protostuff/protobuf does not even need ids for messages.
Any other ideas?
Original comment by david.yu...@gmail.com
on 21 Feb 2012 at 11:39
> Not a good idea. The input/output should not be concerned/associated with
ids.
> They should be as stateless as possible, which makes it easier to add new
> formats.
> Plus, vanilla protostuff/protobuf does not even need ids for messages.
> Any other ideas?
1) Introduce an external mapping using IO stream objects(or may be its internal
stream id, that we could introduce?
Is it OK to introduce getId/setId as methods to auto-assign/read unique stream
ids? This would still keep streams rather stateless)
as a key and mapping it to contexts, which include IdentifierStrategy.
Depending on the implementation's effectiveness of such a mapping, the look-ups
could be rather costly or fast.
2) Introduce an optional "context" parameter to each read/write operation
invoked on schema (and streams?).
This context can propagate a lot of information around, including
IdentifierStrategy reference.
Of course, this method is very intrusive and would require a lot of API changes
and refactoring due to
introduction of this parameter.
The mentioned ThreadLocal is an option for context propagation, but not a very
scalable one, IMHO.
Especially if you have thread pools and thousands of Runnables, each using its
own IdentifierStrategy (and context).
Then you'll spend a lot of time just setting/reading those ThreadLocal settings.
Do you see any other scalable way to associate different strategies with with
streams?
Original comment by romixlev
on 22 Feb 2012 at 8:39
One more thinko in addition to mentioned above:
Imagine that we do not change the interfaces for Input/Output.
Instead we define a new interface HasId. Some of the implementations of
Input/Output/Schema may implement it and provide their id. Some others - may
not.
Every time we need to figure out a strategy specific for a given stream, we
check if it is an instance of HasId. If so, we take its id and do mapping from
id to context related to this stream (it can be done efficiently, as it is an
integer). If this is not a HasId impl, we simply use the global context, i.e.
the global IdentifierStrategy in our case.
This way, we have minimal changes to interfaces and only slightly enhance
implementations of existing protostuff IO streams.
Original comment by romixlev
on 22 Feb 2012 at 2:55
I'm not too fond of adding these to Input/Output (its only job should be
stateless read/write).
The one I suggested earlier would allow for unified impl for ids as well as
maintain performance.
You have a special usecase in which it might be difficult to integrate
unobtrusively.
The goal is performance + unobtrusive.
Original comment by david.yu...@gmail.com
on 26 Feb 2012 at 5:21
Hi David,
> I'm not too fond of adding these to Input/Output (its only job should be
stateless
> read/write).
My last proposals WOULD NOT add ANY state to streams. It would add only unique
immutable stream identifiers (they have them in fact already as internal JVM
object ids, but these ones are not very handy for mapping using ArrayList). So,
streams remain as stateless, as they were before. But anyway, let's skip it for
now, as you don't like it ...
> The one I suggested earlier would allow for unified impl for ids as well as
> maintain performance.
I'm trying now to implement your proposal, but face some issues. Probably I
don't quite understand some bits of it.
> An interface "HasId" could be added to protostuff-api, which could be
implemented
> by the components. Setting the id could be done after creation.
Done.
> For enums, EnumIO would implement HasId.
Done.
> The strategy will need ArrayList<EnumIO> for the id index (deser
optimization).
Added this mapping.
> For pojos, HasSchema would implement HasId.
Done.
> The strategy will need ArrayList<HasSchema> for the id index (deser
optimization).
Added this mapping.
> For collections/maps, the CollectionSchema.MessageFactory would implement
HasId.
> As well as MapSchema.MessageFactory.
Done.
> The strategy will need ArrayList (each) for the id index (deser optimization).
Added this mapping.
BTW, implementing HasId at X places using exactly the same code-snippet looks
pretty ugly, especially for CollectionSchema.MessageFactory and
MapSchema.MessageFactory, because each enumeration value needs to implement it.
> I believe that will provide the best performance as it eliminates the extra
> lookups on ser (string-to-id) and deser (id-to-string).
I see how to use it for serialization. If there is a HasId schema/EnumIO/etc
passed as a param, we just use HasId.getId() and obtain the numeric id without
any expensive look-up based on the fqcn. (BTW, we only pass in the HasId in
case of POJOs mostly. In all other cases, we pass only the FQCN. If we need to
pass a HasId for Enum or Map, we'd need to do a lookup in the caller method.
So, what do we gain then?)
But what about de-serialization? Currently, the code in ObjectSchema/etc looks
like this:
String identifier = IdentifierStrategy.readIdentifier();
// And then eventually (but not for Arrays) lookup a Schema or EnumIO or
// CollectionSchema.MessageFactory by this identifier
For Maps and Collections it is even more complex and may decide to look-up
using EnumIO or MessageFactory depending on the identifier (see
ObjectSchema.readObjectFrom)
So, just to clarify, do you suggest to replace such snippets by e.g.
HasSchema wapper = IdentifierStrategy.readIdentifierGetWrapper() ?
If this is the plan, what should be the return type of such a method? If we
want to have only one method to rule them all, we have a problem, as Schema,
EnumIO and MessageFactory do not have a common base type. So, do we want to
return Object? Or do we want to have a set of methods for reading a (numeric)
id and returning a required data structure? For example something like this:
IdentifierStrategy.readIdentifierGetSchema();
IdentifierStrategy.readIdentifierGetEnumIO();
IdentifierStrategy.readIdentifierGetCollectionMessageFactory();
IdentifierStrategy.readIdentifierGetMapMessageFactory();
Overall, I have a bit of the feeling that all the added complexity does not
justify the potential advantages that are yet to be proven.
Would be nice if you could clarify on these topics, so that I can proceed.
Original comment by romixlev
on 27 Feb 2012 at 12:47
Looking at the patch (ObjectSchema), you'll notice that IdentifierStrategy's
delegate method (readIdentifier) returns a string. The change needed is to
return the HasSchema for pojos, EnumIO for enums,
CollectionSchema.MessageFactory for collections and MapSchema.MessageFactory
for maps.
The DefaultStrategy would contain the old impl while the NumStrategy will read
the int id and return the wrapper from its index (ArrayList).
Original comment by david.yu...@gmail.com
on 28 Feb 2012 at 2:36
Hi David,
1)
> The DefaultStrategy would contain the old impl while the NumStrategy will
read the
> int id and return the wrapper from its index (ArrayList).
Yes. This is exactly how I started implementing it yesterday.
> Looking at the patch (ObjectSchema), you'll notice that IdentifierStrategy's
> delegate method (readIdentifier) returns a string. The change needed is to
return
> the HasSchema for pojos, EnumIO for enums, CollectionSchema.MessageFactory
for
> collections and MapSchema.MessageFactory for maps.
Yes. But this is what I already stated in my question, or?
So, do you want only one method that returns an Object (depending on the kind
of id), that needs to be casted to HasSchema, EnumIO or one of the
MessageFactories respectively in the ObjectSchema and other clients of this
API? Is it a correct understanding?
2) Another question. Here is a snippet from ObjectSchema class
case ID_COLLECTION:
final String collectionType = IdentifierStrategy.getInstance()
.readIdentifier(input, number, schema);
if(collectionType.indexOf('.') != -1)
{
// EnumSet
final Collection<?> c = EnumIO.get(collectionType, true).newEnumSet();
if(input instanceof GraphInput)
{
// update the actual reference.
((GraphInput)input).updateLast(c, owner);
}
COLLECTION_SCHEMA.mergeFrom(input, (Collection<Object>)c);
return c;
}
final Collection<Object> collection =
CollectionSchema.MessageFactories.getFactory(
collectionType).newMessage();
if(input instanceof GraphInput)
{
// update the actual reference.
((GraphInput)input).updateLast(collection, owner);
}
COLLECTION_SCHEMA.mergeFrom(input, collection);
return collection;
case ID_MAP:
final String mapType = IdentifierStrategy.getInstance()
.readIdentifier(input, number, schema);
if(mapType.indexOf('.') != -1)
{
// EnumMap
final Map<?,Object> m = EnumIO.get(mapType, true).newEnumMap();
if(input instanceof GraphInput)
{
// update the actual reference.
((GraphInput)input).updateLast(m, owner);
}
MAP_SCHEMA.mergeFrom(input, (Map<Object, Object>)m);
return m;
}
final Map<Object,Object> map =
MapSchema.MessageFactories.getFactory(mapType).newMessage();
if(input instanceof GraphInput)
{
// update the actual reference.
((GraphInput)input).updateLast(map, owner);
}
MAP_SCHEMA.mergeFrom(input, map);
return map;
As you can see, in the case of maps and collections, it is not known in advance
whether we need MessageFactories or EnumIO. So, how should the readIdentifier
implementation know what to return in this case?
3) Arrays just write/read the string identifier. But there is no explicit
wrapper for Array objects. What do we do with it? Or should ArrayWrapper
implement HasId to follow other kinds of schemas?
Original comment by romixlev
on 28 Feb 2012 at 5:06
1. A method for each type (pojo/enum/collection/map).
E.g
HasSchema<Object> hs = strategy.resolvePojo(...)
EnumIO<?> eio = strategy.resolveEnum(...)
2. strategy.newMap would return Map. strategy.newCollection would return
Collection.
You will know if the component is an enum because of id's first two bits.
3. We could implement it later after this.
Original comment by david.yu...@gmail.com
on 28 Feb 2012 at 11:16
> 2. strategy.newMap would return Map. strategy.newCollection would return
> Collection.
> You will know if the component is an enum because of id's first two bits.
Was a bit busy with other things, but now I'm almost finished. Still have a
question here, because I do not quite understand "You will know if the
component is an enum because of id's first two bits":
When I've read ID_COLLECTION or ID_MAP in the input stream in
ObjectSchema.readObjectFrom() method, I haven't seen the numeric id that
follows it yet! So, I don't know in readObjectFrom if I'll need to lookup via
EnumIO or a proper MessageFactories. And I don't have the first two bits of the
numeric id yet to decide.
Therefore, I see the following alternatives:
1) First read the identifier as string (because default policy cannot read it
as integer) and then decide what to do - this is what is done now
2) Introduce in IdentifierStrategy methods resolveMapOrEnum and
resolveCollectionOrEnum. They would read numeric/string id, do lookup or use
first 2 bits, then determine if it is enum or map/collection and return a
corresponding object, which needs to be casted in ObjectSchema into something
appropriate.
3) Store ID_MAP and ID_COLLECTION in a special way: if the identifier requires
enumio lookup, use new tags ID_MAP_ENUM and ID_COLLECTION_ENUM respectively in
writeObjectTo. Then, in readObjectFrom have special switch cases for
ID_MAP_ENUM and ID_COLLECTION_ENUM.
Or may be I miss something and still misunderstand you?
Thanks,
Leo
Original comment by romixlev
on 2 Mar 2012 at 3:11
I see you've been busy on the jmeter plugins :-) We've a lot of groups in
common hehe.
I haven't looked though the enum handling. Your third suggestion sounds good.
The important thing is the strategy needs to know whether to return EnumMap or
EnumSet.
ObjectSchema can simply receive those as Map and Collection without any extra
handling.
Original comment by david.yu...@gmail.com
on 2 Mar 2012 at 8:02
Hi,
I implemented the the handling of POJOs, paps, collections and enums as
discussed. We have a HasId interface now and serialization, de-serialization is
done avoiding any expensive look-ups. It uses only indexed access to ArrayLists.
What could be improved:
1) It uses different ArrayLists for enums, pojos, collections, maps. May be a
single ArrayList can be used using the shitfing by 2 bits trick, which would
allow to encode the kind of entry using a 2bit tag. But this is a minor issue,
as it does not affect performance. Only the memory consumption will be improved.
2) The fact that ids of types are stored in the HasId implementations makes it
difficult (if not impossible) to use different IdentifierStrategies with
different numeric ids for the same type. A possible workaround could be to
treat the ids stored in HasId implementations as internal unique ids of those
types and then have another mapping from those ids to a
IdentifierStrategy-specific numeric ids. It would require one additional
ArrayList<Integer> (for more details see my proposal in the previous comments)
3) More unit tests would be nice. In particular, I'm not 100% confident about
ObjectSchema.transferObject and a few other functions.
-Leo
Original comment by romixlev
on 6 Mar 2012 at 2:38
Attachments:
1. Sure. But as you said, its a non issue.
2. Yep. You can easily have a strategy w/c maps the internal ids (or fqcn) to
another set of ids.
3. Definitely. This will be a good feature for 1.0.5
So I'm guessing you've measured the performance gains with this approach
compared to the last one (or the stock one)? It should be scream.
I'm thinking of making the strategy static final (non-switchable) once
protostuff-runtime is loaded (RuntimeEnv). Configuration will be via system
properties (ala RuntimeEnv). This will be for concurrent
correctness/performance. What do you think?
By using a different classloader, you can load 2 instances of
protostuff-runtime (via jboss-modules) and have your main application
communicate with use both (I might be wrong here).
Anyway, good job! I'll take a closer look at the patch.
Original comment by david.yu...@gmail.com
on 6 Mar 2012 at 4:58
> So I'm guessing you've measured the performance gains with this approach
compared
> to the last one (or the stock one)? It should be scream.
Actually, I haven't measured the performance. Didn't have time and I was more
interested in smaller size of serialized representation rather than in shorter
serialization times. But IIRC, even before the last ArrayList implementations
it was about 2 times faster then the stock version, probably because it had to
write/read 3-4 times less bytes.
It would be cool, if you could use one of the serializer benchmarks to test it.
I'm really curious about the improvements it may give.
> I'm thinking of making the strategy static final (non-switchable) once
protostuff-
> runtime is loaded (RuntimeEnv). Configuration will be via system properties
(ala
> RuntimeEnv). This will be for concurrent correctness/performance. What do
you
> think?
I'm not sure I like non-switchable approach that much. You remember the
use-cases I'm interested in ;-) But I think that adding/removing anything for a
given mapping should be forbidden, once it is in use (i.e. read/write was
called at least once). Otherwise it may lead to unpredictable results.
At the same time, I'd still like to have the possibility to assign a new
default policy. And I still consider the thread-local approach or the like.
I also do not really belive that declaring the strategy as final static will
make things much faster. But that can be just tested using a benchmark, or? If
it really brings something (let's say than 10-15%) than its worse effort. If
not, it would just reduce the flexibility.
>By using a different classloader, you can load 2 instances of
protostuff-runtime
>(via jboss-modules) and have your main application communicate with use both
(I
>might be wrong here).
Actually, I do not use prostuff with JBoss. I use it in standalone apps and try
to use it with Hazelcast, as a replacement for their default serialization.
May be the class-loader approach described by you is possible, but it is too
complex, IMHO. I think it should be possible to assign new global policy.
Looking forward to your review of the patch and benchmark figures!
Original comment by romixlev
on 6 Mar 2012 at 6:26
"I think it should be possible to assign new global policy. "
It is important for the static strategy to be set only once (atleast from a
regular user's POV). But your strategy can make it switchable (your use-case),
by wrapping another strategy (not final, hence switchable) and delegate.
Original comment by david.yu...@gmail.com
on 6 Mar 2012 at 8:21
"It would be cool, if you could use one of the serializer benchmarks to test it"
The existing ones do not have a polymorphic dataset. Will try to make one on
the junit tests.
Original comment by david.yu...@gmail.com
on 6 Mar 2012 at 8:23
Hi David,
Any news regarding benchmarking or plans for including this feature into the
trunk?
Original comment by romixlev
on 17 Mar 2012 at 11:39
This will definitely be included. I'm still trying to find time to sweep the
pending issues and do the next release. (Hopefully this weekend)
Original comment by david.yu...@gmail.com
on 20 Mar 2012 at 10:03
I've attached the modified patch (Note that 22 and 23 are used for enum_set and
enum_map). The test passes for the core formats including json and xml.
Benchmarks are next.
I'll also try to add the non-strict num strategy.
Original comment by david.yu...@gmail.com
on 26 Mar 2012 at 12:50
Attachments:
Here's v2 which allows registration of custom collections/maps (w/c will also
now work without explicit registration when DefaultIdStrategy is used).
This will be committed soon. Let me know if you have any concerns/issues.
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 11:14
Attachments:
I'm wondering if the other id strategies (other than default) should be in
another module.
RegisteredIdStrategy (strict ... already implemented)
IncrementalIdStrategy (generate ids on the fly ... to be implemented)
These 2 files being jarred will result to ~30kb alone.
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 11:18
v3.diff
"I work for a company which does a lot with distributed systems coming from
different vendors. Same encoding/numbering cannot be always ensured. Each
client may be implemented/operated by a different 3rd party, which may have its
own numbering of FQCN ids. Some of their classes are not even known to us in
advance, as they can use our systems to exchange any kind of information,
including their own classes (think about event buses, data grids and similar
frameworks). We would usually receive their serialized messages over a network,
do something with it and eventually forward to a different client (which may
use a different numbering schema). We have no easy way to enforce a unique
non-conflicting numbering. Therefore, I suggested that numbering should be
associated to streams, because this way you associate it with a specific stream
coming from a specific client and one specific client is hopefully using an
internally consistent way of serialization for its own messages.
How would you cover such a use-case with your proposed approach? Or is it out
of scope for protostuff?"
This is now doable (with the v3.diff). You can have an IdStrategy you can use
for specific incoming and outgoing streams.
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 1:37
Attachments:
> Here's v2 which allows registration of custom collections/maps (w/c will also
now
> work without explicit registration when DefaultIdStrategy is used).
Cool!!!
Would it also work for very custom colllections/maps that are not derived from
Collection or Map interfaces? If not, how would one handle them? This probably
relates to our discussion on the Issue 88. But even if it does not work, it is
a relatively seldom use-case probably. I'm asking about it for the sake of
completeness.
> I'm wondering if the other id strategies (other than default) should be in
another
> module.
> RegisteredIdStrategy (strict ... already implemented)
> IncrementalIdStrategy (generate ids on the fly ... to be implemented)
> These 2 files being jarred will result to ~30kb alone.
You are right, may be having them in the dedicated module/jar is a good idea.
If people don't want to use them, they should not carry >= 30KB with them.
Especially on Android devices and the like, where storage space could be a bit
limited.
> This is now doable (with the v3.diff). You can have an IdStrategy you can
use for
> specific incoming and outgoing streams.
Very good!
My God! This is a very big patch. You performed quite a huge refactoring to
propagate the strategy parameter through many, many APIs. I was too lazy to do
that ;-)
I just looked at the diff, but have not applied it yet. But at the first glance
I missed any test case for using an IdStrategy for specific incoming and
outgoing streams. Have I overlooked such a test case or it is not there yet?
Original comment by romixlev
on 27 Mar 2012 at 2:16
Another question:
If I look at AbstractRuntimeObjectSchemaTest.IdStrategyFactory implementation,
I see that it registers not only custom classes, but also the standard ones
like CollectionSchema.MessageFactories.ArrayList and
CollectionSchema.MessageFactories.HashSet.
I'm wondering if it is safer to have a re-usable code that pre-registers all
built-in and standard classes, just to be on the safe side and ensure that for
those classes the encoding is always the same. This code can then be called
from any strategies that protostuff will ship or from custom ones.
What do you think?
Original comment by romixlev
on 27 Mar 2012 at 2:28
"Would it also work for very custom colllections/maps that are not derived from
Collection or Map interfaces? If not, how would one handle them? This probably
relates to our discussion on the Issue 88. But even if it does not work, it is
a relatively seldom use-case probably. I'm asking about it for the sake of
completeness."
You do it the usual way. Create a schema for that collection/map and register
it like a regular message. You'll have explicit control on how it is
serialized.
"I just looked at the diff, but have not applied it yet. But at the first
glance I missed any test case for using an IdStrategy for specific incoming and
outgoing streams. Have I overlooked such a test case or it is not there yet? "
See AbstractRuntimeObjectSchemaTest.
Say for your use case. You basically associate a specific IdStrategy to your
input/output.
Creating a schema from a specific IdStrategy would basically be:
Schema<Foo> schema = RuntimeSchema.getSchema(Foo.class, idStrategy);
That schema and all its dependencies will be based from that id strategy
(isolated).
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 2:34
"I'm wondering if it is safer to have a re-usable code that pre-registers all
built-in and standard classes, just to be on the safe side and ensure that for
those classes the encoding is always the same. This code can then be called
from any strategies that protostuff will ship or from custom ones. "
The RegisteredIdStratedy is strict in a sense that you have full control of
every single detail.
The reason for which is that ids 1-15 only take 1 byte to serialize (with the
protobuf tag). So if a user knows he has more custom collections that are
frequently used (for example he has 15), he will probably wanna allocate the
first 15 ids to it.
Nothing beats full control if you ask me.
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 2:38
> Say for your use case. You basically associate a specific IdStrategy to your
> input/output.
> Creating a schema from a specific IdStrategy would basically be:
> Schema<Foo> schema = RuntimeSchema.getSchema(Foo.class, idStrategy);
> That schema and all its dependencies will be based from that id strategy
(isolated).
OK. I think I got it.
So, one basically creates idStrategy-bound schemas for the required set of
classes used for serialization. And then one obtains different schemas for the
same class by binding it to different idStrategies. One this is done, you just
use a corresponding idStrategy-specific schemas to read/write in a required
serialization format. Right?
One question, just to check:
When implementing transferXXX from input to output stream, do you take into
account that input IdStrategy could be different from the output idStrategy. Or
is it assumed that they are the same in case of tranfser-methods?
> The RegisteredIdStratedy is strict in a sense that you have full control of
every
> single detail.
> The reason for which is that ids 1-15 only take 1 byte to serialize (with the
> protobuf tag). So if a user knows he has more custom collections that are
> frequently used (for example he has 15), he will probably wanna allocate the
first
> 15 ids to it.
> Nothing beats full control if you ask me.
I agree with you.
My point was to provide some sort of a boiler plate code for those, who are not
such a control junkie ;-) If they do not want to control everything up to the
last bit, then having a common predefined way to register most popular types
and classes is just safer and quicker during development. Otherwise they may
need to do several iterations of edit/compile/run before they include all
required standard classes used by their app into the registry.
Original comment by romixlev
on 27 Mar 2012 at 2:51
One more comment:
I have the impression that with your current patch one needs to always provide
a concrete numeric id, when registering a class.
What about a use-case, where registration of classes does not happen in one
place (e.g. in one class or method), but is spread among few places, which are
written by different parties (e.g. by a 3rd party) and therefore do not know
exactly, which id ranges are already used.
May be there should be a convenience method to assign a next free id to a given
class when it is registered?
Or at least a method to get a number of currently registered classes (more
precisely, the highest used id) for a given strategy. Then one can generate the
next id itself by using (maxId + 1) when calling the register method.
Original comment by romixlev
on 27 Mar 2012 at 3:02
That is handled by the IncrementalIdStrategy (a hybrid where you can register
and allocate the first x ids, then the rest can be dynamic).
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 3:33
Attached v4.diff (includes the incremental id strategy).
Kindly take a look at the impl. If you have a better idea on how to improve
its perf, let me know.
If all is well, I'll commit this tomorrow.
The registered and incremental id strategies will likely be on its own module.
Original comment by david.yu...@gmail.com
on 27 Mar 2012 at 6:57
Attachments:
Hi David,
I had a quick look at the v4.diff
Here are my quick comments:
1) RegisterIdStrategy uses ArrayLists for Integer->Class mappings. But
IncrementalIdStrategy uses ConcurrentHashMaps for the same mapping. Is it on
purpose? I guess it introduces some overhead.
2) RegisterIdStrategy and IncrementalIdStrategy do usually at least one lookup
in the Map (Class->Integer) to get the id of the class. The reason for that is
the fact that you pass Class as a parameter almost everywhere in IdStrategy
functions. In my version I passed Schema or HasSchema almost everywhere and
they had getId() method directly. But probably it was doing a similar lookup on
the caller's side to obtain a Schema/HasSchema/SchemaWrapper object from the
value.
3) IncrementalIdStrategy allows to reserve some initial ids and start assigning
ids from a given number. But it does not implement the method "register" and
throws an exception. The comment for this class says "and you can initially
register classes". My understanding was that one can use reserved ids for
initially registered classes and all others get their ids autoassigned. But how
can such classes be initially registered then?
4) Nice to have (was present in my patch): There is no shipped strategy which
would be almost like IncrementalIdStrategy, but would allow to send/serialize a
mapping from a FQCN to the numeric id (i.e. would send both FQCN and numeric
id) when it is used first time. The idea with such a strategy is that the
receiver/deserializer does not need to know in advance the mapping used by the
sender. The receiver can get it from the input stream directly and build a
local strategy on the fly as it reads input. So, the sender more or less sends
a dictionary/mapping entry by entry when it discovers a new entry. The use-case
for such a strategy: you get a serialized message from a 3rd party client and
you have no idea in advance about encoding it uses. So, either you need to
handshake a dictionary/idStrategy with that 3rd party client by some other
means or you use such a strategy like I described and then you can deserialize
such messages as long as classes with mentioned names are present in your
class-loader.
Original comment by Marina.L...@gmail.com
on 27 Mar 2012 at 8:14
One more small comment (I repeat it, as you haven't answered)
When implementing transferXXX (e.g. in Idstrategy-based classes) from input to
output stream, do you take into account that input IdStrategy could be
different from the output idStrategy. Or is it assumed that they are the same
in case of tranfser-methods? My feeling is that you assume at many places that
the same strategy is used for both streams.
Original comment by Marina.L...@gmail.com
on 27 Mar 2012 at 8:19
First of all, Marina == romixlev?
1. ConcurrentHashMap is needed as it needs to be thread safe and the ids are
atomically allocated. RegisteredIdStrategy on the other hand is immutable
hence you don't pay for the concurrent data structures.
2. "In my version I passed Schema or HasSchema almost everywhere and they had
getId() method directly"
Basically you looked up the schema first then you handed it to IdStrategy.
Here, the IdStrategy is the one who looks up first then returns the schema. So
now time lost here. Basically the lookup is delegated from
ObjectSchema/DerivativeSchema to the IdStrategy.
3. Ah yes, I'll add public methods to do so. Note that the register method
does not take an it (It is only there for backwards compatibility ... ensuring
the others who use RuntimeSchema.register(clazz) will work out-of-the-box
because its implemented on DefaultIdStrategy)
4. See #5.
5. It must be the same IdStrategy otherwise the id lookups would fail. You
can although implement a custom strategy for your needs reads from a
DefaultIdStrategy (fqcn) and writes an id (ala
RegisteredIdStrategy/IncrementalIdStrategy)
The idea is on transfer, you:
1. Read the string fqcn
2. Load the class
3. Assign an int id
4. Write the id
As you said, this will work as long as the class is available on the receiving
end's classloader.
Use the existing strategies as a template and customize to fit your needs 100%.
Original comment by david.yu...@gmail.com
on 28 Mar 2012 at 3:10
> First of all, Marina == romixlev?
Oops. Sorry. Yes, it is me, romixlev. Obviously, I forgot to log out from my
wife's gmail account at home, when I was submitting a comment ;-)
> 1. ConcurrentHashMap is needed as it needs to be thread safe and the ids are
> atomically allocated. RegisteredIdStrategy on the other hand is immutable
hence
> you don't pay for the concurrent data structures.
I see your point. But I think that the price/overhead is too high. IMHO, a
typical scenario would first (or let's say in the initial phase) register
explicitly some classes with reserved ids and then auto-register classes using
the usual IncrementalIdStrategy's approach. Once this is over, there are no
changes to be expected, especially changes that happen often. Therefore using
ConcurrentHashMap could be an overkill. May be some sort of locking in the
initial phase only and later no locks? E.g. introducing some sort of a flag
that would indicate that the registry is built now and no further changes are
possible? And methods to set/reset it (sort of freeze/unfreeze)? If this flag
is set, then registry becomes immutable and there is no need for any locking.
If flag is not set, then registry is still being built and locks should be used
when accessing the mappings....
Do you think something can be tweaked along these lines to make performance
better for the majority of typical scenarios using this strategy?
2) Regarding the lookups moved from ObjectSchema to IdStrategy: yes, you are
right. This is what I guessed.
5) I see your point. My version also didn't implement it, i.e. using different
strategies for input and output. Doing so in the most general case would
require an association between streams and strategies, I guess. Then one could
say something like:
inPojoId = in.getStrategy().resolvePojoId...();
outPojoId = ...; // find in the out.getStrategy() what is the id corresponding to inPojoId class
out.getStrategy().writePojo(outPojoId);
But let's forget about it for now. It is not the most important feature ;-)
Original comment by romixlev
on 28 Mar 2012 at 6:52
With a hard limit, we can use an array list for the ids instead of
ConcurrentHashMap.
E.g, you specify a max, then the array list will allocate once and will never
resize (and will be protected by a volatile field as the happens-before
barrier).
Will attach the final patch and I think its good to go.
Original comment by david.yu...@gmail.com
on 28 Mar 2012 at 7:31
> With a hard limit, we can use an array list for the ids instead of
> ConcurrentHashMap.
Yes.
> E.g, you specify a max, then the array list will allocate once and will never
> resize (and will be protected by a volatile field as the happens-before
barrier).
> Will attach the final patch and I think its good to go.
In ideal case, it would be nice to avoid setting max in advance, as you
probably don't know the sharp max limit (though you can always set it to a
reasonably big number, which should be OK for sure). That's why I was thinking
about using something resizable at the beginning and then, once one says
"freeze" it becomes immutable and can use a more efficient implementation
(ArrayList) from now on. But this is probably too complex compared to just
reserving enough space for max possible number of items.
Original comment by romixlev
on 28 Mar 2012 at 8:18
Original issue reported on code.google.com by
hpgisler...@gmail.com
on 2 Dec 2011 at 7:16