Open GoogleCodeExporter opened 9 years ago
It's a neat idea.
The largest savings comes from avoiding writing class name strings. This would
be a good improvement. Using class name strings at all is a suboptimal path,
but sometimes can't be avoided. The first byte DefaultClassResolver writes is
either for null, to denote the next bytes are a class name, or a class ID. We
could add another value to denote if the default serializer should be used.
This feature should be handled by the framework, I don't think it needs to
affect serializers.
Using a bitset header to make serialization more efficient could be nice,
though it complicates the serializer implementation.
Recently FieldSerializer's complexity has increased quite a lot. I think it
should be made easier to follow if possible, especially before adding new,
complex features. I would love to see functionality separated out of the 1000+
line class if it makes sense. Possibly there could be a utility class that
deals with generics. Maybe another that deals with cached fields. Maybe it
makes sense to have a separate cached field utility for unsafe. Maybe
FieldSerializer has an instance of these classes and they are configured
directly, or maybe multiple FieldSerializer-like classes is better. What do you
think?
Original comment by nathan.s...@gmail.com
on 22 Aug 2013 at 7:20
> This feature should be handled by the framework, I don't think it needs to
affect serializers.
It depends. If you consider those serializers which were registered using
addDefaultSerializer or setDefaultSerializer as default ones, then yes, we
don't need to affect serializers. But if we have something like an instance of
a map serializer with default keyType or valueType set explicitly for it (e.g.
as a guess based on probing the first entry and/or by inspecting generic type
parameters), then "default" means a serializer specific setting and I'm not
sure that it can be handled completely by the framework without involving some
logic inside a serializer. But this is just a feeling at the moment. I need to
get my hands dirty with the implementation to better understand it.
> Using a bitset header to make serialization more efficient could be nice,
though it complicates the
> serializer implementation.
True. That's why I listed it a possible improvement, but not as a main
embodiment of the idea. It may require "backpatching" of the header after all
fields are processed, which could be difficult if the whole object does not fit
into a buffer.
Regarding the complexity of the FieldSerializer:
I totally agree that it is way too complex now. I could be a good idea to split
it into parts, each handling its own feature (e.g. generics, cached fields,
unsafe, etc). I'll see what can be done.
Original comment by romixlev
on 22 Aug 2013 at 7:51
I did a first step in re-factoring FieldSerializer. A lot of generics-related
and unsafe-related logic is moved into dedicated classes. FieldSerializer is
only about 620 lines of code now. All unit tests are still green. The code is
already committed.
Original comment by romixlev
on 22 Aug 2013 at 11:27
I have some progress on a more compact representation - the simple version
described in the proposal seems to work fine for FieldSerializer.
I'm looking now into a way how we handle references. Currently, we write one
byte for each new object. In principle, it could be avoided. We always know
when a new object starts and therefore we don't need a special marker for it.
Instead, we could write a special marker followed by the number of the object
only when we really refer to a previously seen object.
Together, it would result in something like:
- Each object of a primitive class is written without any headers
- If we have an object whose real class is the same as its declared class, we
just write a special tag to indicate this
- if we have an object whose real class is different from its declared class,
we write a special tag indicating that a class id is following
- If we refer to a previously seen object, we write a special tag indicating
that it is a reference and then the number of the object
When we deserialize, at any place where we expect an object we check the first
byte which contains a tag:
- if it is a reference tag, then it is a reference to a previously seen object
- if it is a special tag used to indicate that a real class is the same as the
expected/declared class, then we know that we read an object of the expected
class
- if it is a class id tag, we read a real class of the object after it
The difference from the current approach is:
- We do not output a dedicated "reference tag" byte before each object
- We output a class id tag before each class id, which is written (one could
say that we reuse the reference tag byte from a previous bullet for a different
purpose)
- If class id is the same as expected/declared class, only a single byte tag is
written, without writing a real class id
- We output a dedicated "reference tag" before each real reference inside
object graph
As far as I can see, the old approach always produces a representation which is
at least as long as a new proposed representation.
Original comment by romixlev
on 21 Oct 2013 at 10:13
When we deserialize, when we expect an object, if no tag was written how do we
know if they byte we read is a tag or data? Are only primitives (and strings)
written without a tag?
Most class IDs (0 to 127) are 1 byte. It seems with the proposal we write out
fewer bytes in some cases and more bytes in others. Not writing bytes for
primitives would likely be worthwhile.
Original comment by nathan.s...@gmail.com
on 21 Oct 2013 at 12:42
> When we deserialize, when we expect an object, if no tag was written how do
we know if they byte
> we read is a tag or data? Are only primitives (and strings) written without a
tag?
Yes. This is the idea. Only fields of primitive types don't have a tag. But we
know this statically, so we we know that we expect data at this place.
> Most class IDs (0 to 127) are 1 byte. It seems with the proposal we write out
fewer bytes in some
> cases and more bytes in others.
I'd say that in case when references enabled, the new proposal is strictly
better in most cases (because we don't write a reference byte before each
object, but write a byte before each class id, i.e. we don't introduce any
additional bytes). In case, where we don't use references, it could be that the
old approach is better, because we don't write any reference tags at all and we
don't write any additional tags for class ids.
> Not writing bytes for primitives would likely be worthwhile.
Sure. But we don't write them even now, or? I.e. we don't write any tags for
ints, floats, etc. And we don't write any tags for fields declared using final
classes anyway. So, which use-cases do you have in mind here?
Original comment by romixlev
on 21 Oct 2013 at 2:52
> So, which use-cases do you have in mind here?
Sorry, you're right. I was confused trying to figure out how the proposal is
better. How exactly is the proposal better? What does writing a class ID tag
buy us if we assume a class ID is 1 byte (which is common)?
Writing more bytes when references are disabled isn't great. So far that has
been the most efficient mode, both in speed and size.
Original comment by nathan.s...@gmail.com
on 21 Oct 2013 at 3:08
As I mentioned, it brings some obvious advantages in case when references are
enabled. But in case when they are disabled and we have only 1-byte class ids,
it would not bring any benefits.
So, we should probably use it only when references are enabled, if at all.
Or may be we can avoid the explicit byte for a class id tag . We can just treat
class ids as a special kind of tags. The first few ids are reserved as special
tags (e.g. REFERENCE tag, SAME AS DECLARED CLASS tag, etc). All others values
are simply treated as class ids. This is similar to what we do now. The
difference will be that we don't write reference tag byte before each object.
We would write it only when we really refer to an object. And we don't write
an explicit class id, if the real class is the same as a declared class. Now
it seems to be at least as efficient as our current approach or better in case
when references are enabled. I think it covers all cases now, but I need to
think a bit more about it.
Original comment by romixlev
on 21 Oct 2013 at 3:23
The main improvement would be then that we use the same byte to indicate a
reference tag and class ids, which is different from our current approach where
we use separate bytes for it. It would also mean that we would need to
read/write references and class ids using a single method, because it would
need to read/write a byte and treat it as a reference or class id depending on
its value.
Original comment by romixlev
on 21 Oct 2013 at 3:35
> As I mentioned, it brings some obvious advantages in case when references are
enabled.
It still isn't clear to me what the advantages are. You haven't explicitly
stated them.
> The difference will be that we don't write reference tag byte before each
object. We would write it only when we really refer to an object.
Not sure what you mean here, sorry. When references are on, we currently write
a varint that both indicates null/not null and the reference ID.
Original comment by nathan.s...@gmail.com
on 21 Oct 2013 at 3:39
>> The difference will be that we don't write reference tag byte before each
object. We would write it
>> only when we really refer to an object.
> Not sure what you mean here, sorry. When references are on, we currently
write a varint that both
>indicates null/not null and the reference ID.
Currently, when references are on, we write a varint (single byte) before EACH
object we see for the first time, just to mark it. We do it even if this object
is never referenced later. The new proposal avoids doing it. It only writes a
dedicated tag (single byte) and a reference ID (varint) every time a real
reference is discovered. Since in most cases no real references are discovered
in the object graph, we save a byte per object for almost all objects.
Original comment by romixlev
on 21 Oct 2013 at 4:27
[deleted comment]
Got it, thanks. The reference ID we currently write can also mean null/not
null, so writeObjectOrNull won't be able to avoid writing a byte before the
object data when serializer.getAcceptsNull() is false. A number of serializers
accept null, but FieldSerializer doesn't (for now, maybe in the future it uses
a bitset for nulls), which means most objects need a null denoting byte. This
limits the byte savings from the proposal. Still, I like the idea and if we can
save bytes for serializers that accept null, that would be great. However, I
worry about how it might impact the API, since a method like writeClass
currently doesn't need to know about references. Having Kryo handle references
relatively transparently (there is just the Kryo#reference() magic) is nice
API-wise.
Original comment by nathan.s...@gmail.com
on 21 Oct 2013 at 11:00
Yet another improvement which I already implemented in my local branch is like
this:
- If registration of classes is not required by a Kryo instance, we currently
output a class name and assigned class ID, when we see this class for the first
time and later we output 2 bytes, every time we see this class. First byte has
a value of NAME (which indicates that it is a non-registered class) and the
second byte (or more precisley varint) contains the id of this unregistered
class.
- What I did is to use positive and negative numbers as class ids. If class id
is positive, then it is a usual registred class ID. But if we need to encode a
non-registered class, we simply use a negative id.
- This approach is able to cover most of the typical cases using only 1 byte,
independent of the fact if a given class is registered or not, which is an
improvement.
- Small disadvantage: since it uses one bit for a sign, it can now encode in
one byte only a half of registered classes, which could be encoded with the
current approach, i.e. only 64 classes.
Original comment by romixlev
on 31 Oct 2013 at 4:32
Kryo was originally optimized for a networking protocol, so registering all
classes, not using references, etc. Keeping the number of registered class IDs
that can be representing with 1 byte was more important than not using
registration, which is already not a fast path since we are writing class
names. I could go either way on your change, though I lean a bit toward not
optimizing when registration is disabled if it affects the faster paths.
Original comment by nathan.s...@gmail.com
on 31 Oct 2013 at 8:51
Original issue reported on code.google.com by
romixlev
on 21 Aug 2013 at 4:22