Switch to textual serialization format

jglick commented 11 years ago

Would be better to finally drop use of Java serialization, and switch to some reasonably compact, Unicode-safe format that supports persistence of the things SezPoz needs.

Ideally in a compact text format. JSON would work if it is possible to embed (shade) a very small parser/generator.

Existing serialized indices would of course still need to be loaded for compatibility.

Originally SEZPOZ-2.

dscho commented 11 years ago

I'll have a crack at it! This will also make @ctrueden happy.

jglick commented 11 years ago

JSON possibilities in Central: net.minidev:json-smart; com.eclipsesource.minimal-json:minimal-json; com.googlecode.json-simple:json-simple. https://github.com/mmastrac/nanojson also looks promising though apparently not in Central.

jglick commented 11 years ago

Careful with primitive types/wrappers, though; JSON has no equivalent to Character, and does not accurately distinguish between numeric types.

jglick commented 11 years ago

On reflection, XML would probably do fine, and avoid the need for any new dependency. Main concern is handling of XML-reserved characters: not just < and & and " but U+0000–U+001F and some others (which are unlikely but legal parts of attribute values). I bet a SAX parser with no validation or namespace support would be sufficiently fast for this purpose (LazyIndexIterator.peek); can check in perftest module.
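
To make the reserved-character concern concrete: `<`, `&` and `"` can simply be escaped, but XML 1.0 does not allow most of U+0000 to U+001F at all, not even as character references, so a writer would have to reject such values or encode them itself. A minimal sketch of the check, where `XmlChars` and its methods are hypothetical names and not part of SezPoz:

```java
final class XmlChars {
    private XmlChars() {}

    /** True if every code point of s is permitted by the XML 1.0 Char production. */
    static boolean isXml10Safe(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    /** XML 1.0 Char: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]. */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Values that pass this check still need ordinary escaping, which a standard XML writer handles; values that fail it cannot be stored verbatim in an XML 1.0 document.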

Suggested format for META-INF/annotations/$name.xml (with unnecessary whitespace stripped by default, perhaps overridable via processor option):

<?xml version="1.0" encoding="UTF-8"?>
<r> <!-- need some wrapper element -->
  <e c="the.Clazz"> <!-- SerAnnotatedElement -->
    <v n="attributeName"> <!-- one value of annotation -->
      <s>text</s> <!-- String value -->
    </v>
    <v n="arrayValued"><l><s>first</s><s>second</s></l></v>
    <v n="annotationValued">
      <a n="the.Annotation">
        <v n="itsOwnOptionalValue"><s>…</s></v>
      </a>
    </v>
    <v n="classValued"><c n="the.ClassValue"/></v>
    <v n="enumValued"><n e="the.EnumType" c="CONSTANT"/></v>
    <v n="byteValued"><B>-13</B></v>
    <!-- or could use numeric value, but less readable: -->
    <v n="charValued"><C>@</C></v>
    <v n="doubleValued"><D>123.45E15</D></v>
    <v n="floatValued"><F>-0.01</F></v>
    <v n="intValued"><I>17</I></v>
    <v n="longValued"><J>1234567890123</J></v>
    <v n="shortValued"><S>55</S></v>
    <v n="booleanValued"><Z>true</Z></v>
  </e>
  <!-- empty element if no values: -->
  <e c="other.Clazz" m="methodName"/>
  <e c="other.Clazz" f="fieldName"/>
</r>

Typical contents for e.g. META-INF/annotations/hudson.Extension.xml would be reasonably compact:

<?xml version="1.0" encoding="UTF-8"?><r><e c="my.plugin.Extension1"/><e c="my.plugin.Extension2"/></r>

jglick commented 11 years ago

Actually could use XMLStreamReader since we depend on Java 6 now anyway.
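
A rough sketch of what reading the suggested format with `XMLStreamReader` might look like. It only extracts the element identity (the `c`, `m` and `f` attributes) and ignores the nested `<v>` values; `XmlIndexSketch` and `readElements` are hypothetical names, and the string form of the result is just for illustration:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class XmlIndexSketch {
    private XmlIndexSketch() {}

    /** Returns one string per <e> element: class, class#method() or class#field. */
    static List<String> readElements(String xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // No DTD processing is needed for this format; validation is off by default.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        List<String> result = new ArrayList<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() != XMLStreamConstants.START_ELEMENT
                            || !"e".equals(reader.getLocalName())) {
                        continue;
                    }
                    String clazz = reader.getAttributeValue(null, "c");
                    String method = reader.getAttributeValue(null, "m");
                    String field = reader.getAttributeValue(null, "f");
                    if (method != null) {
                        result.add(clazz + "#" + method + "()");
                    } else if (field != null) {
                        result.add(clazz + "#" + field);
                    } else {
                        result.add(clazz);
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            // Wrapped so that callers of this sketch need not handle a checked exception.
            throw new IllegalArgumentException("malformed index", e);
        }
        return result;
    }
}
```

Because the reader is pull-based, a lazy iterator could stop after each `<e>` element instead of collecting everything up front as this sketch does.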

dscho commented 10 years ago

I did a completely home-grown JSON-like thing ;-)

kmader commented 10 years ago

Is there any status on this issue? This is a blocker for me, because the binary serialized output format makes it impossible to build uberjars that combine multiple jars with SezPoz annotations. The standard Maven Shade resource transformers (appending, XML-appending; see http://maven.apache.org/plugins-archives/maven-shade-plugin-1.7.1/examples/resource-transformers.html) do not work with the annotation indices produced by SezPoz.

dscho commented 10 years ago

@kmader well, we switched away from SezPoz and implemented our own annotation processor. Since we used SezPoz before, we even have legacy support to use (but not generate) SezPoz-compatible annotation indexes (even with class path libraries different from Oracle's). It is BSD licensed, so feel free to steal^Wuse it.

kmader commented 10 years ago

@dscho Thanks for the suggestion, it looks like it is very similar in API to SezPoz. Does it handle @Target(ElementType.FIELD)? I have swapped it into my current code and am getting "error: Cannot handle annotated element of kind FIELD" messages at compilation.

jglick commented 10 years ago

Looking back at this, both JSON and XML seem like overkill. And JSON is not really that desirable, since a useful property is appendability with built-in Maven aggregators, for which you only have line-by-line or XML—JSON would still require a custom aggregator.
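
To illustrate the appendability point: if each entry occupies one line, the plain concatenation that an appending aggregator produces is itself a valid index. A small sketch of reading such a concatenation back, where `LineIndexMerge` and `readEntries` are hypothetical names and dropping duplicates and blank lines is an assumption about what a reader would want:

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LineIndexMerge {
    private LineIndexMerge() {}

    /** Splits concatenated index text into entries, skipping blank lines and duplicates. */
    static List<String> readEntries(String concatenated) {
        Set<String> entries = new LinkedHashSet<>(); // keeps first-seen order
        for (String line : concatenated.split("\r?\n")) {
            String entry = line.trim();
            if (!entry.isEmpty()) {
                entries.add(entry);
            }
        }
        return new ArrayList<>(entries);
    }
}
```

No equivalent shortcut exists for two JSON documents, since gluing them together does not yield a single valid document.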

Whether using JSON within a line or not, parsing can be simplified by writing all values as strings, not trying to use boolean/numeric primitives at all. For example, in a simple non-JSON format (with runtime type checking), my previous XML example might read:

the.Clazz attributeName="text" arrayValued=["first" "second"] annotationValued={itsOwnOptionalValue="…" anotherValue="…"} classValued="the.ClassValue" enumValued="CONSTANT" byteValued="-13" charValued="@" doubleValued="123.45E15" floatValued="-0.01" intValued="17" longValued="1234567890123" shortValued="55" booleanValued="true"
other.Clazz#methodName()
other.Clazz#fieldName
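
The "runtime type checking" part could look roughly like the following: every value is stored as a string, and the loader converts it according to the declared type of the annotation member (which in practice would presumably come from `Method.getReturnType()` on the annotation interface). `ValueCoercion` and `coerce` are hypothetical names; arrays, enums, classes and nested annotations are left out of this sketch:

```java
final class ValueCoercion {
    private ValueCoercion() {}

    /** Converts a stored string to the given String or primitive member type. */
    static Object coerce(String text, Class<?> type) {
        if (type == String.class) {
            return text;
        }
        if (type == boolean.class) {
            // Boolean.parseBoolean would silently map garbage to false, so check explicitly.
            if (text.equals("true")) return Boolean.TRUE;
            if (text.equals("false")) return Boolean.FALSE;
            throw new IllegalArgumentException("not a boolean: " + text);
        }
        if (type == char.class) {
            if (text.length() != 1) {
                throw new IllegalArgumentException("not a single char: " + text);
            }
            return Character.valueOf(text.charAt(0));
        }
        // The numeric valueOf methods throw NumberFormatException,
        // a subclass of IllegalArgumentException, on bad input.
        if (type == byte.class) return Byte.valueOf(text);
        if (type == short.class) return Short.valueOf(text);
        if (type == int.class) return Integer.valueOf(text);
        if (type == long.class) return Long.valueOf(text);
        if (type == float.class) return Float.valueOf(text);
        if (type == double.class) return Double.valueOf(text);
        throw new IllegalArgumentException("unsupported member type: " + type.getName());
    }
}
```

This sidesteps the problem that JSON has no char type and does not distinguish numeric types, because the declared member type rather than the file syntax decides how each value is interpreted.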

dscho commented 10 years ago

> Thanks for the suggestion, it looks like it is very similar in API to SezPoz

That is by design: we started out with SezPoz, but at some stage it became clear that we have slightly different requirements than SezPoz is prepared to address (in particular, we wanted to be free to use different class path libraries than Oracle's, i.e. to be independent of the specifics of the Java serialization of the map used by SezPoz).

> Does it handle @Target(ElementType.FIELD)

No, we only need the annotation processing for classes, therefore we stripped out the support for field or method annotations. It should not be hard at all to get that support back in, though.

jglick commented 10 years ago

> we have slightly different requirements than SezPoz is prepared to address

Not really, I think. #7 foundered at the time, but the goal remains the same. Ensuring ongoing compatibility with the Android JVM still seems tricky (I am not sure how to mechanically test it), but if the only relevant difference is that they do not intend to comply with the 1.5+ serialization spec of HashMap, then any textual format would work around this, and be nicer for debugging anyway.

jglick commented 10 years ago

The fork also seems to have removed the instance type parameter to IndexItem, and the whole instance() method, so it is not a drop-in replacement I am afraid.

dscho commented 10 years ago

> > we have slightly different requirements than SezPoz is prepared to address
>
> Not really, I think.

Well, given that I outlined our requirements (which disagree with using Oracle's class path library's specific serialization), I fail to see how my statement is wrong...

But let's just let this conversation die: we already had it, it was not exactly fruitful, and the outcome is now history and everybody can live with it. Case closed.

jglick / sezpoz

Switch to textual serialization format #6