Serialize indexed annotations as JSON

dscho commented 10 years ago

To support Java classpath libraries whose TreeMap has a serialization that is incompatible with Oracle Java's (Android, I am looking at you), we have to decouple SezPoz from the vagaries of TreeMap's serialization.

To maintain backwards-compatibility, we still accept .jar files whose annotations were serialized by a previous SezPoz version.

This is intended to fix issue #6.

buildhive commented 10 years ago

Jesse Glick » sezpoz #5 FAILURE Looks like there's a problem with this pull request (what's this?)

dscho commented 10 years ago

Oops... BuildHive reports a failure because Java 1.5's IOException has no constructor taking a Throwable. Will fix!

This pull request was actually not yet meant for merging anyway because I have not tested it on our project using SezPoz yet (and I want to make sure that that project continues working as expected with my changes).

dscho commented 10 years ago

Also, to address concerns raised in https://github.com/jglick/sezpoz/issues/6 about JSON: the obvious way I fixed the absence of character values was to introduce a SezPoz-specific extension: my JSON "serialization" interprets quoted value just like Java: double quotes denote Strings and single quotes denote chars (the extension to JSON being the single quote notation, of course).

Please note, too, that it is not really a serialization because the exact types are not written out, but instead the SezPoz-specific classes are mapped to what JSON calls "objects" (which really are maps of key/value pairs). When "de-serializing", this information needs to be incurred from the annotation's definitions rather than the "serialized" data.

As to why choosing JSON over XML, then: I really did not like the complexity the latter adds. Sure, there is XML support in Oracle Java's class path library. But the whole point of this here exercise is to make SezPoz less dependent on Oracle Java's class path library...

buildhive commented 10 years ago

Jesse Glick » sezpoz #6 SUCCESS This pull request looks good (what's this?)

dscho commented 10 years ago

I actually think that I can do one more very nice thing: while we're already deviating from the JSON standard, we might just as well special-case annotations, too, so that they are not written out like {name:"MyAnnotation",values:{}} but as @MyAnnotation{} instead.

buildhive commented 10 years ago

Jesse Glick » sezpoz #7 SUCCESS This pull request looks good (what's this?)

dscho commented 10 years ago

I still would like to perform some tests (in particular to make sure that all possible annotation field types are covered), but this is already pretty much my intended final form of this pull request (modulo possible bug fixes for above-mentioned things).

buildhive commented 10 years ago

Jesse Glick » sezpoz #8 SUCCESS This pull request looks good (what's this?)

ctrueden commented 10 years ago

Why the necessity to type everything so strongly in the "JSON" output? Is there something wrong with just writing strings for everything? Then when reading back in, use the types from the loaded annotation class, converting the strings back into those types via a mechanism similar to SJC's ConvertionUtils class. This strategy would even limitedly resist changes to the annotation structure, rather than imposing a completely rigid type-specificity to it.

dscho commented 10 years ago

@ctrueden it is a matter of practicability. Of course, there is code that says "I expected a Class but got a String" and automatically loads the appropriate class in the appropriate ClassLoader: https://github.com/dscho/sezpoz/blob/72c0076013adfe488572b8ae2e078ff99ccac9de/sezpoz/src/main/java/net/java/sezpoz/IndexItem.java#L339

Likewise, I had code originally to convert a Map to a SerAnnConst if the latter was expected. It's just that an unintuitive number of tests broke because the toString() method used to compare expected to actual values converted the annotations into {"name":"MyAnnotation","values":{}} instead of the previous @MyAnnotation{}. Now, I could have changed the tests (and in my original version I did), but then, I could also simply extend the JSON format (because we already had to extend the JSON format to account for character values, therefore regular JSON tools could not parse our serialized annotations anyway -- and it is not clear to me why third-party tools would need to access our serialized annotations in the first place).

ctrueden commented 10 years ago

I probably do not have a full grasp on the details here, but my point is: why did we have to extend JSON to account for char values? When unmarshalling the char field, we already know it is a char field since the JVM has the annotation class loaded. Why not store it as a single char string in the JSON and convert it as needed as part of the deserialization? I.e.: why store type information in the JSON at all?

dscho commented 10 years ago

@ctrueden because it makes it easier to work with the existing code (which kind of assumes that the deserialized data already have the appropriate types)...

If @jglick feels strongly about it, I will do exactly what you suggested: use a String with a single character to encode character values, and switch back to using maps even for serializing annotations.

I'm not married to the way I do things now...

dscho commented 10 years ago

Hmm. After having thought about it for a little while I tend to agree with @ctrueden that it is completely unnecessary to spoil the compatibility with the JSON standard even if we do not intend to support direct access to the serialized data: it would violate the law of least surprise if we extended the JSON standard when we do not have to.

buildhive commented 10 years ago

Jesse Glick » sezpoz #9 SUCCESS This pull request looks good (what's this?)

dscho commented 10 years ago

Note: there must still be a subtle bug lurking somewhere because I still get test failures with ImageJ2... I'm on it.

buildhive commented 10 years ago

Jesse Glick » sezpoz #10 SUCCESS This pull request looks good (what's this?)

dscho commented 10 years ago

Hah! I knew it! While undoing the JSON extension for character values (and also the annotation extension, so that we use plain, standards-compliant JSON), I thought so hard about adapting IndexItem to convert String values when the expected type is a char that I actually forgot I had not added that handling yet!

So now things are totally groovy. I tested with ImageJ2 -- which is a quite intensive user of SezPoz -- and it works as expected, even when I use the new SezPoz with the original .jar files (including the original Java serializations).

Therefore, from my side, I'd like to humbly register that I think this topic branch is good to go. @jglick could you have a look?

buildhive commented 10 years ago

Jesse Glick » sezpoz #11 SUCCESS This pull request looks good (what's this?)

buildhive commented 10 years ago

Jesse Glick » sezpoz #12 SUCCESS This pull request looks good (what's this?)

jglick commented 10 years ago

Well this was a lot of work but I wish you had asked for some feedback on the design first.

Writing a green-field custom JSON parser and printer is not maintainable. There are several (small) JSON libraries out there with considerable test coverage and maturity. Just embed one. (When JSON support is standardized in Java we should be able to switch to it.)
Do not try to magically accommodate JSON values to an expected class signature. The contents of a serialized annotation should be entirely self-describing, i.e. you should be able to read out a List<SerAnnotatedElement> from an InputStream with no context.
JSON numeric types are not directly interchangeable with Java numeric types; you cannot just cast a double to a float or int and expect to get the original value back. It is not really safe to use JSON numbers at all for numeric types; just read and write these as strings (with explicitly indicated types), and interpret the strings using the standard java.lang methods.
Sniffing file format just makes for unnecessary complexity. Simpler to just write new-format data to a different filename. If JSON, it is polite to make the filename extension .json so that syntax-aware editors can display is properly.
The new format seems unnecessarily verbose, though there does not seem to be a summary of its full syntax anywhere.
There is no reason for toString of IndexItem to change; just read the original SerAnnotatedElements.

By the way, regarding XML

there is XML support in Oracle Java's class path library

XML support is defined by the Java standard and must be present in any compliant Java implementation. (StAX in particular is mandated in Java 6, which is now the minimum supported version of Java for SezPoz.) If you are shooting for some Java subset, you need to be clear what that is, and make sure the Animal Sniffer check validates it.

dscho commented 10 years ago

@jglick regarding the biggest issue first: yes, XML support is defined by the Java standard. The problem is that I need to use our library in another context than Oracle Java, and in that library, we make heavy use of SezPoz. In particular, I want to run our library in a C++ context and found that the Avian VM has a substantially more sensible startup time than Oracle's JRE, its memory footprint is nicer, too, not to mention the size of the fact that Avian can be embedded with a mere 1.5 megabyte C++ library, and it supports AOT compilation of Java. Avian's class path library is limited, though, and does not have any support for XML. It now supports Java (de-)serialization, and it works correctly with the current SezPoz version unless you use Avian's Android class path library binding. There, it fails.

As to...

I wish you had asked for some feedback on the design first

That's not the way I work. I do not ask for permission to work on Open Source. I put out some working code, discuss aspects about it, and follow Boyd's Law of Iteration after that.

I am very sorry if you have a problem with that ;-)

But hey, no problem, if this JSON support is not accepted, I will simply fork (we already have to maintain a custom SezPoz version because the crucial fix I submitted earlier that we absolutely require for ImageJ2 to work properly did not make it into any release yet).

However, I would like to avoid forking eternally because of all the incurred cost. And neither would I like to see official SezPoz continue to have the issues I need to solve for our use case.

In short, I do not think that there is a valid alternative to changing the way SezPoz writes out its annotation indexes, and I also do not believe that there is a valid alternative to introducing the complexity necessary to support reading back annotation indexes written out by previous SezPoz versions. I seriously hope that we can agree on those two points, hopefully more than that, but I'd prefer not to settle for anything less.

Regarding your individual concerns:

Writing a green-field custom JSON parser and printer is not maintainable. There are several (small) JSON libraries out there with considerable test coverage and maturity. Just embed one.

That argument does not manage to convince me, because:

JSON is really, really, really easy. The parser is dirt-simple. It was way simpler and quicker to write that parser than to figure out what candidate third-party library is A) small enough, B) works well, and C) is well-maintained.
We do not require a full-blown JSON parser (even if that would still be simple). We only need to serialize the data types annotations can hold (i.e. the primitive Java types, Class instances, enum values, and nested annotations) plus SezPoz' four custom data types for serialization. And for that small requirement, the burden of adding another dependency was on the wrong side of my cost/benefit sheet.

(When JSON support is standardized in Java we should be able to switch to it.)

Absolutely not, because that would break backwards-compatibility with Java versions up to and including Java 7.

Do not try to magically accommodate JSON values to an expected class signature. The contents of a serialized annotation should be entirely self-describing, i.e. you should be able to read out a List<SerAnnotatedElement> from an InputStream with no context.

And again, I am sad having to report that I am not convinced.

First of all, SezPoz version 1.9 does not serialize List<SerAnnotatedElement>. It serializes the SerAnnotatedElement objects one by one, and for a good reason: the way it is done by 1.9 can be de-serialized one by one, no need to look at all the remaining serialized annotations if the caller of the Index data structure determines that it found the annotation it looked for already.

But let's continue on the assumption that you actually meant a stream of serialized objects, and that your point was essentially that those objects have to come off the InputStream as SerAnnotatedElement objects referencing Ser{Ann,Enum,Type}Const objects identical to the ones that were used to serialize.

I have to admit that it is hard for me to see why the strong desire to have that, long before the values are actually used? Why not have a lazy evaluation of the exact type when and if they are used?

This is strongly related to the point @ctrueden raised earlier (when he mentioned that there is absolutely no need to extend the JSON format for SezPoz, and his argument convinced me): why does the JSON output need to know the exact data types? There is really no defensible reason for that.

In contrast, there is actually a good reason to avoid encoding the exact data type in the JSON output: imagine that somebody wants to change their annotation so that a former Class field is now more free-form and takes a String instead that can be interpreted as a class name, but also as something else. This fails to work with the current SezPoz version, but it would work as expected with a JSON-backed SezPoz (without strong typing).

Besides, and this is more fundamental: unless there is a good reason for it, one should not store redundant information. And the information about the annotations' fields' data types is already stored with the annotations' classes.

Therefore the fact that SezPoz stored those implementation details always struck me as an unfortunate side effect of the use of Java's serialization (which BTW always struck me as an unfortunate design decision that was probably born more out of convenience than anything else, because it is quite obvious that there are serious compatibility concerns with already-compiled code imposed by the exact serialization details -- as the Android example demonstrates in full color).

JSON numeric types are not directly interchangeable with Java numeric types; you cannot just cast a double to a float or int and expect to get the original value back. It is not really safe to use JSON numbers at all for numeric types; just read and write these as strings (with explicitly indicated types), and interpret the strings using the standard java.lang methods.

Well, I do not really use JSON numbers, do I? I write out the numeric values as Strings and then parse them back as at least compatible type: the bit sizes of the primitive types pretty much guarantee that you can safely parse an int or float as a double and get the original value back if you cast to the original type. In particular, the cast chains long > int > short > byte and double > float as well as double > int should be safe.

So the only issue might be that the conversion of a double to a String might be so imprecise as to make the conversion back from String to double incorrect. But that is not the case according to the official Java documentation:

[...] There must be at least one digit to represent the fractional part, and beyond that as many, but only as many, more digits as are needed to uniquely distinguish the argument value from adjacent values of type double. [...]

But I would be convinced by a counter-example.

Sniffing file format just makes for unnecessary complexity. Simpler to just write new-format data to a different filename. If JSON, it is polite to make the filename extension .json so that syntax-aware editors can display is properly.

That is a good point, although we do not have any file name extension to reflect the type as of now either, and one should probably consider the META-INF/annotations/ space as data internal to SezPoz that should not be accessed directly, but only through the SezPoz library.

Also, adding the .json file name extension will actually result in an increase of complexity relative to my current solution: every single current user of the ObjectInputStream (there are three of them) will have to implement extra handling for JSONInputStreams that duplicates pretty much the complete handling for the ObjectInputStreams -- a clear violation of the DRY principle.

Also, your suggestion cannot be used as-is because it introduces an ambiguity. Example: old-style serialization of a class called my.packageOrClass.json and new-style serialization of a class called my.packageOrClass would both want to write into META-INF/annotations/my.packageOrClass.json, and therefore we still would not be able to rely on anything ending in .json to actually be in JSON format.

In short: I see your point, although it needs refinement. I suggest using META-INF/json-annotations/<class-name>.json instead of META-INF/annotations/<class-name>.json.

But again, it will introduce an increase in non-DRY-ness, and therefore not a reduction of complexity relative to my current approach at all, but instead increase complexity.

Please note also that the "sniffing" does not add any complexity to the parser: due to the number parsing, it needs to implement a PushbackInputStream anyway, so that all the complexity added by the "sniffing" are these lines to JSONInputStream's constructor:

    bufferedByte = in.read();
    if (bufferedByte == 0xac) {
        PushbackInputStream pushBack = new PushbackInputStream(in);
        pushBack.unread(bufferedByte);
        fallBack = new ObjectInputStream(pushBack);
        this.in = null;
    } else {
        ...
        fallBack = null;
    }

these lines to JSONInputStream's readObject() method:

    if (in == null) {
        return fallBack.readObject();
    }

and only three additional lines to JSONInputStream's close() method.

In short, the complexity is rather limited as compared to adding case logic three times that checks both an ObjectInputStream and a JSONInputStream.

It might be wiser to simply rename JSONInputStream to something like IndexReader to stop suggesting that this class is only about reading JSON-formatted SezPoz indexes.

If you insist, however, I will change the code.

The new format seems unnecessarily verbose,

Sorry, but no. Are you serious?

The new format is substantially less verbose than the XML one you suggested yourself.

And it is still comparable to the existing Java serialization, because it lists essentially the same information.

If you refer to the use of field names like className or enumName that I used verbatim, sure, I can fix that. But it will not result in a huge reduction of the output because the output essentially contains all the information necessary, with only little in the way of fluff imposed by the use of the JSON format.

though there does not seem to be a summary of its full syntax anywhere.

Yeah, sorry, the non-obvious parts of the syntax (read: the handling of SezPoz-specific classes) is documented only in the commit message of https://github.com/dscho/sezpoz/commit/76cc8388c9c5eeb134f701d789db3ac9c957d1e7 so far. I will include the same information in the Javadoc for JSONOutputStream.

There is no reason for toString of IndexItem to change; just read the original SerAnnotatedElements.

You lost me there. I did not touch the toString() method of IndexItem, unless you see something in https://github.com/jglick/sezpoz/pull/7/files that I don't.

If you are shooting for some Java subset, you need to be clear what that is, and make sure the Animal Sniffer check validates it.

That's an interesting suggestion. I am just not quite sure how to implement it because there are two parts to SezPoz: the annotation processor and the library to access the serialized annotation indexes. The latter is the part I am concerned about possible over-use of Java's class path library's not-quite-recent features, not the former, but I fear that my lack of knowledge about the Animal Sniffer does not allow me to tell it to verify that the access to the annotation indexes (as opposed to the generation of them) should basically only use Java 1.0's class path library (because it actually does not need to use anything else).

I really hope that we can bring this topic to a conclusion to both of our satisfactions. It would be a shame if we could not find agreement over the solution to the -- quite real -- problem my pull request tries to address.

dscho commented 10 years ago

I don't know why I pressed the wrong button.

buildhive commented 10 years ago

Jesse Glick » sezpoz #14 SUCCESS This pull request looks good (what's this?)

buildhive commented 10 years ago

Jesse Glick » sezpoz #15 SUCCESS This pull request looks good (what's this?)

dscho commented 10 years ago

I changed the JSON output to use less verbose fields: class instead of className, enum instead of enumName, etc.

For reference: the ij-core library's annotation index originally was 3849 bytes (with Java serialization), with my previous JSON serialization, it was 4406 bytes, and now it is 4106 bytes.

I will now analyze why it is still larger than the serialized data (which surprises my, quite frankly, because the serialized data should have information about both field names and field types, and the numeric constants we use should not be all that much larger in String representation than in native format.

dscho commented 10 years ago

Ah, the reason is simple: the serialized data stream defines the class only once, in the beginning, and then re-uses that class definition over and over again, while JSON has to specify the field names over and over again.

dscho commented 10 years ago

It just struck me that we can find a compromise between what @jglick wants (the output of deserialization should be ready-to-use SezPoz data, without the need to convert anything only in IndexItem's AnnotationProxy#evaluate() method: I could introduce a layer above the JSONInputStream that knows about SezPoz, and automatically converts the types it finds appropriately (which is quite possible because the four different SezPoz-specific data types can be identified by the keys of the maps they are written out as).

jglick commented 10 years ago

No problem from my side with opening a pull request with a concrete proposal, I was just concerned about wasted effort if we agreed on a significantly different approach.

It [de]serializes the SerAnnotatedElement objects one by one

Right, that is what I meant, sorry for being unclear.

I continue to request that the serialized form of annotations should be entirely self-contained, including exact data types, so there is no ambiguity about what was actually indexed that can only be resolved by referring to the classes as they existed at the time of the indexing.

(Anyway you cannot compatibly change the format of existing annotations except in very simple ways in Java generally, such as adding new attributes with default values, and SezPoz should support no more than that.)

the fact that SezPoz stored those implementation details always struck me as an unfortunate side effect of the use of Java's serialization

Not at all. My intention was to model as best as possible what Java bytecode would actually store for an annotation metadata attribute. (We could use that actual format extracted from its class container context, but it is rather awkward with the constant pool and all, and not at all human-readable.)

FWIW it is a bug in the Android libraries if it fails to properly deserialize a TreeMap, as these serial forms are defined by the Java specification. At any rate, avoiding a reliance on Java serialization is certainly an improvement.

Regarding numeric types, at a minimum going from double to long loses information. Even if you insisted on omitting concrete types in the JSON, using JSON numbers is entirely unnecessary when you could use strings and parse them on demand according to the target type (which also bypasses all the corner cases with NaN and so on).

one should probably consider the META-INF/annotations/ space as data internal to SezPoz that should not be accessed directly, but only through the SezPoz library

Yes of course. Nonetheless when debugging issues that might be SezPoz-related, it would be preferable for files to have the conventional extensions.

every single current user of the ObjectInputStream […] will have to implement extra handling for JSONInputStreams

I imagine you could wrap this handling into a utility method taking a ClassLoader and a class name and returning Iterator<SerAnnotatedElement> or the like.

I suggest using META-INF/json-annotations/<class-name>.json instead of META-INF/annotations/<class-name>.json

That would indeed defend against the vanishingly small possibility that any existing class registered in SezPoz actually has the simple name json.

Not a major objection or anything, I am just uncomfortable with the idea of reusing an existing filename for a completely different format. Imagine the confused error messages that would result if you somehow had an old version of the SezPoz runtime try to load a file created by the new processor. It feels clearer and safer to pick a new filename.

I did not touch the toString() method of IndexItem

Thought I saw some existing tests which had to be changed where the toString of something changed from what looks like a Java annotation to what looks like JSON text. Maybe not IndexItem, my mistake if not, this was just after a quick review.

verify that the access to the annotation indexes (as opposed to the generation of them)

@IgnoreJRERequirement is used by projects using Animal Sniffer to exclude certain classes or methods from the check. Probably just Indexer6 would need to be so marked.

dscho commented 10 years ago

I continue to request that the serialized form of annotations should be entirely self-contained, including exact data types, so there is no ambiguity about what was actually indexed that can only be resolved by referring to the classes as they existed at the time of the indexing.

(Anyway you cannot compatibly change the format of existing annotations except in very simple ways in Java generally, such as adding new attributes with default values, and SezPoz should support no more than that.)

I am really uncomfortable with that requirement, for the following reasons:

It completely contradicts your earlier comment about the format being overly verbose because now you ask to include even more information than I already did, and information at that which is already in the class path to be stored, adding redundancy. Either you want the JSON format to be succinct or not... ;-)
It appears to be overly restrictive to store the data types together with the data when we actually do not need to. It does not add any compile-time or run-time safety, either, to do so.
While you cannot compatibly change the format of existing annotations unless you change over all users, that latter part is possible, and besides: it is not SezPoz' job to test that the annotations' fields' data types are unchanged since the time the annotation processor ran. So it appears to me that serializing the exact data type, and then verifying that when de-serializing, is a lot of churn for at most debatable benefit...

FWIW it is a bug in the Android libraries if it fails to properly deserialize a TreeMap, as these serial forms are defined by the Java specification. At any rate, avoiding a reliance on Java serialization is certainly an improvement.

Unfortunately, this is incorrect. The Android libraries are perfectly valid! The assumption that serialized TreeMap objects can be deserialized no matter which implementation of that class was used at serialization time is what causes the problem.

The exact implementation details of the TreeMap serialization happen to be the concern of the different class path libraries. Assuming that those implementation details are identical between those class path libraries (or even just different versions of said libraries) is not in accordance with the Java standard.

It just happens not to have bitten us so far.

But as soon as the serialVersionUID of the TreeMap is changed (which is expressly allowed by the Java standard, in fact, it is required when the implementation changes substantially), the whole house of cards falls down.

Which is exactly what happened with Android: for all I know, Android's class path library's TreeMap has a different serialVersionUID and maybe it does not even serialize the entries in the custom section of the serialized data in the form: <size-as-int>[<key><value>]* as Oracle's Java class path library does.

This difference between class path libraries is in absolute agreement with the Java standard. Serialization of objects is not considered to be compatible between class path libraries, or even different versions of the same class path library. Nothing in the Java standard suggests that developers can rely on being able to de-serialize data using anything else but the exact same virtual machine with the exact same class path as when the data were serialized.

That is why I would like to suggest that a library such as SezPoz -- which is supposed to be able to run as an annotation processor on the developer's machine in a JVM that is possibly completely different from the JVM in which the final program is run, using SezPoz' serialized annotations -- cannot use Java's serialization, except maybe if all the classes whose instances are serialized are tightly controlled by SezPoz, and that precludes the TreeMap class which is completely outside SezPoz' purview.

Regarding numeric types, at a minimum going from double to long loses information.

Yes, that's why I never did that.

Even if you insisted on omitting concrete types in the JSON, using JSON numbers is entirely unnecessary when you could use strings and parse them on demand according to the target type (which also bypasses all the corner cases with NaN and so on).

And that is almost what I do. I just parse numbers that contain a dot into double values, and all integer types into long values (previously, I parsed them into int values unless the value was larger than Integer.MAX_VALUE which missed out on the negative side, but I fixed that locally already).

Convinced?

one should probably consider the META-INF/annotations/ space as data internal to SezPoz that should not be accessed directly, but only through the SezPoz library

Yes of course. Nonetheless when debugging issues that might be SezPoz-related, it would be preferable for files to have the conventional extensions.

However, requiring a file extension is a deviation from how things were done previously.

In the least, you are asking me to duplicate the resource, the resources and the ois fields in the Index, some rather non-DRY code to handle one set of those fields in the same way as the other set, but only if the latter is exhausted of entries, and since JSONInputStream does not extend ObjectInputStream (and for a good reason: the former is not as general-purpose as the latter, and neither is it intended to), this could not be abstracted away into a small class, either, but would have to be written out in at least the first two places mentioned above (the unit test would not need to care about the old-style serialization).

All of this to support two types of serializations, when it should be considered an error if the same .jar file contained both!

In short, the requested change would result in quite a bit of almost-duplicated code that almost certainly opens the door for subtle bugs, for little benefit, when the alternative would be substantially shorter in code, and easier to understand.

Even worse: developers would run into problems when not running mvn clean before rebuilding their projects with a new SezPoz versions because they would have a stale annotation index in their META-INF/annotations/.

Of course, we could work around that by preferring the JSON over the non-JSON annotation indexes when we have both in the same .jar file, but that would require inspecting the URL of the getResources() call and keeping track of the class path elements we inspected already, resulting in even more complexity. And of course, it would literally guarantee that .jar files would be generated and deployed with completely useless (because outdated) annotation indexes.

So I really would like to ask you to reconsider all those problems, and then the question whether you would prefer the immense complexity and the incurred problems due to mutually contradicting annotation indexes in the same class path element over the sweet and small and robust change to inspect the first byte (0xac -- the tell-tale for Java serialization -- is not a valid first byte for any JSON data) and just fall back to ObjectInputStream when appropriate.

Not a major objection or anything, I am just uncomfortable with the idea of reusing an existing filename for a completely different format.

... while I am uncomfortable for having different file names for the exact same information, opening the door to the problem when there are two files present which contradict each other.

Imagine the confused error messages that would result if you somehow had an old version of the SezPoz runtime try to load a file created by the new processor.

... as opposed to the lack of error messages when SezPoz fails to find any annotations, leaving the user confused. Or even worse, SezPoz actually finding an obsolete version of the annotation index (which is now thoroughly out of date with the actual annotations present in the .jar file, causing a lot more confusing error messages).

Please understand that I am not trying to be a dick here, and I would gladly let myself be convinced of a superior solution. I am not married to the particulars of my solution. But I am interested in a robust solution. In fact, I am interested in the best solution we can come up with together (because I am convinced that one developer is always less intelligent than two developers).

I did not touch the toString() method of IndexItem

Thought I saw some existing tests which had to be changed where the toString of something changed from what looks like a Java annotation to what looks like JSON text. Maybe not IndexItem, my mistake if not, this was just after a quick review.

I changed some tests because they assumed that the de-serialized objects were of type SerEnumConst.

In the upcoming version of this pull request, these tests do not need to be changed because the de-serialized objects are identical to what they were before (as I said, I would have been more happy with IndexItem.AnnotationProxy converting the data types, but I figure that you won't accept that change).

verify that the access to the annotation indexes (as opposed to the generation of them)

@IgnoreJRERequirement is used by projects using Animal Sniffer to exclude certain classes or methods from the check. Probably just Indexer6 would need to be so marked.

I am just concerned that this would stop defending against regressions with regard to annotation processing in JDK6...

jglick commented 10 years ago

Way too long to respond today, sorry. Will try to get back to you.

dscho commented 10 years ago

@jglick yep, I understand. It's just a very important issue for the project I try to get supported better by SezPoz, that's why I really tried to avoid coming over as if I had not thought about this long and hard. There are very tricky issues, and we rely on SezPoz to be robust.

dscho commented 10 years ago

So here is another two hours worth of work: https://github.com/dscho/sezpoz/compare/json-v4. It does the following:

implements a type-safe JSON mapping for the annotation indexes (primitive types are written out as {"type":"<type>","value":"<value>"} instead of the former <value> that is cast to the appropriate type when the annotation proxy is invoked).
as a consequence, this iteration (and only this iteration) now truly deserves that scathing comment "The new format seems unnecessarily verbose": remember when I stated that ij-core's annotation index went from 3849 bytes (with Java serialization) to 4406 bytes and then to 4106 bytes? Well, now it is 5668 bytes. That's what one could rightfully call bloat.
as a further consequence, the nice speedup of the unit tests is gone. The performance is back to where it was with the Java serialization.
this iteration proves that the code as I proposed it originally (with lazy type evaluation, quicker runtime and smaller JSON output size) is simpler than the code in json-v4: despite deleting many Javadocs without adding new ones, and despite deleting the backwards-compatibility unit test by accident, the tally is 358 insertions(+), 287 deletions(-), i.e. there is more code now than there was earlier.
it precludes everybody from being entitled to claim that I would be unwilling to spend an insane amount of time to implement feature requests even if I disagree with them.

So I would like to register that I proved beyond reasonable doubt that the former solution (to write out unbloated JSON output that is made type-safe only when it is matched up with the annotation classes) is superior to the new solution (writing JSON that contains enough information to be read back type-safely without inspecting the annotation classes).

dscho commented 10 years ago

After mulling over this pull request for a night, I am more and more convinced that I need to take a couple of steps back, and a few deep breaths. And because I would be too ashamed if any parts of json-v4 made it into SezPoz, I deleted that branch.

Sorry for all the trouble!

buildhive commented 10 years ago

Jesse Glick » sezpoz #16 SUCCESS This pull request looks good (what's this?)

jglick commented 10 years ago

First of all, rest assured that I do want this change in some form or another in SezPoz, and believe that you have thought about the implications in some detail, since you are probably its heaviest user in the general form. (Jenkins uses it constantly, but in a degenerate form without annotation attributes.) I do not have much time to spend on it so I just want to ensure that it does not become substantially more complex or buggy in the functionality it already handles.

Regarding the self-contained format: if it is substantially slower then I guess we do without. Still makes me uncomfortable, but oh well.

Regarding numbers: casting long or double to other numeric types still seems troubling. For integer types you need only consider overflow (i.e. handle it with a proper error), but for floating-point types you need to consider significant bits, NaN, infinities, -0, and so on. And why go to all this effort? Just write out the value as a string (using e.g. Integer.toString), and when necessary (such as in the annotation proxy) read it back out using the known type (e.g. Double.parseString). This solves all such possible issues at once, and as a bonus a hand-written JSON parser need not even recognize numeric literals, making it considerably simpler: if you do the same for boolean values then all tokens are {, }, :, [, ], ,, and " strings with \ escapes (you can assume there is no whitespace unless the indexer is given an option to add some, and null cannot be used in annotations).

Regarding @IgnoreJRERequirement: yes, currently Animal Sniffer does not support applying one constraint to one section of a module, and another to a different section. If you are worried about Java 7 dependencies creeping into Indexer6 then this could be moved into a separate development-time-only module. The reason I did not do this to begin with was that I really wanted to make sure that the indexing was automatic when the SezPoz library was on your classpath, so that you do not use its API and silently get no index. Could be accomplished by using some Maven trickery to create a union JAR of the two modules, ideally only publishing that union JAR by somehow suppressing deployment for its components. Anyway I am less concerned about this since the fallout of such an accidental dependency would be that developers building with an old JDK and upgrading SezPoz would find compilation broken until they reverted the update, which is no big deal compared to a new SezPoz runtime being deployed to users and throwing LinkageErrors when loaded in an unexpectedly old JRE.

Regarding one filename vs. two: if it is not practical to pick a distinct filename then so be it.

Regarding Java serialization (this is really an aside!), it is still my impression that the Java specification does mandate particular serial forms for classes specified in the platform. That is why they are documented. Probably the Android project simply does not care, as they are not attempting to make a fully compatible implementation.

dscho commented 10 years ago

Regarding Java serialization (this is really an aside!), it is still my impression that the Java specification does mandate particular serial forms for classes specified in the platform.

I never contested that.

It is true, however, that the serialization of different versions even of the same class is in general incompatible.

In the simplest form, the breakage occurs when you add a field. For obvious reasons, of course.

It gets more subtle when you change the serialVersionUID. I have no idea whether Java deserialization fails if it encounters a different serialVersionUID than the one in the current version of the serialized object's class.

However, if you have a class which implements its own readObject(ObjectInputStream)/writeObject(ObjectOutputStream) combo, then it would be a serious bug if Java's serialization did not care about the serialVersionUID because in that case, there are usually transient fields in the class that get initialized by the readObject() method (and serialized by writeObject()) and the serialVersionUID is a marker for the different versions.

This fact is nobody's fault: not the fault of the people implementing the classes nor the fault of the Java standard nor the fault of the users. It is just fundamentally impossible to guarantee a serialization format that is 100% compatible between different class implementations when so much as the name of private fields can be chosen freely by the developers.

Now, if you have two developers developing the same class with the same signatures but at least one of the implementations implements the readObject/writeObject pair -- e.g. for efficiency -- while neither of the developers dares to look at the other one's output for licensing reasons, it is virtually guaranteed that those class' serializations are incompatible with each other. Just the serialization, but that is already enough, obviously, to break a model that is based on serializing data in one VM and deserializing it in the other VM when one, just one, even just only one single serialized object has a class whose serialization format is possibly different between VMs.

You probably guessed it: TreeMap is such a class.

So no, Android's development team did nothing wrong, Oracle's (and formerly Sun's) Java developers did nothing wrong, not even the Java architects did anything wrong, there is no bug that is the reason why annotation indexes written out using SezPoz with an Oracle Java cannot be read back using SezPoz with Android. The simple reason is that SezPoz relies on the serialization format of TreeMap (and also ArrayList) to be constant between different Java class path libraries, when it simply is not (and neither does the Java specification mandate it to be, all it mandates is that the serialization/deserialization of the exact same class version' is constant between Java class path libraries -- and this is the case with Android and Oracle Java, and now also Avian because I implemented that).

Of course, this is not a problem as long as you only ever run things on Oracle Java (or OpenJDK whose class path library is compatible with Oracle's, for obvious reasons).

jglick commented 10 years ago

In fact the serial form of TreeMap is documented (no need to look at source code):

The size of the TreeMap (the number of key-value mappings) is emitted (int), followed by the key (Object) and value (Object) for each key-value mapping represented by the TreeMap. The key-value mappings are emitted in key-order (as determined by the TreeMap's Comparator, or by the keys' natural ordering if the TreeMap has no Comparator).

The Android developers just did not know that, or chose to ignore it.

dscho commented 10 years ago

Oh yes, there is no doubt that the TreeMap serialization of the Oracle Java class path library is documented, even before Sun open-sourced their Java implementation: http://docs.oracle.com/javase/1.5.0/docs/api/serialized-form.html#java.util.TreeMap

However, I was talking about the Java specification that was the basis of Apache Harmony and as a consequence of Dalvik, and what everyone who wants to implement a Java class path library, and I guess that they based their implementation on Java 1.4.2's as described in the documentation here: http://www.few.vu.nl/~eliens/documents/java/jdk1.2-docs/docs/api/java/util/TreeMap.html (officially only available from the Java Archive: http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-142docs-2045554.html)

As I said, this is not a problem as long as you only ever run things on Oracle Java, so I am less and less concerned about changing SezPoz' serialization.

jglick / sezpoz

Serialize indexed annotations as JSON #7