kklipsch opened this issue 2 years ago
I am also looking at logical types and comparing goavro to the Java implementation, and the code below, which I managed to put together, seems to confirm your findings:
// > export JAVA_HOME=/usr/local/opt/openjdk
// > export PATH="${JAVA_HOME}/bin:$PATH"
// > java --version
// openjdk 18.0.1 2022-04-19
// OpenJDK Runtime Environment Homebrew (build 18.0.1+0)
// OpenJDK 64-Bit Server VM Homebrew (build 18.0.1+0, mixed mode, sharing)
// > java -cp avro_1.11/avro-1.11.0.jar:avro_1.11/jackson-core-2.12.5.jar:avro_1.11/jackson-annotations-2.12.5.jar:avro_1.11/jackson-databind-2.12.5.jar:avro_1.11/slf4j-api-1.7.32.jar Main.java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.Encoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class Main {
    // Decode JSON-encoded Avro into a generic datum, then re-encode it as binary.
    static byte[] fromJsonToAvro(String json, Schema schema) throws Exception {
        InputStream input = new ByteArrayInputStream(json.getBytes());
        DataInputStream din = new DataInputStream(input);
        Decoder decoder = DecoderFactory.get().jsonDecoder(schema, din);
        DatumReader<Object> reader = new GenericDatumReader<>(schema);
        Object datum = reader.read(null, decoder);

        GenericDatumWriter<Object> w = new GenericDatumWriter<>(schema);
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        Encoder e = EncoderFactory.get().binaryEncoder(outputStream, null);
        w.write(datum, e);
        e.flush();
        return outputStream.toByteArray();
    }

    // Decode binary-encoded Avro, then re-encode it as JSON.
    public static String avroToJson(Schema schema, byte[] avroBinary) throws IOException {
        DatumReader<Object> datumReader = new GenericDatumReader<>(schema);
        Decoder decoder = DecoderFactory.get().binaryDecoder(avroBinary, null);
        Object avroDatum = datumReader.read(null, decoder);

        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DatumWriter<Object> writer = new GenericDatumWriter<>(schema);
        JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, baos, false);
        writer.write(avroDatum, encoder);
        encoder.flush();
        baos.flush();
        return new String(baos.toByteArray());
    }

    // A nullable long carrying a timestamp-millis logical type, as in this issue.
    static String schemaJSON = """
            {
              "type": "record",
              "name": "nester",
              "namespace": "test",
              "fields" : [
                {"name": "a", "type": ["null", {"type": "long", "logicalType":"timestamp-millis"}], "default": null}
              ]
            }
            """;

    static String inputJSON = """
            {"a":{"long":1136214245565}}
            """;

    public static void main(String[] args) {
        try {
            Schema schema = new Schema.Parser().parse(schemaJSON);
            byte[] avroByteArray = fromJsonToAvro(inputJSON, schema);
            String outputJson = avroToJson(schema, avroByteArray);
            System.out.println(outputJson);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
It outputs {"a":{"long":1136214245565}}. Also, if I change the inputJSON to {"a":{"long.timestamp-millis":1136214245565}}, it fails with org.apache.avro.AvroTypeException: Unknown union branch long.timestamp-millis, which makes sense given the output format...
Disclaimer: I'm not really experienced with Java and this is some copy/paste code from StackOverflow that I modified a bit....
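For comparison, here is a minimal Go sketch of the goavro side. This assumes goavro v2's public API (NewCodec, TextualFromNative, NativeFromTextual) and its documented time.Time native representation for timestamp-millis; it has not been verified against a specific goavro release, so treat the expected outputs in the comments as illustrative of what this issue describes rather than as confirmed behavior:

package main

import (
	"fmt"
	"time"

	"github.com/linkedin/goavro/v2"
)

const schemaJSON = `{
  "type": "record",
  "name": "nester",
  "namespace": "test",
  "fields": [
    {"name": "a", "type": ["null", {"type": "long", "logicalType": "timestamp-millis"}], "default": null}
  ]
}`

func main() {
	codec, err := goavro.NewCodec(schemaJSON)
	if err != nil {
		panic(err)
	}

	// Per this issue, goavro keys the union branch as "long.timestamp-millis"
	// rather than the spec-mandated "long", and its native representation for
	// timestamp-millis is time.Time.
	native := map[string]interface{}{
		"a": map[string]interface{}{
			"long.timestamp-millis": time.UnixMilli(1136214245565).UTC(),
		},
	}

	textual, err := codec.TextualFromNative(nil, native)
	if err != nil {
		panic(err)
	}
	// Expected (per this issue) to print the goavro-specific form, roughly
	// {"a":{"long.timestamp-millis":...}}, whereas the Java implementation
	// produces {"a":{"long":1136214245565}} for the same datum.
	fmt.Println(string(textual))

	// Decoding the spec-conformant form should fail for the mirror-image
	// reason of the Java AvroTypeException above: goavro only recognizes
	// the logical-type branch name.
	if _, _, err := codec.NativeFromTextual([]byte(`{"a":{"long":1136214245565}}`)); err != nil {
		fmt.Println(err)
	}
}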
Thanks for this issue. This will take some thinking over here. I'm not sure if this means the current implementation is simply broken, or if it's just in violation of the spec (which can be regarded as a form of being broken too). I'm also not sure whether fixing this specific problem would cause breakage in the existing user base. It does sound like a risky fix. Please share your thoughts if you know about those topics.
From my point of view this means that the JSON encoding is broken: you won't be able to use JSON encoded with goavro with other implementations of Avro.
It likely will break any use cases where a current user is a) using JSON encoding with logical types and b) reading and writing exclusively with goavro (because if they had another client in the mix, it would already be broken).
Given how long this library has been around, a new major version should be released if this change is made. My understanding is that logical types are not used by many people, so I think most users won't have any issues when upgrading to the new major version, which they'd need to do manually anyway. Adding and maintaining a new API for this seems more painful long term...
@kklipsch it sounds like you are in favor of getting this one going, and we would need to make a major version bump since it is expected to be a breaking change. Is that right? Will such a breaking change open up a brighter future, or will it simply annoy the existing user base? It DOES look like it's actually a small (but breaking) change, and it definitely seems to be the correct behavior per the spec. I suppose it should be fine as long as both versions are kept up to date for a reasonable support period. Given that changes are infrequent here, that should not be a big maintenance problem.
Thoughts, @kklipsch?
I don't see a PR linked here. Is this one done already?
I think it definitely needs to be fixed. I'm less opinionated on the version issue. To my mind, this has always been broken, so it's not a breaking change to fix it. But I can see the other side of the issue as well.
I did not actually attempt to fix this in a PR. I can take a crack at it if you like, but the earliest I'd be able to look at it would be next week.
I think it would be great if you'd fix it. We appreciate contributions!
Just a note that I won't be able to work on this for at least another week. I'll post an affirmative message if I take it on; until I add that message, anyone else should feel free to take it.
Thanks, we'll see if anybody jumps at it.
The spec states: "A logical type is always serialized using its underlying Avro type so that values are encoded in exactly the same way as the equivalent Avro type that does not have a logicalType attribute."
But goavro names union branches as "type.logicalType" (for example, "long.timestamp-millis"). The codec will not accept a record whose union branch is named anything else, and it writes that name directly into the JSON output, which is incorrect. The binary encoding appears correct, which makes sense: binary unions identify the branch by index rather than by name, so the branch-naming problem never reaches the wire.
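Concretely, for the schema above, the two JSON encodings side by side (the second form is, per this thread, what goavro currently reads and writes; only the first is valid per the spec):

{"a":{"long":1136214245565}}                    <- spec-conformant, and what the Java implementation emits
{"a":{"long.timestamp-millis":1136214245565}}   <- goavro's current form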