Closed: ghislainfourny closed this issue 4 years ago.
Hmm, I'll have to look into what I need to do to support that. I presume I just need to implement an interface or something. Thanks for the report!
Thanks a lot, Andrew!
So, this actually is more complicated than I had hoped... The "typical" way to register classes in Kryo (from the user side) is to add the necessary classes to the SparkConf. But the datasource isn't initialized until after the SparkContext is created, and once the SparkContext is initialized, changes to the config don't take effect. I've asked the Spark list for clarification, and I'll look into it some more.
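For context, a minimal configuration sketch of that "typical" registration path, shown here with PySpark as one possible API (`com.example.MyDatum` is a hypothetical placeholder, since the exact Laurelin classes involved aren't named in this thread). The key constraint is that the settings must be applied before the SparkContext exists:

```python
# Sketch: registering classes for Kryo before the SparkContext is created.
# "com.example.MyDatum" is a hypothetical placeholder class, not from Laurelin.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = (
    SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Comma-separated list of fully qualified class names to register:
    .set("spark.kryo.classesToRegister", "com.example.MyDatum")
)

# Once this session (and its SparkContext) exists, later changes to the
# config are not picked up -- which is exactly the chicken-and-egg problem
# for a datasource that is only initialized afterwards.
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```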
Thanks a lot, Andrew! For now we are deactivating Kryo and using the Java serializer instead whenever we read ROOT files, which works fine (just slower). So this does not block us.
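The fallback described above amounts to leaving `spark.serializer` at its default; as a minimal PySpark configuration sketch, setting it explicitly is equivalent to simply not passing the KryoSerializer flag on spark-submit:

```python
# Workaround sketch: pin the default (slower but more compatible)
# Java serializer instead of Kryo.
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf().set(
    "spark.serializer", "org.apache.spark.serializer.JavaSerializer"
)
spark = SparkSession.builder.config(conf=conf).getOrCreate()
```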
Hi Ghislain, it turns out that something actually is unserializable: even if I manually add some options for spark.kryo.classesToRegister, it still fails. I added a testcase to manually round-trip the offending class through Kryo, but that doesn't trigger the error. (I also misread the Kryo docs: you don't have to manually register every class, it just improves performance to do so.)
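The round-trip testcase idea generalizes beyond Kryo. As a language-agnostic sketch (using Python's stdlib pickle in place of Kryo, and a hypothetical Record class standing in for the offending class), a minimal serialize-then-deserialize check looks like:

```python
import pickle

# Hypothetical stand-in for the class suspected of being unserializable;
# pickle plays the role that Kryo plays on the Spark side.
class Record:
    def __init__(self, values):
        self.values = values

    def __eq__(self, other):
        return isinstance(other, Record) and self.values == other.values

def round_trips(obj):
    """Serialize then deserialize, and compare against the original."""
    return pickle.loads(pickle.dumps(obj)) == obj

assert round_trips(Record([1, 2, 3]))
```

As the thread notes, such an isolated round trip can still pass while the full Spark job fails, because Spark may serialize wrapper objects that the unit test never touches.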
@ghislainfourny Hi! I think the above PR (#85) fixes the Kryo serialization problem. Or, at least, the test I wrote fails without the changes and succeeds with them. Would you be willing to test and let me know if it works with your use-case? I assume you don't set spark.kryo.registrationRequired to true, right?
Hi Andrew, thanks a lot! I'll give it a try -- although we are using Laurelin via a Maven dependency, so I am not sure how easy it is to point our project to a Laurelin checkout instead. If I am blocked I will let you know.
Okay -- I'd prefer not to cut a release with a half-fix, but I honestly don't know the best way to publish an "intermediate artifact" in Maven. If you don't have a good idea either, I'll make a release from this branch.
Dear Andrew,
My pleasure to confirm that it works -- with Kryo (see below). Thanks a lot!
Kind regards, Ghislain
$ spark-submit target/spark-rumble-1.5-jar-with-dependencies.jar --shell yes --show-error-info yes
20/04/03 16:58:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
____ __ __
/ __ \__ ______ ___ / /_ / /__
/ /_/ / / / / __ `__ \/ __ \/ / _ \ The distributed JSONiq engine
/ _, _/ /_/ / / / / / / /_/ / / __/ 1.5 "Southern Live Oak" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/
Master: local[*]
Item Display Limit: -
Output Path: -
Log Path: -
Query Path : -
rumble$ root-file("../laurelin/./testdata/uproot-small-flat-tree.root", "tree")
>>>
>>>
{ "Int32" : 0, "Int64" : 0, "UInt32" : 0, "UInt64" : 0, "Float32" : 0, "Float64" : 0, "ArrayInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "N" : 0, "SliceInt32" : [ ], "SliceInt64" : [ ], "SliceUInt32" : [ ], "SliceUInt64" : [ ], "SliceFloat32" : [ ], "SliceFloat64" : [ ] }
{ "Int32" : 1, "Int64" : 1, "UInt32" : 1, "UInt64" : 1, "Float32" : 1, "Float64" : 1, "ArrayInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "N" : 1, "SliceInt32" : [ 1 ], "SliceInt64" : [ 1 ], "SliceUInt32" : [ 1 ], "SliceUInt64" : [ 1 ], "SliceFloat32" : [ 1 ], "SliceFloat64" : [ 1 ] }
...
Very nice! I'll cut a release shortly.
It seems that Laurelin does not support Kryo serialization. The following error is thrown:
It would be nice to add Kryo support, as this makes Spark querying considerably faster.
(It is activated with
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
on spark-submit.)