spark-root / laurelin

Allows reading ROOT TTrees into Apache Spark as DataFrames
BSD 3-Clause "New" or "Revised" License

Support for Kryo serialization #82

Closed ghislainfourny closed 4 years ago

ghislainfourny commented 4 years ago

It seems that Laurelin does not support Kryo serialization. The following error is thrown:

com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException

It would be nice to add Kryo support, as this makes Spark querying considerably faster.

(it is activated with --conf spark.serializer=org.apache.spark.serializer.KryoSerializer on spark-submit).
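For reference, a minimal submit invocation with Kryo enabled might look like the following (the application jar name is a hypothetical placeholder):

```shell
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  my-app.jar
```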

PerilousApricot commented 4 years ago

Hmm, I'll have to take a look into what I need to do to support that. I presume I just need to implement an interface or something. Thanks for the report!

ghislainfourny commented 4 years ago

Thanks a lot, Andrew!

PerilousApricot commented 4 years ago

So, this is actually more complicated than I had hoped... the "typical" way to register classes in Kryo (from the user side) is to add the necessary classes to the SparkConf. But the datasource isn't initialized until after the SparkContext is created, and once the SparkContext exists, changes to the config don't take effect. I've asked the Spark list for clarification, and I'll look into it some more.
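For context, the user-side registration described above is normally done through Spark configuration before the context starts, e.g. in spark-defaults.conf or via --conf flags (the class name below is a hypothetical placeholder, not an actual Laurelin class):

```
spark.serializer              org.apache.spark.serializer.KryoSerializer
spark.kryo.classesToRegister  com.example.laurelin.MyArrayType
```

Once the SparkContext is up, these settings are frozen, which is why a data source that is only loaded afterwards cannot register its own classes this way.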

ghislainfourny commented 4 years ago

Thanks a lot, Andrew! For now we are deactivating Kryo and using the Java serializer instead whenever we read ROOT files, which works fine (just slower), so this does not block us.

PerilousApricot commented 4 years ago

Hi Ghislain, it turns out that something actually is unserializable -- even if I manually add the relevant classes to spark.kryo.classesToRegister, it still fails. I added a test case that round-trips the offending classes through Kryo manually, but that doesn't trigger the error. (I also misread the Kryo docs: you don't have to manually register every class, it just improves performance to do so.)
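The round-trip test pattern mentioned above is straightforward: serialize an object to bytes, deserialize it back, and assert the result matches. Laurelin's actual test goes through Kryo; the sketch below shows the same pattern using only JDK serialization, with a hypothetical Payload class standing in for the offending classes:

```java
import java.io.*;
import java.util.Arrays;

// Tiny serializable stand-in for the classes being checked; the real test
// would round-trip Laurelin's own types through a Kryo instance instead.
class Payload implements Serializable {
    final long[] offsets;
    Payload(long[] offsets) { this.offsets = offsets; }
}

public class RoundTripTest {
    // Serialize to an in-memory buffer, then deserialize from it.
    static Object roundTrip(Object obj) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(obj);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Payload before = new Payload(new long[]{0, 10, 20});
        Payload after = (Payload) roundTrip(before);
        if (!Arrays.equals(before.offsets, after.offsets))
            throw new AssertionError("round trip changed the payload");
        System.out.println("round trip OK");
    }
}
```

A test like this failing to reproduce the error usually means the NullPointerException comes from state that only exists inside a live Spark job, not from the serialization of the classes themselves.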

PerilousApricot commented 4 years ago

@ghislainfourny Hi! I think the above PR (#85) fixes the Kryo serialization problem. Or, at least, the test I wrote fails without the changes and succeeds with them. Would you be willing to test and let me know if it works with your use case? I assume you don't set spark.kryo.registrationRequired to true, right?
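For reference, the strict mode asked about above is a single config setting:

```
spark.kryo.registrationRequired  true
```

With it set, Kryo throws an error on any class that was not explicitly registered, instead of silently falling back to writing full class names; a fix that works in the default mode could still break under this setting.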

ghislainfourny commented 4 years ago

Hi Andrew, thanks a lot! I'll give it a try -- although we are using Laurelin via a Maven dependency, so I am not sure how easy it is to point our project at a Laurelin checkout instead. If I am blocked, I will let you know.

PerilousApricot commented 4 years ago

Okay -- I'd prefer not to cut a release with a half-fix, but I honestly don't know the best way to publish an "intermediate artifact" in Maven. If you don't have a good idea either, I'll make a release with this branch.

ghislainfourny commented 4 years ago

Dear Andrew,

My pleasure to confirm that it works -- with Kryo (see below). Thanks a lot!

Kind regards, Ghislain

$ spark-submit target/spark-rumble-1.5-jar-with-dependencies.jar --shell yes --show-error-info yes
20/04/03 16:58:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    ____                  __    __   
   / __ \__  ______ ___  / /_  / /__ 
  / /_/ / / / / __ `__ \/ __ \/ / _ \  The distributed JSONiq engine
 / _, _/ /_/ / / / / / / /_/ / /  __/  1.5 "Southern Live Oak" beta
/_/ |_|\__,_/_/ /_/ /_/_.___/_/\___/

Master: local[*]
Item Display Limit: -
Output Path: -
Log Path: -
Query Path : -

rumble$ root-file("../laurelin/./testdata/uproot-small-flat-tree.root", "tree")
>>> 
>>> 
{ "Int32" : 0, "Int64" : 0, "UInt32" : 0, "UInt64" : 0, "Float32" : 0, "Float64" : 0, "ArrayInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayUInt64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat32" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "ArrayFloat64" : [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ], "N" : 0, "SliceInt32" : [ ], "SliceInt64" : [ ], "SliceUInt32" : [ ], "SliceUInt64" : [ ], "SliceFloat32" : [ ], "SliceFloat64" : [ ] }
{ "Int32" : 1, "Int64" : 1, "UInt32" : 1, "UInt64" : 1, "Float32" : 1, "Float64" : 1, "ArrayInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayUInt64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat32" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "ArrayFloat64" : [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "N" : 1, "SliceInt32" : [ 1 ], "SliceInt64" : [ 1 ], "SliceUInt32" : [ 1 ], "SliceUInt64" : [ 1 ], "SliceFloat32" : [ 1 ], "SliceFloat64" : [ 1 ] }
...
PerilousApricot commented 4 years ago

Very nice! I'll cut a release shortly.