vincenzobaz / spark-scala3


Configurable tuple encoding #56

Closed jberkel closed 10 months ago

jberkel commented 10 months ago

Right now, tuples are encoded as structs with keys _1, _2, etc. for each element:

("a", "b", "c") => { "_1": "a", "_2": "b", "_3": "c" }

This is understandable, as tuples are just a kind of Product. However, it becomes a problem when compatibility with other libraries is required. uPickle, for instance, serializes tuples as lists:

// Tuples of all sizes (1-22) are serialized as heterogeneous JSON lists
write((1, "omg"))                 ==> """[1,"omg"]"""
write((1, "omg", true))           ==> """[1,"omg",true]"""

https://com-lihaoyi.github.io/upickle/#Collections

Does it make sense to support this form of tuple encoding? It would also save space, since no keys need to be stored. However, I imagine it wouldn't play well with the schema for mixed-type tuples?
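
For illustration, the schema concern comes from Spark arrays being homogeneous: ArrayType carries a single element type, so a list-style encoding maps naturally onto same-type tuples but not mixed ones. A rough sketch of the two shapes (the value names are only illustrative):

import org.apache.spark.sql.types.*

// ArrayType takes exactly one elementType, so a homogeneous tuple
// like (1, 2, 3) could map to array<int>...
val homogeneousSchema = ArrayType(IntegerType, containsNull = false)

// ...but (1, "omg", true) has no single element type. The closest fits
// are array<string> (coerce every element) or the existing struct
// encoding, which types each field independently:
val mixedAsStruct = StructType(Seq(
  StructField("_1", IntegerType),
  StructField("_2", StringType),
  StructField("_3", BooleanType)
))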

vincenzobaz commented 10 months ago

Hey @jberkel, the current behavior with _n mirrors the behavior of the tuple encoder provided by Spark itself. You can build a different encoder by defining your own Serializer and Deserializer for Tuple (e.g. given [T <: Tuple]: Serializer[T] = ???). It is a bit involved but should be doable.
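
As a minimal sketch, such a pair of givens might take the shape below. The import path and the alias-given form are assumptions, not the library's confirmed API; check the spark-scala3 sources for the exact traits and the members to implement.

import scala3encoders.derivation.{Serializer, Deserializer}

// Hypothetical: one given per type class, covering every tuple type.
// The bodies depend on the traits' actual members, so they are left as ???.
given listTupleSerializer[T <: Tuple]: Serializer[T] = ???
given listTupleDeserializer[T <: Tuple]: Deserializer[T] = ???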

jberkel commented 10 months ago

@vincenzobaz Thanks for the pointer, I'll give it a go and add the implementation to this issue.

vincenzobaz commented 10 months ago

@jberkel no problem! We could add your implementation to the README as an example of encoder customization :)