vincenzobaz / spark-scala3


Improve error message when encoding/decoding recursive case classes #54

Open · jberkel opened 1 year ago

jberkel commented 1 year ago

I'm not sure if this is something supported by Spark, but I have the following recursive case class:

case class SerializedSense(
    definition: String,
    subsenses: Seq[SerializedSense] = Seq.empty
) derives upickle.default.ReadWriter

I can serialize/deserialize this successfully using upickle, but when using the same case class inside Spark

import org.apache.spark.sql.DataFrame
import scala3encoders.given

def process(input: DataFrame): Unit =
  input
    .as[SerializedSense]
    …

I get the following compiler error:

No given instance of type scala3encoders.derivation.Deserializer[Seq[SerializedSense]] was found. I found:
…

Is this a limitation of Spark or is this a case not yet supported by the encoder?

michael72 commented 1 year ago

@jberkel the error message is misleading - not sure how to fix that, but Spark does not support recursive (circular) types.

When encoding/decoding a recursive class in vanilla Spark (Scala 2.13), the error message is maybe a bit more helpful:

[error] org.apache.spark.SparkUnsupportedOperationException: cannot have circular references in class, but got the circular reference of class <classname>.
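For reference, a minimal sketch that reproduces the runtime failure in vanilla Spark on Scala 2.13 (the SparkSession setup and the emptyDataset call are assumptions for a self-contained repro; any operation that materializes the encoder fails the same way):

import org.apache.spark.sql.SparkSession

case class SerializedSense(
    definition: String,
    subsenses: Seq[SerializedSense] = Seq.empty
)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// This compiles fine; encoder derivation only walks the type at runtime
// and then throws: "cannot have circular references in class, but got
// the circular reference of class SerializedSense"
val failing = spark.emptyDataset[SerializedSense]
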
jberkel commented 1 year ago

Ok, that's what I suspected. If the error message can be improved, good; otherwise it's not a big deal.

michael72 commented 1 year ago

Correction on my part - it is not a compile-time error but a runtime error, so I'm not sure if we can do it in this library. Maybe walkedTypePath could be used for that. However, there are more complicated scenarios where even Spark doesn't respond with a proper error message, e.g.:

trait Node
case class Tree(children: List[Node]) extends Node
case class Leaf(info: String) extends Node

// assumes `import spark.implicits._` is in scope for toDF
val chk = List(Tree(List(Leaf("hello")))).toDF().as[Tree]

will give you:

[error] org.apache.spark.SparkUnsupportedOperationException: [ENCODER_NOT_FOUND] Not found an encoder of the type Node to Spark SQL internal representation. Consider to change the input type to one of supported at 'https://spark.apache.org/docs/latest/sql-ref-datatypes.html'.

whereas List(Leaf("hello")).toDF().as[Leaf] is fine.
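
A hedged aside, not from the thread: the ENCODER_NOT_FOUND above comes from the children: List[Node] field, since Node is a bare trait with no product structure for Spark's reflection to map. One escape hatch is Spark's binary Kryo encoder, which stores each value as a single binary column (so you lose columnar access to the fields). A sketch, again assuming a SparkSession named spark:

import org.apache.spark.sql.{Encoder, Encoders}

// Opaque binary encoding for the whole ADT. Encoders.kryo is standard Spark
// API; the trade-off is a Dataset with one binary `value` column instead of
// named fields.
implicit val treeEncoder: Encoder[Tree] = Encoders.kryo[Tree]

val ds = spark.createDataset(List(Tree(List(Leaf("hello")))))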

OK - maybe a bad example. Let's just say types are tricky ;-)

jberkel commented 1 year ago

> Correction on my part - it is not a compile-time error but a runtime error, so I'm not sure if we can do it in this library. Maybe walkedTypePath could be used for that.

The fact that this is now a compile-time error rather than a runtime error is already a huge win!

I'm wondering if @implicitNotFound could be used to customize the message?
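
For illustration, a sketch of how @implicitNotFound could attach a hint to the library's type class (the trait body is elided and its members are assumptions; also note the failure surfaces as a missing given for Deserializer[Seq[SerializedSense]], so a static annotation can only offer a generic hint about recursion, not detect the actual cycle):

import scala.annotation.implicitNotFound

@implicitNotFound("No Deserializer found for ${T}. Note that Spark does not support recursive (circular) types, so a case class that contains itself (directly or indirectly) cannot be encoded.")
trait Deserializer[T] // members elided; sketch only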