vincenzobaz / spark-scala3

Apache License 2.0

Coursera Scala Course's Capstone uses your library, but it may not work in that setting #48

Open codeaperature opened 12 months ago

codeaperature commented 12 months ago

Hi Vincenzo,

To me, it's unclear how to use your library, and it's possible that the Coursera Scala Course's Capstone (in its build file) points to information that is no longer valid in the README. I posted this to Stack Overflow. This course is hard without being able to do the simple things, so it would be nice if you updated your README to help work out this TypeTags issue. Note that I tried to make the code on Stack Overflow match Spark's advice; I also tried to follow the README, but didn't post that attempt. In the Coursera project, I don't think we can change the build file.

Stefan

vincenzobaz commented 12 months ago

Hi @codeaperature thank you for opening the issue!

To use our encoders, all you need is import scala3encoders.given; the encoders are then available in the implicit scope, and you can obtain a reference with summon.

I can adapt your Stack Overflow snippets as follows:

import scala3encoders.given
import org.apache.spark.sql.Encoder

case class StationX(stnId: Int, wbanId: Int, lat: Double, lon: Double)

object Station extends App:
  val ss = summon[Encoder[StationX]]
  println(ss.schema)

and

package observatory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
import scala.reflect.ClassTag
import scala.deriving.Mirror
import scala3udf.{Udf => udf}
import scala3encoders.given

case class CC(i: Int)
object SparkInstance extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark SQL UDF scalar example")
    .getOrCreate()

  def getSchema[T: Mirror.ProductOf: ClassTag] = summon[Encoder[T]].schema
  val random = udf(() => Math.random())
  val plusOne = udf((x: Int) => x + 1)
  val ss = getSchema[CC]
}

You should not need to write a function such as getSchema.
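For instance, once the given encoders are imported, the schema of a concrete case class can be summoned directly, with no helper function and no SparkSession. A minimal sketch (CC mirrors the case class above, SchemaDemo is just an illustrative name):

import scala3encoders.given
import org.apache.spark.sql.Encoder

case class CC(i: Int)

object SchemaDemo extends App:
  // the Encoder for CC is derived by scala3encoders; .schema is its StructType
  println(summon[Encoder[CC]].schema)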

michael72 commented 12 months ago

I'm a little flustered and worried that an actual course uses Spark together with Scala 3 - I would consider this combination experimental and not suited for beginners (although Scala 3 IMHO is much better than Scala 2).

vincenzobaz commented 12 months ago

@michael72 IIRC the course is offered in both Scala 2 and Scala 3. The assignments were tested in Scala 3 and many students have completed it successfully.

But it has been out for a while; maybe the course manager should investigate whether the Scala 3 version has caused more problems...

codeaperature commented 11 months ago

I finally got back to this (I have a regular Data Eng job too) ... I do not believe the parameters of the project allow me to add extra libraries, and it seems that this part does not work in the project:

.../observatory/src/main/scala/observatory/SparkInstance.scala:8:8 Not found: scala3udf import scala3udf.{Udf => udf}

Maybe I made some other changes. BTW - Did you download the project or just check this in another way?

Since there is no requirement to use Spark, and the assignment actually reads its data from a jarred resource, the course suggestion amounts to stream-loading the data into memory and then pushing it into a Spark DataFrame/Dataset to be processed. I think that's just unnecessary overhead in terms of memory, code, and socket open/close time, so I can simply use parallel collections to do a simple join (see the sketch below).
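As an illustration of that alternative, here is a minimal sketch of joining two in-memory collections with Scala parallel collections. It assumes the org.scala-lang.modules scala-parallel-collections dependency is available; StationRow, TempRow, and the sample data are hypothetical stand-ins, not the assignment's actual types:

import scala.collection.parallel.CollectionConverters.*

// Hypothetical record types standing in for the assignment's CSV rows.
case class StationRow(stnId: Int, lat: Double, lon: Double)
case class TempRow(stnId: Int, tempF: Double)

object ParallelJoin extends App:
  val stations = Vector(StationRow(1, 37.35, -78.43), StationRow(2, 48.85, 2.35))
  val temps    = Vector(TempRow(1, 71.6), TempRow(2, 53.6), TempRow(2, 60.8))

  // Index one side by key, then join the other side in parallel.
  val byId = stations.map(s => s.stnId -> s).toMap
  val joined = temps.par.flatMap(t => byId.get(t.stnId).toList.map(s => (s.lat, s.lon, t.tempF)))

  joined.seq.foreach(println)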

I'm going to drop this issue as I am taking a different path, but I am still curious if Coursera provided a bunk suggestion to use your library without supplying the proper tooling in the build.sbt.

Thanks for your past attention to look into this item.

vincenzobaz commented 11 months ago

I think I understand the issue better now. The assignment does not involve UDFs; @michael72 implemented the udf module long after the release of the course. I could reach out to the new person in charge of the courses and ask them to include the udf dependency.
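For reference, that dependency would be added in build.sbt along these lines. The coordinates and version below are an assumption for illustration; the README is the authoritative source for the exact artifact names:

libraryDependencies ++= Seq(
  // encoder derivation for case classes (assumed coordinates)
  "io.github.vincenzobaz" %% "spark-scala3-encoders" % "<version>",
  // the scala3udf.Udf wrapper used in the snippet above (assumed coordinates)
  "io.github.vincenzobaz" %% "spark-scala3-udf" % "<version>"
)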

I will also ask if other people reported this issue. I am sorry for the frustration this has caused you. I collaborated with the course authors so I know it is not easy to maintain a large codebase and still make it extensible.

codeaperature commented 11 months ago

Yeah - I tried to do some things differently ... for example a UDF to convert degrees C to F, though that could be done in another way. Also, I wanted to use Datasets with StructTypes automatically derived from case classes (both are sketched below).
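A minimal sketch of both ideas, assuming the scala3encoders and scala3udf dependencies are on the classpath; Reading, celsiusToF, and the sample data are made-up names for illustration, not from the course:

import org.apache.spark.sql.SparkSession
import scala3encoders.given
import scala3udf.{Udf => udf}

case class Reading(stnId: Int, tempC: Double)

object TempDemo extends App:
  val spark = SparkSession
    .builder()
    .master("local[*]")
    .appName("temperature demo")
    .getOrCreate()

  // C -> F as a udf, mirroring the plusOne example earlier in the thread
  val celsiusToF = udf((c: Double) => c * 9.0 / 5.0 + 32.0)

  // Dataset whose StructType is derived from the case class via scala3encoders.given
  val ds = spark.createDataset(Seq(Reading(1, 20.0), Reading(2, -5.0)))
  ds.printSchema()

  spark.stop()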

Thanks for looking into this item for me.