target / data-validator

A tool to validate data, built around Apache Spark.

Upgrade to modern spark (3.4) #166

Closed · holdenk closed this issue 3 months ago

holdenk commented 11 months ago

Hey y'all, I poked around in the project. I don't know if this is still being actively developed, but if folks are interested, I've upgraded it to Spark 3.4. I've only done the tests with the built-in test suite, and it's my first time poking around in the project, so there could be some mistakes.

colindean commented 11 months ago

I really appreciate the contribution — first outside contribution in some time! For our purposes, DV must continue working with Spark 2.3 and JDK8, so we can't bump versions as you have proposed.

Can you think of a way to test/qualify this against both Spark 2.3 and 3.4? As a potential starting point, I started on crossbuilding last year but haven't had the bandwidth to complete the work.

holdenk commented 11 months ago

So one lazy option would be to make a spark 3 and a spark 2 branch and publish a "-3" version of the artifact rather than a "proper" cross build.
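One hypothetical way that "-3" artifact suffix could look in sbt (the setting shown here is an illustration, not from this repo's actual build):

```scala
// Hypothetical: on a dedicated spark-3 branch, publish under a suffixed
// module name so the Spark 2 and Spark 3 artifacts can coexist in a repository.
moduleName := "data-validator-3"
```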

c-horn commented 11 months ago

I could potentially take a look at finishing the cross-build work, @colindean. Are there still users of this internally at Target who can't switch to Spark 3.X + Docker?

Also many thanks @holdenk for keeping this project / others in the spark ecosystem in mind.

holdenk commented 11 months ago

So I was thinking the main "annoying" thing in a cross build for this is probably the tests and maybe the quoted identifiers. We could probably make it cross build easily, although my default is lazy (branches).

colindean commented 11 months ago

> So one lazy option would be to make a spark 3 and a spark 2 branch and publish a "-3" version of the artifact rather than a "proper" cross build.

That's a good idea, but it introduces some operational complexity to be recalled at release time. I prefer the technical complexity of an sbt crossbuild, especially if…

> I could potentially take a look at finishing the cross-build work @colindean

> So I was thinking the main "annoying" thing in a cross build for this is probably the tests and maybe the quoted identifiers. We could probably make it cross build easily, although my default is lazy (branches).

❤️ Yes that would be awesome. I don't remember what was going awry when I left off with it. I welcome the contribution, including pairing if either of you'd like (or you could pair the two of you if you'd like).

> Are there still users of this internally at Target who can't switch to Spark 3.X + docker?

I've not seen pressure from DV users to support Spark 3, but most DV users who have identified themselves are internal to Target and stuck on our Spark 2.3 cluster while we await the production launch of our internal platform enabling use of Spark 3 + Docker.

> Also many thanks @holdenk for keeping this project / others in the spark ecosystem in mind.

Yes, indeed! I'm very happy to see this and your continued work in the space, and to see another former IBMer around!

holdenk commented 11 months ago

So my other project, where I do more extensive cross-building, is https://github.com/holdenk/spark-testing-base/blob/main/build.sbt . It keeps the actual Spark versions outside of sbt (passed in as a parameter) and does a "classic" sbt cross-build for the Scala versions. What do folks think of something like that?
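As a rough illustration of that layout (the setting name, default version, and Scala versions below are assumptions for the sketch, not taken from either project's actual build.sbt), the Spark version can come in as a JVM system property while sbt cross-builds only over Scala versions:

```scala
// Sketch: Spark version supplied from outside sbt (e.g. sbt -DsparkVersion=3.4.1),
// while the classic cross-build handles Scala versions only.
val sparkVersion = settingKey[String]("The version of Spark to build against")
sparkVersion := sys.props.getOrElse("sparkVersion", "3.4.1")
crossScalaVersions := Seq("2.12.18", "2.13.12")
libraryDependencies += "org.apache.spark" %% "spark-sql" % sparkVersion.value % "provided"
```

Running `sbt +test` would then exercise each Scala version against whichever Spark version was passed in, so CI can fan out over a Spark-version × Scala-version matrix.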

Also thanks for being so friendly y'all :)

holdenk commented 11 months ago

Also, slightly off topic (in the context of having a large pre-Spark 3 codebase): myself and some of my coworkers have been working on some tooling at https://github.com/holdenk/spark-upgrade that might be useful once y'all have Spark 3 ready for use :)

holdenk commented 11 months ago

This is a bit of work to see if folks agree with the direction; if so, I'll update the tests and such :)

holdenk commented 10 months ago

Hey just following up @c-horn do you have some bandwidth to review this?

c-horn commented 10 months ago

Hi, sorry, my free-time plans got ahead of me; I will be out of the country for a few weeks starting Wednesday. I will try to take a look at this tonight or tomorrow.

colindean commented 6 months ago

It's been a while but I'm hoping to revisit this mid-February. I shook out some backed-up update PRs and will take a close look at this when I return from FOSDEM — @holdenk, I noted that you're speaking there and I hope to attend your talk and meet you!

holdenk commented 6 months ago

Sounds awesome, maybe we can grab some coffee :)

colindean commented 3 months ago

Good news! The requirement to support Spark 2 is now gone as my org is moving to Spark 3 by the end of May. I'm going to return to this in the coming days/weeks as I've gotten a few requests for it.

colindean commented 3 months ago

We're going into Spark 3 at v3.5.1 and still supporting Scala 2.12 and 2.13.

colindean commented 3 months ago

#186 sets a new baseline. I think I want to get this merged (probably with the suggestion to move to Spark 3.5.1) and then continue from there.

jaspal1404 commented 3 months ago

@colindean @holdenk - Here is the change that is required in build.sbt to make DV work with Spark 3.x. I tested it and it's working:

current circe-yaml version: 0.14.2
new circe-yaml version: 1.15.0

Change line 35 in build.sbt from 0.14.2 to 1.15.0.

Another thing I changed was line 6, as below (it may not be needed; I wasn't sure how we are getting the Spark version from the env):

```scala
sparkVersion := "3.5.1"
```
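Taken together, the two edits described above amount to a build.sbt fragment along these lines (the surrounding settings and exact line positions are assumptions; only the circe-yaml coordinate and the versions come from the comment):

```scala
// Hypothetical fragment combining the two build.sbt changes described above.
sparkVersion := "3.5.1"                                        // line 6: pin Spark 3.x
libraryDependencies += "io.circe" %% "circe-yaml" % "1.15.0"   // line 35: was 0.14.2
```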

colindean commented 3 months ago

Thank you so much, @holdenk, for submitting this. I'm sorry it took a while for priorities to bubble up. I'm glad that we were able to merge it eventually.