moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License

[FEAT] Scala 2.13 support? #2132

Open kg005 opened 8 months ago

kg005 commented 8 months ago

Is your proposal related to a problem?

I am getting the following error:

24/04/04 14:26:47 WARN TaskSetManager: Lost task 4.0 in stage 538.0 (TID 7052) (10.132.0.177 executor 1): org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (UDFRegistration$Lambda$4595/0x00007f30033f42d8: (string, string) => double).
    at org.apache.spark.sql.errors.QueryExecutionErrors$.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala:217)
    at org.apache.spark.sql.errors.QueryExecutionErrors.failedExecuteUserDefinedFunctionError(QueryExecutionErrors.scala)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.project_doConsume_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.hashAgg_doAggregateWithKeys_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
    at scala.collection.Iterator$$anon$9.hasNext(Iterator.scala:576)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:101)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:139)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.NoSuchMethodError: 'scala.collection.GenMap scala.collection.mutable.Map$.apply(scala.collection.Seq)'
    at uk.gov.moj.dash.linkage.LevDamerauDistance.call(Similarity.scala:265)
    at uk.gov.moj.dash.linkage.LevDamerauDistance.call(Similarity.scala:254)
    at org.apache.spark.sql.UDFRegistration.$anonfun$register$354(UDFRegistration.scala:767)
    ... 18 more

With no prior knowledge of scala, after some exploration of:

Describe the solution you'd like

Building .jar files from https://github.com/moj-analytical-services/splink_scalaudfs for Scala 2.13.

Describe alternatives you've considered

Changing my environment to use Scala 2.12, but I am currently not in a position to change the environment I am running Splink on.
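For anyone else hitting this: the `NoSuchMethodError` on `scala.collection.mutable.Map$.apply` in the trace above is consistent with a jar compiled for one Scala binary version (2.12) being loaded on a runtime that uses another (2.13). Below is a minimal sketch, in Python, of how you might check for that mismatch before registering the UDFs. The helper names are hypothetical and not part of Splink. `runtime_scala_version` relies on PySpark's private `_jvm` gateway, so treat that part as an assumption that may not hold on every setup. The sketch only covers Scala 2.x, where the binary version is the first two components.

```python
def scala_binary_version(full_version: str) -> str:
    """Reduce a full Scala 2.x version such as '2.13.8' to its binary version '2.13'."""
    major, minor = full_version.split(".")[:2]
    return f"{major}.{minor}"


def jar_matches_runtime(jar_scala_binary: str, runtime_scala_version: str) -> bool:
    """True if a jar built for `jar_scala_binary` (e.g. '2.12') matches the runtime."""
    return scala_binary_version(runtime_scala_version) == jar_scala_binary


def runtime_scala_version(spark) -> str:
    """Ask the driver JVM which Scala version Spark is running on.

    Assumption: uses PySpark's private `_jvm` py4j gateway, which is not a
    public API and needs a live SparkSession.
    """
    return spark.sparkContext._jvm.scala.util.Properties.versionNumberString()
```

With a live session you would call something like `jar_matches_runtime("2.12", runtime_scala_version(spark))`. If it returns `False`, the jar needs rebuilding for the runtime's Scala version, which is what this issue asks for.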

RobinL commented 7 months ago

Thanks for the request. We're pretty stretched at the moment so we're unlikely to be able to get round to this soon. If you're willing/able, feel free to do a PR, which would be gratefully accepted!

kg005 commented 7 months ago

Hi @RobinL, here is a PR for the changes needed to build the splink_scalaudfs for Scala 2.13. As I am new to Scala, I would be happy to have it reviewed so I can adjust it as needed.

RobinL commented 7 months ago

@kg005 Thanks very much. Just to say we're taking a look at this. I'm also not a Scala person myself, but the code looks ok to me at least.

One thing we need to be careful with is accepting an external PR that includes the jar, since we have no easy way of knowing whether it contains malicious code. (The diff looks ok, and the code you wrote looks fine btw, so this is no reflection on you, just security policy!)

I'm going to try and get a colleague to build it on their machine. But if you happen to work for somewhere 'trusted' (e.g. UK gov), let me know and it'll make it a little easier - robinlinacre@hotmail.com!

kg005 commented 7 months ago

Thanks for the heads up @RobinL. I understand the policies. Feel free to overwrite the jar with a new version that you build using your own infrastructure.