moj-analytical-services / splink

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends
https://moj-analytical-services.github.io/splink/
MIT License
1.32k stars 147 forks

Security exceptions when working with Databricks Unity Catalog #2459

Open aamir-rj opened 1 week ago

aamir-rj commented 1 week ago

What happens?

When working on clusters that are in shared mode on Unity Catalog, Splink throws py security exceptions.

To Reproduce

Image

OS:

Databricks

Splink version:

pip install splink==2.1.14

Have you tried this on the latest master branch?

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

fscholes commented 6 days ago

Incidentally enough, just yesterday I was talking to Databricks about this, and it's because Splink uses custom jars that aren't supported on shared mode/serverless clusters. They told me it is on the roadmap.

The current way around it, other than redeveloping Splink to remove the custom jars, is to use a single-user cluster on Databricks.

RobinL commented 6 days ago

@aamir-rj I'm afraid we don't have access to databricks so can't really help out with these sorts of errors. There are various discussions of similar issues that may (but may not) help:

https://github.com/moj-analytical-services/splink/discussions/2295

Thanks @fscholes! Incidentally, if you ever have a chance to mention it to Databricks, it'd be great if they could simply add a couple of (fairly simple) functions into Databricks itself so the custom jar was no longer needed. It causes people a lot of hassle due to these security issues. The full list of custom UDFs is here: https://github.com/moj-analytical-services/splink_scalaudfs

But probably jaro-winkler would be sufficient for the vast majority of users!
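
For context, here's a minimal sketch of the kind of Java/Scala UDF registration involved. The jar path, UDF name and class name below are illustrative assumptions rather than Splink's exact values (the real UDFs are in the splink_scalaudfs repo linked above), and it is this sort of registerJavaFunction call that shared-mode / Unity Catalog clusters refuse with a security exception:

# Minimal sketch: registering a custom similarity UDF from a jar in PySpark.
# Jar path, UDF name and Java class name are assumptions for illustration only;
# the real UDFs live in https://github.com/moj-analytical-services/splink_scalaudfs
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = (
    SparkSession.builder
    .config("spark.jars", "/path/to/scala-udf-similarity.jar")  # assumed path
    .getOrCreate()
)

# Registering a UDF from an arbitrary jar like this is what shared-mode
# clusters block, hence the security exception.
spark.udf.registerJavaFunction(
    "jaro_winkler_sim",                               # assumed UDF name
    "uk.gov.moj.dash.linkage.JaroWinklerSimilarity",  # assumed class name
    DoubleType(),
)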

pallyndr commented 6 days ago

Are you aware Databricks has an ARC module, based on Splink? https://www.databricks.com/blog/linking-unlinkables-simple-automated-scalable-data-linking-databricks-arc

RobinL commented 6 days ago

Thanks. Yes. As far as I know, it doesn't get around this jar problem (but I would love to be corrected on that!)

fscholes commented 6 days ago

I'm aware of ARC, but I don't think it's been actively developed for quite some time now

aamir-rj commented 3 days ago

Thanks for the replies.

I spoke to Databricks and they asked me to run on a single-user cluster, which works fine.

Thanks

ADBond commented 2 days ago

Another possible way around this for anyone using splink >= 3.9.10 is an option to opt out of registering the custom jars. In Splink 4 this looks like:

from splink import Linker, SparkAPI
...
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
...

See issue #1744 and option added in #1774.
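
A fuller sketch of that workaround, assuming Splink 4's SettingsCreator / comparison_library API and placeholder column names: with register_udfs_automatically=False you also want to stick to comparisons backed by Spark built-ins (e.g. levenshtein, exact match) rather than ones that need the custom jar (e.g. jaro-winkler):

# Sketch only: opts out of the custom jar and uses comparisons that rely on
# Spark built-in functions. Column names are placeholders.
import splink.comparison_library as cl
from splink import Linker, SettingsCreator, SparkAPI, block_on

settings = SettingsCreator(
    link_type="dedupe_only",
    blocking_rules_to_generate_predictions=[block_on("surname")],
    comparisons=[
        cl.LevenshteinAtThresholds("first_name", 2),  # Spark-native levenshtein
        cl.ExactMatch("dob"),
    ],
)

# `spark` is an existing SparkSession; `df` is a Spark DataFrame of input records
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)
df_predictions = linker.inference.predict()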

aamir-rj commented 9 hours ago

I used the below option but always get this error: name 'alt' is not defined.

Tried installing alt but still the same issue.

Image

ADBond commented 7 hours ago

I used the below option but always get this error: name 'alt' is not defined. Tried installing alt but still the same issue.

From the image it looks like you are mixing Splink 2 (splink.Splink) and Splink 4 (splink.SparkAPI) code. The option register_udfs_automatically is only available in Splink 4 and in Splink 3 from version 3.9.10 onwards - there is no equivalent in Splink 2. If you are able to upgrade, I would recommend moving to Splink 4, as then all of the documentation will be applicable.

If you are not able to upgrade then you will need to stick to the single user cluster workaround.

aamir-rj commented 5 hours ago

Nope, I used Splink 4 only.

ADBond commented 5 hours ago

What I mean is that the code you appear to be running is not valid Splink 4 code - there is no longer an object Splink to import, nor a method get_scored_comparisons(). You will need to adjust your code to something like:

from splink import Linker, SparkAPI

...
# Opt out of registering the custom UDF jar (not permitted on shared-mode clusters)
spark_api = SparkAPI(spark_session=spark, register_udfs_automatically=False)
linker = Linker(df, settings, db_api=spark_api)

df_dedupe_result = linker.inference.predict()

As to your error, it appears that you do not have the requirement altair installed, which is why you get the error name 'alt' is not defined. If you install this dependency the error should go away.

aamir-rj commented 3 hours ago

Image

ADBond commented 3 hours ago

@aamir-rj do you have altair installed? It seems from your screenshot that it is not, but hard to tell as not all of the cell output is visible. You can check by seeing what happens if you try to run import altair.

It should install automatically with splink as it is a required dependency - but if not for some reason you can run pip install altair==5.0.1 to get it.
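
For example, a quick check in a notebook cell (just a sketch):

# If this import fails, altair is missing from the environment, which is
# why the name 'alt' is not defined error keeps appearing.
import altair as alt
print(alt.__version__)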