Open FredrikBakken opened 2 months ago
I've noticed it too lately and would much appreciate this change ❤️
Hey @FredrikBakken, thanks for the suggestion, and apologies for the delayed response.
I’m in favour of this and am happy to implement it in the near future if no one else gets around to it.
Sorry for the delay getting back on this one. I haven't had much time to work on this lately, but I'm OK with this approach since Spark is often a provided dependency in many other setups.
@mitchstockdale I'll have some time this week to do this and want to bundle in another change that was raised, but if you get around to it first, let me know and I'll cut a new release.
Thanks for the suggestions and appreciate the patience!
@mitchelllisle - I haven't had a look at this yet, but I was thinking about it over the last week or so and have an additional proposal to enhance it: #536
Hi 👋
We are currently experimenting with using `sparkdantic` for our Spark schema definitions in our pipelines inside Databricks. However, based on our current configuration, we are bound to installing all dependencies inside notebook scopes rather than at the cluster level. This means we need to run the `!pip install` command for each dependency at the beginning of our notebooks.

We've noticed that the `sparkdantic` installation is taking a lot of time, as it also installs `pyspark` as part of its dependencies, even though Spark is already available inside the Databricks environment. A potential solution for this is to move `pyspark` to become an optional dependency, rather than a mandatory dependency.

Any thoughts on this suggestion?
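For illustration, this is roughly the pattern I had in mind on the library side. It's just a sketch, not `sparkdantic`'s actual code: the function name, error message, and extra name below are made up. The idea is to import `pyspark` lazily and only fail when Spark-specific functionality is actually called:

```python
# Sketch only: the import-guard pattern for an optional pyspark dependency.
# Names here are hypothetical, not sparkdantic's real API.
try:
    from pyspark.sql import types as spark_types
except ImportError:  # pyspark not installed; rely on a provided Spark runtime instead
    spark_types = None


def create_spark_schema(model) -> "spark_types.StructType":
    """Hypothetical entry point that needs pyspark at call time."""
    if spark_types is None:
        raise ImportError(
            "pyspark is required to generate a Spark schema but is not installed. "
            "Install it directly, or rely on the Spark runtime provided by your "
            "environment (e.g. Databricks)."
        )
    # ... translate the pydantic model's fields into Spark types here ...
    return spark_types.StructType([])
```

On the packaging side, `pyspark` could then sit behind an optional extra, so Databricks users install plain `sparkdantic` while everyone else opts in with something like `pip install "sparkdantic[pyspark]"` (exact extra name up for discussion).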