vertica / spark-connector

This component acts as a bridge between Spark and Vertica, allowing the user to either retrieve data from Vertica for processing in Spark, or store processed data from Spark into Vertica.

Publish non-uber jars and/or relocate vendor libraries #157

Closed. ches closed this issue 2 years ago.

ches commented 3 years ago

Hi,

It seems the connector is published as an assembly JAR, without relocating packages of the third-party libraries. That's generally a frowned-upon practice.

For me it manifests as a problem because the connector depends on Scalactic in compile scope. Your Scalactic version has a binary incompatibility with an older ScalaTest version that I'm currently limited to using in the test suite of an industrial project. I can't use build tool facilities to exclude your Scalactic version, because it isn't a managed transitive dependency: vertica-spark puts it directly onto the classpath under /org/scalactic. This is hard to debug, since the source of the errors is not the Scalactic version I see in my dependency tree from my ScalaTest version; it's a sneaky one.
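To illustrate (the coordinates and version below are just placeholders for whatever release is in use), the usual sbt exclusion has no effect here, because the Scalactic classes are baked into the assembly JAR itself rather than resolved as a separate artifact:

libraryDependencies += "com.vertica.spark" % "vertica-spark" % "3.0.2" excludeAll ExclusionRule(organization = "org.scalactic")

Ivy/Coursier can only exclude modules they resolve; they can't strip classes that are already inside a published JAR.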

The connector might be able to eliminate its use of Scalactic as a runtime dependency, but that isn't the root issue; the same problem will likely arise with another vendored library for other users.

I understand there is a desire for shaded uber-jar packages in some cases. I believe good practice, and what I'd like to suggest, is:

  1. Publish an artifact with no shading, allowing users to manage transitive dependencies with the facilities of their build tools.
  2. Publish an artifact with shaded dependencies under a classifier such as shaded or assembly, as an expedient option for users that find binary compatibility problems in their builds.
  3. The shaded artifact must relocate all vendor packages to avoid classpath conflicts in user builds.

I feel the first should be the default artifact, with no classifier—it's best to promote modularity as the first option—but you might invert them if you prefer.

I can take a stab at the sbt setup for this, if the maintainers agree; a rough sketch of what I have in mind is below.
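Roughly, I imagine something along these lines in build.sbt (an untested sketch using sbt-assembly; the shade prefix and classifier name are just placeholders):

// Relocate vendored packages inside the assembly JAR so they can't clash
// with the user's own copies (Scalactic shown as one example).
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("org.scalactic.**" -> "com.vertica.spark.shaded.scalactic.@1").inAll
)

// Publish the assembly under an "assembly" classifier, alongside the
// plain (unshaded) JAR that publish produces by default.
Compile / assembly / artifact := {
  val art = (Compile / assembly / artifact).value
  art.withClassifier(Some("assembly"))
}
addArtifact(Compile / assembly / artifact, assembly)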

Thanks for your work on the new connector!

alexr-bq commented 3 years ago

Hey,

Agreed on almost all of this, and happy for you to take a stab at implementing the sbt setup. The one point I would raise is that we probably do want to keep the assembly JAR as the default. The reason is that the connector may be used from a Scala or non-Scala environment where the user just wants to drop in a single JAR file and interface with the connector. Promoting modularity is nice, but that seems harder if someone is trying to use the connector in a PySpark environment, for example. With that said, the dependencies should be shaded properly, and your suggestions here are great.
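For example, in a PySpark job the uber JAR can just be handed to spark-submit (the path and file name are illustrative):

spark-submit --jars /path/to/vertica-spark-assembly.jar my_job.py

With a slim JAR, that same user would have to track down and pass every transitive dependency themselves.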

Thanks.

jeremyprime commented 2 years ago

We now publish a slim JAR alongside the regular uber/shaded JAR; just use the version with -slim appended to it.

For example:

libraryDependencies += "com.vertica.spark" % "vertica-spark" % "3.0.2-slim"
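Assuming the slim POM declares Scalactic as an ordinary compile dependency, the slim artifact also addresses the original problem in this thread: third-party libraries become managed transitive dependencies again, so they can be excluded with the build tool if needed (the Scala suffix below is an assumption about your build):

libraryDependencies += "com.vertica.spark" % "vertica-spark" % "3.0.2-slim" exclude("org.scalactic", "scalactic_2.12")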