Closed aecio closed 2 years ago
Hi @aecio, thank you for your feedback.
The project does come with a lot of baggage, and you are right that this is mainly due to the hadoop-common and hadoop-mapreduce-client-core dependencies. Unfortunately, both are transitive dependencies (with provided scope, hence the inclusion here) of the parquet-hadoop library we are actively using. The project either doesn't compile or fails most tests if we exclude either of these dependencies.
Note that getting rid of the dependency on hadoop is an open issue on the parquet-mr JIRA. Until this issue is resolved we need both hadoop libraries for the project to work.
In the meantime, you can exclude some of the transitive dependencies yourself if needed. I was able to run one of my projects using this library while excluding all jetty artifacts coming from hadoop-common with the following pom:
<dependency>
  <groupId>tech.tablesaw</groupId>
  <artifactId>tablesaw-core</artifactId>
  <version>0.43.1</version>
</dependency>
<dependency>
  <groupId>net.tlabs-data</groupId>
  <artifactId>tablesaw_0.43.1-parquet</artifactId>
  <version>0.10.0</version>
  <exclusions>
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
While I am fairly certain some transitive dependencies could be excluded at the project level, I would rather let users exclude them as needed than risk breaking an untested use case.
Thanks for your response and for pointing out the JIRA issue. It is unfortunate that the JIRA issue has not seen much progress in more than a year.
The answer to this question on Stack Overflow lists other dependencies that can be excluded. This link may be helpful to anyone trying to exclude the Hadoop dependencies manually.
Thank you for the link, @aecio. I confirm that all unit tests pass with this list of dependencies excluded (replace org.mortbay.jetty with org.eclipse.jetty for the current hadoop version). Everything also works fine when using the built library with these dependencies excluded.
I will add proper documentation on the topic and close this issue, if this is fine with you. We can reopen it later if the situation evolves.
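For anyone following along, here is a minimal sketch of how several exclusions can be stacked on the same dependency. The org.eclipse.jetty entry comes from the pom earlier in this thread. The com.sun.jersey and org.apache.zookeeper entries are assumptions, picked only because jersey and zookeeper are named in the issue description; they are not a copy of the Stack Overflow list, and they have not been checked against this library.

```xml
<dependency>
  <groupId>net.tlabs-data</groupId>
  <artifactId>tablesaw_0.43.1-parquet</artifactId>
  <version>0.10.0</version>
  <exclusions>
    <!-- From this thread: jetty artifacts pulled in through hadoop-common -->
    <exclusion>
      <groupId>org.eclipse.jetty</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <!-- Assumption: illustrative entry, verify with your own tests -->
    <exclusion>
      <groupId>com.sun.jersey</groupId>
      <artifactId>*</artifactId>
    </exclusion>
    <!-- Assumption: illustrative entry, verify with your own tests -->
    <exclusion>
      <groupId>org.apache.zookeeper</groupId>
      <artifactId>*</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Running mvn dependency:tree before and after the change shows which artifacts the exclusions actually removed. Whether a given exclusion is safe depends on the code paths your project exercises, so it is worth re-running your tests after each one.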
Thanks for confirming, @ccleva. Feel free to close the issue.
Adding tablesaw-parquet as a dependency in a project brings in several transitive dependencies, including versions of popular libraries such as guava, jetty, jersey, zookeeper, jackson, and many others. It seems that these transitive dependencies are mainly due to the dependency on hadoop-common and hadoop-mapreduce-client-core. Given that most of these are not necessary for reading the Parquet file format, I wonder how feasible it is to remove these dependencies or at least reduce them to the minimum required.
PS: Thanks for working on and sharing this great library!