tlabs-data / tablesaw-parquet

Parquet IO for Tablesaw
Apache License 2.0

Minimal dependency on Hadoop and its transitive dependencies #69

Closed by aecio 2 years ago

aecio commented 2 years ago

Adding tablesaw-parquet as a dependency in a project brings in several transitive dependencies, including versions of popular libraries such as guava, jetty, jersey, zookeeper, jackson, and many others. It seems that these transitive dependencies are mainly due to the dependency on hadoop-common and hadoop-mapreduce-client-core.

Given that most of these are not necessary for reading the Parquet file format, I wonder how feasible it is to remove these dependencies or at least reduce them to the minimum required.

PS: Thanks for working on and sharing this great library!

ccleva commented 2 years ago

Hi @aecio, thank you for your feedback.

The project certainly comes with a lot of baggage, and you are right that this is mainly due to the hadoop-common and hadoop-mapreduce-client-core dependencies. Unfortunately, both are transitive dependencies (with provided scope, hence their explicit inclusion here) of the parquet-hadoop library we actively use. The project either doesn't compile or fails most tests if we exclude either of them.

Note that getting rid of the dependency on hadoop is an open issue on the parquet-mr JIRA. Until this issue is resolved we need both hadoop libraries for the project to work.

In the meantime, you can exclude some of the transitive dependencies yourself if needed. I was able to run one of my projects using this library while excluding all jetty artifacts coming from hadoop-common with the following pom:

    <dependency>
        <groupId>tech.tablesaw</groupId>
        <artifactId>tablesaw-core</artifactId>
        <version>0.43.1</version>
    </dependency>
    <dependency>
        <groupId>net.tlabs-data</groupId>
        <artifactId>tablesaw_0.43.1-parquet</artifactId>
        <version>0.10.0</version>
        <exclusions>
            <exclusion>
                <groupId>org.eclipse.jetty</groupId>
                <artifactId>*</artifactId>
            </exclusion>
        </exclusions>
    </dependency>

While I am fairly certain that some transitive dependencies could be excluded at the level of this project, I would rather let users exclude them as needed than break an untested use case.

aecio commented 2 years ago

Thanks for your response and for pointing out the JIRA issue. It is unfortunate that the JIRA issue has not seen much progress in more than a year.

The answer to this question on Stack Overflow lists other dependencies that can be excluded. It may be helpful to anyone trying to exclude the Hadoop dependencies manually.

ccleva commented 2 years ago

Thank you for the link @aecio. I confirm that all unit tests pass with this list of dependencies excluded (replace org.mortbay.jetty with org.eclipse.jetty for the current hadoop version). Everything also works fine when using the built library with these dependencies excluded.
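For readers who want to try a broader exclusion list, the fragment below is a minimal sketch of how several groups can be excluded at once in a pom. Only the org.eclipse.jetty exclusion is confirmed in this thread; the org.apache.zookeeper and com.sun.jersey group IDs are illustrative assumptions, not the list from the linked answer, and have not been tested here. Running `mvn dependency:tree` on your own project shows which group IDs are actually pulled in.

```xml
<dependency>
    <groupId>net.tlabs-data</groupId>
    <artifactId>tablesaw_0.43.1-parquet</artifactId>
    <version>0.10.0</version>
    <exclusions>
        <!-- Confirmed in this thread: jetty artifacts coming from hadoop-common -->
        <exclusion>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <!-- Assumption (untested here): zookeeper is not needed to read or write local Parquet files -->
        <exclusion>
            <groupId>org.apache.zookeeper</groupId>
            <artifactId>*</artifactId>
        </exclusion>
        <!-- Assumption (untested here): jersey group ID as used by hadoop-common; verify with mvn dependency:tree -->
        <exclusion>
            <groupId>com.sun.jersey</groupId>
            <artifactId>*</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

Wildcard `artifactId` exclusions require Maven 3.2.1 or later. Add exclusions one group at a time and rerun your tests after each, since a class missing at runtime only shows up when the affected code path is exercised.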

I will add proper documentation on the topic and close this issue, if this is fine with you. We can reopen it later if the situation evolves.

aecio commented 2 years ago

Thanks for confirming, @ccleva. Feel free to close the issue.