philterd / phileas

The open source PII and PHI redaction engine
https://www.philterd.ai
Apache License 2.0

Reduce required third-party dependencies for Phileas core by removing `phileas-metrics-service` #122

Closed RobDickinson closed 2 weeks ago

RobDickinson commented 1 month ago

Phileas is pretty svelte with the exception of PhileasMetricsService, which pulls in the io.micrometer packages and, with them, a large number of transitive dependencies.

These transitive dependencies have several negative impacts:

This issue was first identified with phileas-benchmark, which generates a single-jar executable using maven-assembly-plugin. Using the built-in jar-with-dependencies configuration results in a 270MB jar file.

A relatively easy workaround is to use a custom assembly configuration, which reduces the phileas-benchmark jar size to just 37MB but requires explicitly listing every required artifact:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.2.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.2.0 https://maven.apache.org/xsd/assembly-2.2.0.xsd">
    <id>cmd</id>
    <formats>
        <format>jar</format>
    </formats>
    <includeBaseDirectory>false</includeBaseDirectory>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/</outputDirectory>
            <includes>
                <include>ai.philterd:phileas-benchmark:jar:</include>
                <include>ai.philterd:phileas-core:jar:</include>
                <include>ai.philterd:phileas-model:jar:</include>
                <include>ai.philterd:phileas-processors-unstructured:jar:</include>
                <include>ai.philterd:phileas-services-alerts:jar:</include>
                <include>ai.philterd:phileas-services-anonymization:jar:</include>
                <include>ai.philterd:phileas-services-disambiguation:jar:</include>
                <include>ai.philterd:phileas-services-metrics:jar:</include>
                <include>ai.philterd:phileas-services-policies:jar:</include>
                <include>com.googlecode.libphonenumber:libphonenumber:jar:</include>
                <include>io.micrometer:micrometer-core:jar:</include>
                <include>io.micrometer:micrometer-registry-cloudwatch:jar:</include>
                <include>io.micrometer:micrometer-registry-datadog:jar:</include>
                <include>io.micrometer:micrometer-registry-jmx:jar:</include>
                <include>io.micrometer:micrometer-registry-prometheus:jar:</include>
                <include>org.json:json:jar:</include>
            </includes>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>true</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

👆 This works, but it is tricky to maintain if it becomes the norm for Phileas users. The resulting jar is also still larger than necessary when the MetricsService implementation isn't being activated.

The phileas-connector uses a similar include list for building the Trino connector, which is packaged with similar tooling rather than with maven-assembly-plugin.

Refactoring PhileasMetricsService as a dynamically-loaded implementation of MetricsService would keep the io.micrometer dependencies out of the Phileas core -- and open up the possibility of writing other MetricsService implementations (including allowing phileas-connector to publish metrics tables for Trino users, and a "blackhole" or in-memory implementation to use by default).
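
As a rough illustration of the dynamic-loading idea, here is a minimal sketch using the JDK's `java.util.ServiceLoader` with a no-op ("blackhole") fallback. The `MetricsService` interface shown is a simplified, hypothetical stand-in, not the actual Phileas interface, and `MetricsLoaderSketch`, `NoOpMetricsService`, `incrementProcessed`, and `load()` are names invented for this example.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

public class MetricsLoaderSketch {

    // Hypothetical, simplified stand-in for Phileas' MetricsService interface.
    public interface MetricsService {
        void incrementProcessed(long count);
    }

    // "Blackhole" default: accepts every call and records nothing.
    public static final class NoOpMetricsService implements MetricsService {
        @Override
        public void incrementProcessed(long count) {
            // Intentionally empty.
        }
    }

    // Returns the first implementation registered under META-INF/services,
    // or the no-op default when no provider is on the classpath.
    public static MetricsService load() {
        Iterator<MetricsService> found =
                ServiceLoader.load(MetricsService.class).iterator();
        return found.hasNext() ? found.next() : new NoOpMetricsService();
    }

    public static void main(String[] args) {
        MetricsService metrics = load();
        metrics.incrementProcessed(1);
        System.out.println(metrics.getClass().getSimpleName());
    }
}
```

With this kind of lookup, a micrometer-backed implementation could live in a separate artifact and register itself through a `META-INF/services` provider file, so the core would not need io.micrometer on its classpath. When no provider is registered, `load()` falls back to the no-op implementation.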

jzonthemtn commented 1 month ago

@RobDickinson Thanks for typing this up. Agreed that Phileas should be more lightweight. I will take this one on since it might involve moving the metrics stuff out into its own GitHub repository to simplify the code and make it more loosely coupled.

jzonthemtn commented 2 weeks ago

Because Phileas is a library to do redaction, it has to be used from within another application. I don't think it is necessary for Phileas to have an integrated implementation of MetricsService when the application implementer can easily add their own metric collection and have more flexibility when doing so.

The phileas-metrics-service will be removed from Phileas and integrated with Philter so the functionality can still be used, and Phileas will now let users supply their own implementation of MetricsService.
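
For illustration, an application-supplied implementation could be as small as the in-memory counter sketched below. The `MetricsService` interface here is a hypothetical one-method stand-in (the real interface may declare different methods), and `InMemoryMetricsSketch`, `InMemoryMetricsService`, `incrementFilterType`, and `count` are names made up for this example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class InMemoryMetricsSketch {

    // Hypothetical stand-in; the real MetricsService methods may differ.
    public interface MetricsService {
        void incrementFilterType(String filterType);
    }

    // Keeps one thread-safe counter per filter type, entirely in memory.
    public static final class InMemoryMetricsService implements MetricsService {
        private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

        @Override
        public void incrementFilterType(String filterType) {
            counts.computeIfAbsent(filterType, k -> new LongAdder()).increment();
        }

        // Returns the current count for a filter type, or 0 if never seen.
        public long count(String filterType) {
            LongAdder adder = counts.get(filterType);
            return adder == null ? 0L : adder.sum();
        }
    }

    public static void main(String[] args) {
        InMemoryMetricsService metrics = new InMemoryMetricsService();
        metrics.incrementFilterType("EMAIL_ADDRESS");
        metrics.incrementFilterType("EMAIL_ADDRESS");
        metrics.incrementFilterType("PHONE_NUMBER");
        System.out.println(metrics.count("EMAIL_ADDRESS")); // prints 2
    }
}
```

An implementation like this needs nothing beyond the JDK, which is the point of letting the host application decide how metrics are collected.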

Removing the metrics service, however, does not address the size of the jar file. The ONNX Runtime dependencies are responsible for a very large part of the 270 MB jar file.

Wrote #134 to take a better look at the size of the dependencies.