tmills / ctakes-docker

Apache License 2.0
ctakes containers are still too big #4

Open tmills opened 7 years ago

tmills commented 7 years ago

@MatthewVita asked in his last pull request:

One question for you (I can do this in a separate PR): should we just commit in the cTAKES zip artefacts (can COPY them in via Dockerfile)? The download site takes forever to pull them down. I realize this may not be a best practice, but...

I would like to do something about this, but I'm not crazy about adding even more jars (I'd like to remove the jars currently checked in at some point). It might be possible to pick just the individual jars we need with wget from the Apache servers. Those servers may still be slow, but avoiding the dependency parser alone would cut 250 MB from the download size.
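
A rough sketch of the per-jar idea, not something in the repo today. It assumes the jars we need are also published in a repository that follows the standard Maven layout (`group/artifact/version/artifact-version.jar`); the default base URL, the `jar_url` and `fetch_jars` helper names, and the manifest file format are all hypothetical and only here for illustration.

```shell
#!/bin/sh
# Assumption: jars are served from a repository using the standard Maven layout.
# Override MAVEN_BASE to point at whichever mirror actually hosts them.
MAVEN_BASE="${MAVEN_BASE:-https://repo1.maven.org/maven2}"

# Build the download URL for one jar from its Maven coordinates.
# Usage: jar_url GROUP_ID ARTIFACT_ID VERSION
jar_url() {
    # Dots in the group id become path separators.
    group_path=$(printf '%s' "$1" | tr '.' '/')
    printf '%s/%s/%s/%s/%s-%s.jar\n' \
        "$MAVEN_BASE" "$group_path" "$2" "$3" "$2" "$3"
}

# Download every jar listed in a manifest file into a destination directory.
# Manifest format (hypothetical): one "group artifact version" triple per line.
# Usage: fetch_jars MANIFEST DEST_DIR
fetch_jars() {
    manifest=$1
    dest=$2
    mkdir -p "$dest"
    while read -r group artifact version; do
        # Skip blank lines and comment lines.
        case "$group" in ''|'#'*) continue ;; esac
        wget -q -P "$dest" "$(jar_url "$group" "$artifact" "$version")" || return 1
    done < "$manifest"
}
```

With a manifest containing a line such as `org.apache.ctakes ctakes-core 4.0.0` (coordinates shown only as an illustration), `fetch_jars jars.txt lib` would pull just the listed jars into `lib/`, so anything left off the list, like the dependency parser, is never downloaded.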

tmills commented 7 years ago

Or mavenize everything and let Maven figure out which jars to grab? IDK if it's standard to include Maven in containers; it certainly has a heavy enough footprint on its own.

MatthewVita commented 7 years ago

I've used Maven in a containerized setting. Sounds like a great idea because the Maven Central servers are fast. However, it may not help at all with the container size problem. Hmm.

tmills commented 6 years ago

Looked into Maven a bit. It can help us with the jars but probably not with the UIMA and cTAKES downloads. Since it downloads the entire internet to compile one Java class, I doubt it's faster or smaller than the way it's set up now.

MatthewVita commented 6 years ago

Agreed

MatthewVita commented 6 years ago

Taking a step back, I don't think the container size is actually the issue here. It's more the download times. For instance, the Apache servers that we download from for the pipeline image take forever. Perhaps we can just "pull the pain forward" and commit the files into the repo and COPY them?