lintool / warcbase

Warcbase is an open-source platform for managing and analyzing web archives
http://warcbase.org/

Dockerize Warcbase #239

ianmilligan1 closed this 6 years ago

ianmilligan1 commented 8 years ago

I've used @ibnesayeed's Archive Spark image – discussed here – and we should explore doing something similar with warcbase now that the new warcbase-core setup is stabilized. I think it'd be a really good way to help people install warcbase!

ibnesayeed commented 8 years ago

I should be able to help with this. Please let me know if and when the documentation is updated to reflect the new refactoring, and where I can learn to set up a minimal basic configuration that works with a plain file system on a single machine. This doesn't mean that clusters, HDFS, and other complex setups are not possible with Docker, but such configurations cause confusion and friction for new users. Docker's philosophy is "batteries included, but replaceable", which means suitable out-of-the-box defaults are good to have. For inspiration, the "under the hood" section of the above-mentioned blog post, where the build process is explained, should be useful.

ianmilligan1 commented 8 years ago

These installation instructions are up to date and, in theory, comprehensive, although there may be issues. They're designed to get a basic warcbase version working with a plain file system on a personal system.

Everything should have been updated for the new refactoring, and we've tested these instructions. That said, if you do catch anything, feedback always welcome!

ibnesayeed commented 8 years ago

I was able to dockerize it, but I still need to make it prettier and easier by cleaning up a few things, configuring the notebook, and perhaps automating a couple of things, such as somehow loading the fatjar file when the notebook is instantiated. I am not sure whether the --jars flag would work or not, or whether I need to customize something in the notebook's config file. Another thing I don't see is the automatic plot that is shown in the Getting Started guide. What do I need to do in order to get that?

ryanfb commented 7 years ago

For what it's worth, I made an experimental Dockerfile and Docker image based on @ianmilligan1's Warcbase install guide. It's based off the official Maven Docker image.

Once the automated Docker Hub build is finished, you should be able to run:

docker run -ti ryanfb/warcbase

This drops you into an interactive Spark shell with Warcbase.

ibnesayeed commented 7 years ago

The Spark shell version was something that I built a long time ago, but I was trying to make the Spark Notebook based image available as well (in fact, a single image with both, with the notebook as the default). I was able to do that as well, except that I could not get rid of the cp: /path/to/fat.jar line from each notebook, as discussed in https://github.com/spark-notebook/spark-notebook/issues/674. This is not really a big deal, but I feel that this information does not belong in the notebook file and should be hidden from the user.

anjackson commented 7 years ago

@ryanfb Snap! I've been working on a branch that updates the Scala/Spark/Hadoop/HBase versions and got it running. The Spark Notebook worked fine using EXTRA_CLASSPATH locally (rather than cp: /path/to/fat.jar), and I'm now trying to finish off the Dockerfile.

I'm also going to try getting it to run in Jupyter (via Apache Toree) and Apache Zeppelin rather than Spark Notebook. I think it should be fine as long as all the versions match up.

ianmilligan1 commented 7 years ago

This is great @ryanfb (and thanks again @ibnesayeed for your earlier work too!). Works on OS X. Is there a way to pass flags to the spark-shell command (I'm not familiar with Docker)? At times one needs to manually pass ./bin/spark-shell some extra memory through a flag.

And awesome @anjackson!

ryanfb commented 7 years ago

@ianmilligan1 I've just pushed a commit that should change the image so that it will support that: https://github.com/ryanfb/warcbase/commit/1a3b63dd1b663e634b896c6d27f28510cd8abf05

Once that build is finished, you should be able to run docker pull ryanfb/warcbase (to update your image if you've already run it before) and then use:

docker run -ti ryanfb/warcbase --example-extra-flag "goes here"

If you need to override the entrypoint to get into the container and/or run a completely different command inside it, you should be able to use the --entrypoint argument for docker run.

ibnesayeed commented 7 years ago

@ianmilligan1: "Is there a way to pass flags to the spark-shell command (I'm not familiar with Docker), as at times one needs to manually pass ./bin/spark-shell some extra memory through a flag."

It really depends on how the image is configured, especially the ENTRYPOINT and CMD directives of the Dockerfile. The general answer would be "yes", but the proper way to do so can only be described after knowing how the image was built. See Docker's documentation on how CMD and ENTRYPOINT interact.
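
As a rough illustration of that interaction, here is a minimal sketch. It is not the Dockerfile of any image mentioned in this thread: the base image tag, the Spark and fatjar paths, and the default memory value are all assumptions, and the steps that would install Spark and build Warcbase are left out.

```dockerfile
# Hypothetical sketch: base tag, paths, and defaults below are assumptions.
FROM maven:3-jdk-8

# (Steps that install Spark and build the Warcbase fatjar are omitted here.)

# Exec-form ENTRYPOINT: the fixed part of the command that always runs.
ENTRYPOINT ["/spark/bin/spark-shell", "--jars", "/warcbase/warcbase-fatjar.jar"]

# CMD: default arguments appended to ENTRYPOINT. Anything given after the
# image name in `docker run` replaces these defaults.
CMD ["--driver-memory", "2g"]

# How the pieces combine at run time (<image> is a placeholder):
#   docker run -ti <image>
#     runs: spark-shell --jars ... --driver-memory 2g
#   docker run -ti <image> --driver-memory 8g
#     runs: spark-shell --jars ... --driver-memory 8g
#   docker run -ti --entrypoint /bin/bash <image>
#     runs: bash (ENTRYPOINT replaced, image CMD dropped)
```

With an image laid out like this, extra memory can be requested simply by appending a spark-shell flag after the image name, without touching the entrypoint.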

ibnesayeed commented 7 years ago

@anjackson, while you are on it, I would encourage utilizing the new Multi-Stage Build feature of Docker.
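
A multi-stage build could look roughly like the sketch below. The image tags, the source layout, and the `*-fatjar.jar` file name pattern are assumptions for illustration only; the actual build may differ.

```dockerfile
# Hypothetical sketch: tags, paths, and the jar name pattern are assumptions.

# Stage 1: build the fatjar with Maven. This stage carries the full JDK,
# Maven, and the local dependency cache, none of which are needed at run time.
FROM maven:3-jdk-8 AS build
WORKDIR /src
COPY . .
RUN mvn clean package -DskipTests

# Stage 2: a smaller runtime image that receives only the built artifact.
FROM openjdk:8-jre
COPY --from=build /src/warcbase-core/target/*-fatjar.jar /opt/warcbase/
```

Only the final stage ends up in the published image, so the build tooling does not add to the size that users have to pull.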

ianmilligan1 commented 6 years ago

With the deprecation of warcbase and the migration to AUT, this can now be found at https://github.com/archivesunleashed/docker-aut.