internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/

Docker #360

Open Querela opened 3 years ago

Querela commented 3 years ago

I wrote a Dockerfile for the current version(s). Maybe you want to look into it and integrate it here.
It works for me, but I only have some simple use cases (like API tests with python3), so I do not know how it performs under stress or whether users require more configuration options. (They could theoretically bind-mount other files if required.)

See Docker-Hub: https://hub.docker.com/r/ekoerner/heritrix

My Dockerfile (currently in private repository, so I can't provide any link, just the content here)

ARG java=11-jre

FROM openjdk:${java}

ARG version="3.4.0-20210923"
ARG contrib=0
ARG user="heritrix"
ARG userid=1000

LABEL version=${version}
LABEL contrib=${contrib}
LABEL user=${user}/$userid

# create user
RUN \
    groupadd -g $userid $user && \
    useradd -r -u $userid -g $user $user

# install other requirements (for contrib)
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            youtube-dl && \
        rm -rf /var/lib/apt/lists/* ; \
    fi

WORKDIR /opt

# download latest version according to:
#   https://github.com/internetarchive/heritrix3/releases/tag/3.4.0-20210923
RUN \
    if [ ${contrib} -eq 1 ] ; then \
        wget -O heritrix-contrib-${version}-dist.tar.gz https://repo1.maven.org/maven2/org/archive/heritrix/heritrix-contrib/${version}/heritrix-contrib-${version}-dist.tar.gz && \
        tar xvfz heritrix-contrib-${version}-dist.tar.gz && \
        rm heritrix-contrib-${version}-dist.tar.gz && \
        mv heritrix-contrib-${version} heritrix ; \
    else \
        wget -O heritrix-${version}-dist.zip https://repo1.maven.org/maven2/org/archive/heritrix/heritrix/${version}/heritrix-${version}-dist.zip && \
        unzip heritrix-${version}-dist.zip && \
        rm heritrix-${version}-dist.zip && \
        mv heritrix-${version} heritrix ; \
    fi && \
    chmod u+x heritrix/bin/heritrix && \
    chown -R $user:$user /opt/heritrix

# create a run script because dynamic configuration of credentials
RUN printf '%s\n' \
    '#!/bin/bash' \
    '' \
    '_JOBARGS="-b /"' \
    '' \
    '# set credentials (require both USERNAME and PASSWORD)' \
    '# -a "${USERNAME}:${PASSWORD}"' \
    'if [[ ! -z "$USERNAME" ]] && [[ ! -z "$PASSWORD" ]]; then' \
    '    echo "${USERNAME}:${PASSWORD}" > ${HERITRIX_HOME}/credentials.txt' \
    '    _JOBARGS="$_JOBARGS -a @${HERITRIX_HOME}/credentials.txt"' \
    'elif [[ ! -z "$CREDSFILE" ]]; then' \
    '    _JOBARGS="$_JOBARGS -a @${CREDSFILE}"' \
    'else' \
    '    >&2 echo "No USERNAME and/or PASSWORD environment var set!"' \
    'fi' \
    '' \
    '# check if -r mode' \
    'if [[ ! -z "$JOBNAME" ]]; then' \
    '    >&2 echo "Found JOBNAME envvar, just running job: $JOBNAME"' \
    '    _JOBARGS="$_JOBARGS -r $JOBNAME"' \
    '    if [ ! -f "/opt/heritrix/jobs/$JOBNAME/crawler-beans.cxml" ]; then' \
    '        >&2 echo "Did not find any '"'"'crawler-beans.cxml'"'"' for job '"'"'$JOBNAME'"'"'!"' \
    '    fi' \
    'fi' \
    '' \
    '# run' \
    'exec ${HERITRIX_HOME}/bin/heritrix $_JOBARGS' \
    '' \
    > heritrix.sh && \
    chmod +x heritrix.sh && \
    chown $user:$user heritrix.sh

WORKDIR /opt/heritrix

USER $user

ENV HERITRIX_HOME=/opt/heritrix
# let it run in the foreground, required for docker
ENV FOREGROUND=true

# standard webport
# NOTE: the web UI is only available via HTTPS!
EXPOSE 8443

CMD ["/opt/heritrix.sh"]

Build it:

docker build --build-arg version=3.4.0-20210923 -t heritrix .

Build heritrix-contrib (requires Java 8; with Java 11 (JRE or JDK) I get a JNI error, maybe related to #265?):

docker build --build-arg version=3.4.0-20210923 --build-arg contrib=1 --build-arg java=8-jre -t heritrix-contrib .
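The two build invocations differ only in the contrib flag, the Java base image and the image tag, so they can be wrapped in a small helper. The following is a minimal sketch, not part of the image or the repository: `build_cmd` is a hypothetical name, and it only prints the command instead of running it.

```shell
#!/bin/sh
# Sketch only: print the docker build command for a given variant.
# build_cmd is a hypothetical helper; it echoes rather than runs the command.
build_cmd() {
    variant="$1"                      # "main" or "contrib"
    version="${2:-3.4.0-20210923}"    # default taken from the Dockerfile ARG
    case "$variant" in
        main)
            echo "docker build --build-arg version=$version -t heritrix ."
            ;;
        contrib)
            # contrib is built on the Java 8 base image (Java 11 gave a JNI error)
            echo "docker build --build-arg version=$version --build-arg contrib=1 --build-arg java=8-jre -t heritrix-contrib ."
            ;;
        *)
            echo "unknown variant: $variant" >&2
            return 1
            ;;
    esac
}
```

Running `build_cmd contrib` prints the second command shown above; piping the output to `sh` would execute it.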

Example docker-compose.yml (also on DockerHub currently)

version: "3.7"
services:

  heritrix:
    build: .
    container_name: "heritrix"
    # TEST: keeps the container running without doing anything (for inspections)
    # entrypoint: bash -c 'while :; do :; done & kill -STOP $$! && wait $$!'
    # env_file: .env
    environment:
      - USERNAME=admin
      - PASSWORD=admin
      # optional jobname to run (will only run this single job and exit!)
      # - JOBNAME=myjob
      # - JAVA_OPTS=-Xmx1024M
    init: true
    ports:
      # if you want to use a .env file with `PORT=8443` for example
      # - ${PORT}:8443
      - 8443:8443
    restart: unless-stopped
    volumes:
      # where jobs will be stored
      - job-files:/opt/heritrix/jobs
      # or if JOBNAME envvar is used (mount just the single job folder)
      # jobfolder in the container needs to have the same name as in JOBNAME
      # - ./host_myjob:/opt/heritrix/jobs/myjob

volumes:
  job-files:

UPDATE: I added the -r <jobname> option to my image on Docker Hub. Simply set the JOBNAME=jobname environment variable to run the job jobname. Take care to mount the (preconfigured) job folder into the container, see above. This only works from version 3.4.0-20210803 onwards, see pull request #406.

UPDATE2: I added a contrib image that uses heritrix-contrib. For now it only includes youtube-dl as an extra dependency, and it only works with the Java 8 JRE. The contrib image is only available from version 3.4.0-20210923 onwards.

UPDATE3: I added a custom user to make it a bit more secure (e.g., no package installs are possible anymore). Note that -b / is required to make the web UI reachable when Heritrix runs in the Docker container.
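The way the run script turns the environment variables into Heritrix arguments can be sketched as a standalone function. This mirrors the logic embedded in the Dockerfile above, but it only prints the resulting arguments and does not write the credentials file or start Heritrix. The function name `heritrix_args` and the fallback value for `HERITRIX_HOME` are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: reproduces how the run script assembles the Heritrix arguments.
# heritrix_args is a hypothetical helper name, not part of the image.
heritrix_args() {
    home="${HERITRIX_HOME:-/opt/heritrix}"
    # "-b /" binds the web UI to all addresses
    args="-b /"

    if [ -n "${USERNAME:-}" ] && [ -n "${PASSWORD:-}" ]; then
        # the real script first writes "USERNAME:PASSWORD" into this file
        args="$args -a @${home}/credentials.txt"
    elif [ -n "${CREDSFILE:-}" ]; then
        args="$args -a @${CREDSFILE}"
    fi

    if [ -n "${JOBNAME:-}" ]; then
        # -r runs only the named job and then exits
        args="$args -r ${JOBNAME}"
    fi

    printf '%s\n' "$args"
}
```

With USERNAME=admin, PASSWORD=admin and JOBNAME=myjob set, the function prints `-b / -a @/opt/heritrix/credentials.txt -r myjob`.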

818S commented 3 years ago

+1

ato commented 3 years ago

Just noting that if anyone would like to see a Dockerfile merged, please submit it as a pull request and include the documentation/examples you feel appropriate. I'm willing to merge it and connect it to Docker Hub under the IIPC group, but I don't use Docker much myself, so you'll need to do the legwork and testing. :-)

Querela commented 3 years ago

I find myself unable to really stress-test my own Docker image. It works for some toy samples, but I'm not sure about more involved scenarios and how Docker handles them. Mine was meant more for short-term crawls with a low URL count. 😃 I also think the configuration handling can be improved a lot. In my use case I just needed the most basic things, but I saw use cases on the internet that did much more. So I'm not sure whether my image would make a good "official" image. (But I will still update my Docker Hub images with each new release here. And the code above is my most current version.)

Querela commented 3 years ago

I added the -r <jobname> flag to my image. This option is really nice and makes automation easier. I updated the first comment of the issue.

Querela commented 2 years ago

So, after a request I added a heritrix-contrib Docker image (same Docker Hub URL, just with the :contrib tag). But I had difficulties finding any documentation about the contrib stuff. I found the javadocs, but nowhere was it mentioned how to set it up, what other requirements there are (e.g. for the various extractors, ...) and so on. I also found that it only worked with Java 8 and not with Java 11.

Now my Dockerfile gets to the point that it might make sense to create a pull request. What exactly would be required? I'm especially puzzled about tests since I can do some manual tests but how would I do automated stuff?

ato commented 2 years ago

All I had in mind was a pull request that adds the Dockerfile itself and maybe a section named something like 'Running Heritrix under Docker' with some brief usage instructions to docs/operating.rst. By testing I just meant manually verifying that the instructions work, not automated tests. :-)

Querela commented 2 years ago

Ok. I'm working on it. I extracted the entrypoint script into a separate file, so it is a bit easier to edit. There is also a separate Dockerfile for the heritrix-contrib image, and I added a Makefile to create the images.

I did not yet add a description on how to build the docker image. Would a README.md be enough in the docker folder or a wiki page (currently in my fork only)? I would suggest running docker with the official images, so the image build process uses the maven releases and does not build from the sources again.

I found the following Docker Hub users:

Which should then also be used in the documentation. (instead of just heritrix)

ato commented 2 years ago

Thanks. That looks great.

I've merged it and pushed the main and contrib images to iipc/heritrix. I had intended to automate this with the autobuilder but it seems the free tier of that has been discontinued. I'll look into alternative options but I guess it's not too difficult to build them manually after each release.

I used the IIPC Docker org because the Heritrix "interim" releases are currently maintained by some members of the IIPC community and several of us (including someone from IA) have access to that org.

Querela commented 2 years ago

I can take a look at using GH Actions. It seems to me that the tags correspond to the releases. So we could build the Docker image after a new tag is pushed, or when a new release (tag) has been added. I think it should be possible to extract the current or latest tag to supply the build arg. Alternatively, we could manually update the default release number in the Dockerfile for each release.
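Deriving the build argument from the pushed tag could look roughly like the sketch below. It assumes the ref is available in the form `refs/tags/<version>` (as GitHub Actions provides in `GITHUB_REF` for tag pushes) and that the tag name equals the release version string, as with 3.4.0-20210923. The function name `version_from_ref` is made up for illustration, and the build context and image tag in the trailing comment are assumptions as well.

```shell
#!/bin/sh
# Sketch only: derive the Heritrix version build-arg from a Git tag ref.
# version_from_ref is a hypothetical helper name.
version_from_ref() {
    ref="$1"
    case "$ref" in
        refs/tags/*)
            # strip the "refs/tags/" prefix, leaving only the tag name
            printf '%s\n' "${ref#refs/tags/}"
            ;;
        *)
            echo "not a tag ref: $ref" >&2
            return 1
            ;;
    esac
}

# Possible use inside a workflow step (not executed here; build context and
# image tag are assumptions):
#   version="$(version_from_ref "$GITHUB_REF")" &&
#   docker build --build-arg version="$version" -t "iipc/heritrix:$version" .
```

Rejecting non-tag refs keeps a branch push from accidentally building an image with a branch name as the version.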

Then, we can probably also transfer all the old images from my hub account to the iipc one, if necessary? I will later clear out my hub repo to remove confusion. But no concrete time plan yet.

And thanks for the IIPC explanation. :-)

As for the tags, I had -jre in case a -jdk base image might be added later on, and where subsequent users would want to base their custom images on either one, depending on their requirements and to-be-installed software.

I also added the Docker wiki page. If anyone plans to rename it, please update the link in docker/README.md. I also updated the wiki page "HOWTO Ship a Heritrix Release".