Open Querela opened 3 years ago
+1
Just noting that if anyone would like to see a Dockerfile merged, please submit it as a pull request and include the documentation/examples you feel appropriate. I'm willing to merge it and connect it to Docker Hub under the IIPC group, but I don't use Docker much myself, so you'll need to do the legwork and testing. :-)
I find myself unable to really stress-test my own docker image. It works for some toy samples but I'm not sure about more involved scenarios and how docker handles them. Mine was more for short-term crawls with a low URL count. 😃 I also think the configuration handling can be improved by a lot. In my use case I just needed the most basic things but I saw use cases on the internet that did much more. So, I'm not sure whether my image would be a good "official" image. (But I will still update my dockerhub images with each new release here. And the code above is my most current version.)
I added the -r <jobname> flag into my image. This option is really nice and makes automation easier.
I updated the first comment of the issue.
So, after a request I added a heritrix-contrib docker image (same docker hub URL, just the :contrib tag). But I had difficulties finding any documentation about the contrib stuff. I found the javadocs but nowhere was it mentioned how to set it up, what other requirements there are (e.g. for the various extractors, ...) and so on. I also found that it only worked with Java 8 and not with Java 11.
Now my Dockerfile gets to the point that it might make sense to create a pull request. What exactly would be required? I'm especially puzzled about tests since I can do some manual tests but how would I do automated stuff?
All I had in mind was a pull request that adds the Dockerfile itself and maybe a section named something like 'Running Heritrix under Docker' with some brief usage instructions to docs/operating.rs. By testing I just meant manually verifying that the instructions work, not automated tests. :-)
Ok. I'm working on it.
I did extract the entrypoint script outside, so it is a bit easier to edit. And a separate Dockerfile for the heritrix-contrib image.
And I added a Makefile to create the images.
I did not yet add a description on how to build the docker image. Would a README.md be enough in the docker folder or a wiki page (currently in my fork only)?
I would suggest running docker with the official images, so the image build process uses the maven releases and does not build from the sources again.
I found the following Docker Hub users:
Which should then also be used in the documentation (instead of just heritrix).
Thanks. That looks great.
I've merged it and pushed the main and contrib images to iipc/heritrix. I had intended to automate this with the autobuilder but it seems the free tier of that has been discontinued. I'll look into alternative options but I guess it's not too difficult to build them manually after each release.
I used the IIPC Docker org because the Heritrix "interim" releases are currently maintained by some members of the IIPC community and several of us (including someone from IA) have access to that org.
I can take a look at using GH Actions. It seems to me that the tags correspond to the releases. So, build the docker image after a new tag is pushed, or when a new release (tag) has been added. I think it should be possible to extract the current or latest tag to supply the build arg. Or alternatively, manually update the standard release number for each release in the Dockerfile.
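A minimal sketch of the tag-to-build-arg idea, assuming the workflow runs on a tag push so that GitHub Actions sets GITHUB_REF to something like refs/tags/3.4.0-20210923. The build-arg name HERITRIX_VERSION and the function names are hypothetical, not taken from the actual Dockerfile. The script only prints the docker command (a dry run) so it can be tried without Docker installed.

```shell
#!/bin/sh
# Sketch only: derive the release version from the Git ref of a tag push.
# GITHUB_REF is provided by GitHub Actions; HERITRIX_VERSION is an assumed
# build-arg name, not necessarily the one used in the real Dockerfile.

version_from_ref() {
    # Strip the leading "refs/tags/" prefix, leaving just the tag name.
    printf '%s\n' "${1#refs/tags/}"
}

build_command() {
    version=$(version_from_ref "$1")
    # Print the command instead of running it (dry run).
    printf 'docker build --build-arg HERITRIX_VERSION=%s -t iipc/heritrix:%s .\n' \
        "$version" "$version"
}

# Fall back to an example tag when not running inside GitHub Actions.
build_command "${GITHUB_REF:-refs/tags/3.4.0-20210923}"
```

In a workflow step, the printed line could be executed directly once the build-arg name matches whatever the Dockerfile declares.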
Then, we can probably also transfer all the old images from my hub account to the iipc one, if necessary? I will later clear out my hub repo to remove confusion. But no concrete time plan yet.
And thanks about the IIPC explanation. :-)
As for the tags, I had -jre in case a -jdk base image might be added later on, and where subsequent users would want to base their custom images on either one, depending on their requirements and to-be-installed software.
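To illustrate the naming scheme, here is a small sketch that appends the base-image flavor to a release version. The exact tag layout (version, dash, flavor) and the helper name image_tag are assumptions for illustration; a -jdk flavor is only a possible future addition.

```shell
#!/bin/sh
# Sketch only: compose an image tag from a release version and a flavor.
# The "<version>-<flavor>" layout is an assumption about the tag format.

image_tag() {
    version=$1
    flavor=${2:-jre}   # default to the JRE-based image
    printf '%s-%s\n' "$version" "$flavor"
}

image_tag 3.4.0-20210923        # prints 3.4.0-20210923-jre
image_tag 3.4.0-20210923 jdk    # prints 3.4.0-20210923-jdk (hypothetical flavor)
```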
Then, I also added the Docker wiki page. If anyone plans to rename it, please update the link in docker/README.md.
I updated wiki: HOWTO Ship a Heritrix Release.
I wrote a Docker file for the current version(s). Maybe you want to look into it and integrate it here.
It works for me but I only have some simple use-cases (like API tests with python3), so I do not know how it performs under stress. And whether users require more configuration options. (But they could theoretically bind-mount other files if required.)
See Docker-Hub: https://hub.docker.com/r/ekoerner/heritrix
My Dockerfile (currently in private repository, so I can't provide any link, just the content here):
Build it:
Build heritrix-contrib (requires Java 8, with Java 11 (JRE/JDK) some JNI error, maybe related to #265?):
Example docker-compose.yml (also on DockerHub currently):
UPDATE: I added the -r <jobname> option to my image on dockerhub. Simply set the JOBNAME=jobname environment variable to run the job jobname. Take care to mount the (preconfigured) job folder into the image, see above. Only works from version 3.4.0-20210803, see pull request #406.
UPDATE2: I added a contrib image that uses heritrix-contrib. For now it only includes youtube-dl as extra dependency and it only works with Java 8 JRE. The contrib image is only available from version 3.4.0-20210923.
UPDATE3: Added a custom user to make it a bit more secure (e.g., no package installs possible anymore). Note that -b / is required to make the web UI visible in the docker image.
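As a rough sketch of how an entrypoint script could combine these options: -a (web admin credentials), -b (bind hosts) and -r (run job) are Heritrix launcher options, while the USERNAME and PASSWORD variables, their placeholder defaults, and the /opt/heritrix install path are assumptions and may differ from the actual image. The script prints the resulting command line rather than starting Heritrix.

```shell
#!/bin/sh
# Sketch only: assemble Heritrix launcher arguments from environment variables.
# USERNAME/PASSWORD names and defaults are assumed placeholders; change them
# in any real deployment. JOBNAME follows the description in this issue.

heritrix_args() {
    # -b / binds the web UI to all interfaces, so it is reachable
    # from outside the container.
    args="-a ${USERNAME:-admin}:${PASSWORD:-admin} -b /"
    # Only add -r when a job name was supplied.
    if [ -n "${JOBNAME:-}" ]; then
        args="$args -r $JOBNAME"
    fi
    printf '%s\n' "$args"
}

# Dry run: show the command line (install path is hypothetical).
echo "/opt/heritrix/bin/heritrix $(heritrix_args)"
```

With JOBNAME unset, the launcher would start only the web UI; with JOBNAME=myjob it would also get -r myjob, which presumes the job folder has been mounted into the container.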