jonaswinkler / paperless-ng

A supercharged version of paperless: scan, index and archive all your physical documents
https://paperless-ng.readthedocs.io/en/latest/
GNU General Public License v3.0

Automatically build container images with CI/CD #151

Closed Scrumplex closed 3 years ago

Scrumplex commented 3 years ago

I would love to see multiple architectures supported with the docker container. Sadly it currently only supports amd64 which is probably because the image is built on a PC. I would suggest automatic builds and deployments of various images using a CI/CD platform (for example GitLab CI).

Notably I would suggest the following CPU architectures for the image(s):

Maybe we could also throw an x86 build in there if anyone wants it.

Open questions

  1. What CI to use?
  2. Which Container Registry to use? (As Docker Hub is getting more and more restrictive, we might be able to use an alternative like quay.io or GitLab Container Registry)
  3. Should we provide an Alpine-based image in addition to the Debian-based image?

qcasey commented 3 years ago

This is the kind of project that Raspberry Pi users would get a lot of value from. A Docker image would be great.

jonaswinkler commented 3 years ago

Hey, and thanks for your interest in the project!

I'd like to get this automated and also have automated builds for armv7 and aarch64.

I don't have all that much experience with CI systems and don't really know how to publish multi arch images on the hub, but it would certainly be much better for Raspberry Pi users than what we currently have.

Regarding Alpine: what's the advantage of that? A major part of the image consists of PDF and image processing libraries, numerical processing libraries for the machine learning part, and their dependencies, so it won't get all that much smaller. Here's another reason why Alpine won't work (that well): paperless depends on numpy, and that comes with binaries. On Alpine, apparently due to the way they packaged/compiled libc, the pre-compiled wheels of numpy cannot be used, so numpy has to be compiled from source. That's okay-ish on fast PCs, but on a Raspberry Pi, that's an absolute no-go. It literally takes a day to build.

Furthermore, some previous versions of the Dockerfile included the build of the front end as another stage. That worked, but not on Raspberry Pi, since the compiling tools are rather resource hungry. Therefore, I'm using a shell script right now, which compiles the front end, copies files into place and then builds the image from that. Far from ideal, since that involves a lot of manual effort. However, the resulting archives build fine and reasonably fast on RPi.

Questions I have regarding this.

If someone wants to help me out with this, I'm all ears, but I'd like to focus on actually getting features implemented and tested.

Scrumplex commented 3 years ago

Okay, scrap that about Alpine. The advantage is mostly its low resource usage (both memory and storage), but it would make things much more difficult with this wide set of dependencies.

Now about your specific question: Most CI systems offer you a wide selection of environments. In our case we just need a build runner that's on armv7 and aarch64 respectively. While ARM may not be as universal as x86 (and amd64), the binaries themselves are. So that should be really easy.

As Dockerfiles are generally only used for packaging binaries, building the project itself would still be handled from the outside.

Take this example I would create on GitLab CI:

GitLab CI has multiple stages. In our case we would have build, package and deploy stages. In the build stage we build a fresh copy of the project. Traditionally this would be where we compile the project itself. In this case we would probably only need to build the frontend. The next stage, package, would be the docker step, meaning we run docker build against our previously prepared / compiled project. Lastly, there is the deploy step, where we just need to push our images to a container registry, and then we are done.

Now there are many more CI systems out there and from a quick look I don't think GitLab CI is suitable here, as I don't see any public ARM GitLab runners. So that would have to be self hosted which is not sustainable for a FOSS project long term.

Maybe someone from the community has an idea. I think Travis CI (which you are using currently anyway) provides multiple architectures.

Tooa commented 3 years ago

My two cents: There is a group of guys called LinuxServer.io who build and maintain the largest collection of Docker images on the web. These guys do great work and maintain tons of images. Their images are streamlined and regularly updated.

A few days ago, I asked about a multi-architecture image here, because I know some guys from the group are associated with the linked repository. In the same vein, you could reach out to the guys via their discord server and ask for a paperless-ng image.

Instead of going our own way here, we could even contribute a paperless-ng image by using their docker build infrastructure and principles - in case they don't have enough resources to create the image on their own.

jonaswinkler commented 3 years ago

Thanks for the heads up. I'll keep that in mind and will contact them for advice once we get most of the bugs and features ironed out.

shamoon commented 3 years ago

Currently, does the docker build get automatically updated with releases?

jonaswinkler commented 3 years ago

No, but it's all in a script, so not all that hard.

MarkSchmitt commented 3 years ago

The original paperless project has an open, discussed and never merged pull request to create multi-arch docker images with travis that I created a couple of months ago. I've extended the existing travis ci and added use of the travis arm64 (native) support and used qemu for other, non-x86 archs. In theory, all docker archs supported by qemu can be created this way. It's not super clean, the images are fine I think, the pipeline could be optimized a bit I guess.

You can find logs from the pipelines here: https://travis-ci.org/github/MarkSchmitt/paperless

The resulting images here: https://hub.docker.com/repository/docker/moztr/paperless-travis

And the pull request that modified the original paperless so it would build them: https://github.com/the-paperless-project/paperless/pull/674

Personally I use those images in a multi-arch k8s (k3s.io) cluster that has arm64, arm32v7 and x86_64 in it. The underlying storage is shared with NFS. So far I have not had any issues, and I've been doing that for a year, using it for my personal stuff (albeit I think I did a somewhat uncommon migration to postgres, some of that manually because I had some weird issues with the django migration .. but that's something for another post).

I think I'll give building multi-arch paperless-ng images a go locally and see if I can create a merge request.

I would vote for using travis-ci and some docker repository, doesn't really matter which, it's just one deployment key away...

Any thoughts?

jonaswinkler commented 3 years ago

If you get this running I'd be happy to merge that and start building images for multiple architectures.

Here are a couple notes about the current setup, which works, but I'm not particularly happy about it.

Finally,

MarkSchmitt commented 3 years ago

@jonaswinkler I could use some help getting the build (make-release.sh) to run in a docker container. I've tried following the bare-metal guide on https://paperless-ng.readthedocs.io/en/latest/setup.html#overview-of-paperless-ng and built a docker build container based on debian buster (got the same error as quoted below) and then switched to using the Dockerfile from docker/local/Dockerfile and tinkering with it a bit. I also had to modify the Pipfile from python_version = "3.6" to use 3.7 instead. I'm just not quite sure I'm doing this correctly ... it feels like I've misunderstood some crucial part of the documentation.

This is my dockerfile:

FROM python:3.7-slim

#Dependencies
RUN apt-get update \
  && apt-get -y --no-install-recommends install \
                build-essential \
                curl \
                ghostscript \
                gnupg \
                icc-profiles-free \
                imagemagick \
                libatlas-base-dev \
                liblept5 \
                libmagic-dev \
                libpoppler-cpp-dev \
                libpq-dev \
                libqpdf-dev \
                libxml2 \
                optipng \
                pngquant \
                qpdf \
                sudo \
                tesseract-ocr \
                tesseract-ocr-eng \
                tesseract-ocr-deu \
                tesseract-ocr-fra \
                tesseract-ocr-ita \
                tesseract-ocr-spa \
                tzdata \
                unpaper \
                zlib1g \
                git \
                pipenv 

ENTRYPOINT ["/bin/bash"]

And this is the error I see:

mo@kenshin ~/src/paperless-ng/scripts/build-container $ docker run -it --rm -v /home/mo:/home/mo -v /var/run/docker.sock:/var/run/docker.sock paperless-ng-buildcontainer:latest 
root@54ba2f259fa5:/# cd /home/mo/src/paperless-ng/
root@54ba2f259fa5:/home/mo/src/paperless-ng# cd scripts/
root@54ba2f259fa5:/home/mo/src/paperless-ng/scripts# ./make-release.sh 0.99
+ VERSION=0.99
+ '[' -z 0.99 ']'
++ git rev-parse --show-toplevel
+ PAPERLESS_ROOT=/home/mo/src/paperless-ng
+ PAPERLESS_DIST=/home/mo/src/paperless-ng/dist
+ PAPERLESS_DIST_APP=/home/mo/src/paperless-ng/dist/paperless-ng
+ PAPERLESS_DIST_DOCKERFILES=/home/mo/src/paperless-ng/dist/paperless-ng-dockerfiles
+ '[' -d /home/mo/src/paperless-ng/dist ']'
+ echo 'Removing /home/mo/src/paperless-ng/dist'
Removing /home/mo/src/paperless-ng/dist
+ rm /home/mo/src/paperless-ng/dist -r
+ mkdir /home/mo/src/paperless-ng/dist
+ mkdir /home/mo/src/paperless-ng/dist/paperless-ng
+ mkdir /home/mo/src/paperless-ng/dist/paperless-ng/docker
+ mkdir /home/mo/src/paperless-ng/dist/paperless-ng/scripts
+ mkdir /home/mo/src/paperless-ng/dist/paperless-ng-dockerfiles
+ cd /home/mo/src/paperless-ng
+ pipenv clean
Creating a virtualenv for this project…
Using /usr/local/bin/python3.7m (3.7.9) to create virtualenv…
⠋Running virtualenv with interpreter /usr/local/bin/python3.7m
Using base prefix '/usr/local'
/usr/lib/python3/dist-packages/virtualenv.py:1090: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
New python executable in /root/.local/share/virtualenvs/paperless-ng-3eJdw5e9/bin/python3.7m
Also creating executable in /root/.local/share/virtualenvs/paperless-ng-3eJdw5e9/bin/python
Installing setuptools, pkg_resources, pip, wheel...done.

Virtualenv location: /root/.local/share/virtualenvs/paperless-ng-3eJdw5e9
Locking [dev-packages] dependencies…
Locking [packages] dependencies…
Updated Pipfile.lock (8aa68f)!
Uninstalling 'pkg-resources'…
+ pipenv install --dev
Installing dependencies from Pipfile.lock (8aa68f)…
An error occurred while installing img2pdf==0.4.0! Will try again.
An error occurred while installing inotify-simple==1.3.5! Will try again.
An error occurred while installing langdetect==1.0.8! Will try again.
An error occurred while installing pdftotext==2.1.5! Will try again.
An error occurred while installing python-levenshtein==0.12.0! Will try again.
An error occurred while installing docopt==0.6.2! Will try again.
An error occurred while installing pytest-env==0.6.2! Will try again.
An error occurred while installing pytest-sugar==0.9.4! Will try again.
An error occurred while installing termcolor==1.1.0! Will try again.
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 115/115 — 00:00:22
Installing initially–failed dependencies…
Success installing img2pdf==0.4.0!▉▉▉ 0/9 — 00:00:00
Success installing inotify-simple==1.3.5! — 00:00:08
Success installing langdetect==1.0.8! 2/9 — 00:00:08
Looking in indexes: https://www.piwheels.org/simple8

ERROR: Could not find a version that satisfies the requirement pdftotext==2.1.5
ERROR: No matching distribution found for pdftotext==2.1.5

  ☤  ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 3/9 — 00:00:08
root@54ba2f259fa5:/home/mo/src/paperless-ng/scripts# 

I'm not quite sure what I'm doing wrong. I'm afraid I don't have an Ubuntu or Debian system available for testing .. I'm on a Gentoo system. I'm used to creating docker containers for this kind of trouble ... so I'm totally puzzled as to what went wrong here. Any ideas? The excerpt above is from a run of make-release.sh within that container (I also added set -x to see a bit more of what's going on).

I can push that and have it run in travis-ci if that helps ... I was just hoping this might suffice.

Any ideas?

jonaswinkler commented 3 years ago

I am not exactly sure why that happens. There's not that much information to go on. That version of pdftotext definitely exists.

Looking in indexes: https://www.piwheels.org/simple8

Seems strange. That should also contain the primary PyPI archive. piwheels only contains armv7 wheels. Might be the reason for the errors.

Anyway, I've created a new branch at https://github.com/jonaswinkler/paperless-ng/tree/travis-multiarch-builds. In this branch, I've updated the Dockerfile so that it builds everything from fresh checkout to working image with a single "docker build ." command. This replaces the release script. This might help, since this is pretty much how every other sane docker build works. (And it is the same as OG paperless, which should help you reuse more of your existing configuration files.)

jonaswinkler commented 3 years ago

The main reason for that release script was so that I don't have to include the front end build in the Dockerfile. That causes the image build on Raspberry Pi to be pretty much impossible. But if we can get this working on travis with qemu, there's no need for that anymore.

MarkSchmitt commented 3 years ago

@jonaswinkler When you said, that your raspberry Pi builds take a day I was thinking arm32v7 ... but it's probably aarch64. I'm seeing really bad build times too.

You can see some metrics here: https://travis-ci.org/github/MarkSchmitt/paperless/builds/750913744 The arm32v7 build takes about 6 times as long as the x86 one.

The aarch64 build is not finishing in time. Not with qemu, not running natively. (see previous builds on travis - and yes, I've extended the wait time).

To figure out how to speed things up, I'm running the build now locally on a jetson tx2 to get a feel for the time it takes on a sort-of-fast aarch64 processor. I see it compiling lots of things, but it did finish in 37m (including docker pulls). Not sure where you got your 24hour build time from then ... that tx2 should be a lot faster than a rpi2 or 3, but not that much faster than a rpi4. It has very fast onboard flash memory though. That also means, that the build on travis-ci might have been feasible, if they just had a faster aarch64 processor ... hmmm.

I've also been googling for some aarch64 wheel(house) .. but with no luck. So far I've found some instructions on how to build one. Maybe that's the solution here .. to actually build a wheelhouse just for aarch64 so we can build paperless-ng with it.

Not sure, I'll keep on tinkering. Edit: I'll try to use the arm64-graviton build machines they have on travis-ci.com ... I don't fully understand what they're doing with switching from .org to .com .. but for some marketingish reason, they only offer the gravitons on the .com variety ... will try to do that later.

jonaswinkler commented 3 years ago

@jonaswinkler When you said, that your raspberry Pi builds take a day I was thinking arm32v7 ... but it's probably aarch64. I'm seeing really bad build times too.

That was before I moved from a python 3.8 to python 3.7 base image. On python 3.8, the wheels for numpy and scipy had to be built from source, since no wheels are available for armv7 and python 3.8. The current image builds fine on Raspberry Pi.
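Whether a pre-built wheel exists for a given interpreter and CPU can be read off the wheel filename, which encodes a Python tag, an ABI tag and a platform tag. The following is only a minimal sketch of that check: the filenames in AVAILABLE are illustrative examples, not a real index listing, and the helper names are made up for this illustration.

```python
from typing import Iterable, Tuple


def parse_wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename.

    Wheel filenames follow name-version[-build]-python-abi-platform.whl,
    so the last three dash-separated fields are the tags.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename}")
    return parts[-3], parts[-2], parts[-1]


def has_prebuilt_wheel(filenames: Iterable[str], python_tag: str, machine: str) -> bool:
    """True if any wheel matches the interpreter tag and the CPU architecture.

    Pure-Python wheels (platform tag "any") are deliberately ignored here,
    because the packages in question (numpy, scipy) ship compiled code.
    """
    for name in filenames:
        py, _abi, plat = parse_wheel_tags(name)
        py_ok = python_tag in py.split(".")
        plat_ok = any(p.endswith("_" + machine) for p in plat.split("."))
        if py_ok and plat_ok:
            return True
    return False


# Illustrative filenames only (assumed, not taken from a real index).
AVAILABLE = [
    "numpy-1.19.4-cp37-cp37m-linux_armv7l.whl",
    "numpy-1.19.4-cp38-cp38-manylinux1_x86_64.whl",
]

print(has_prebuilt_wheel(AVAILABLE, "cp37", "armv7l"))
print(has_prebuilt_wheel(AVAILABLE, "cp38", "armv7l"))
```

With this sample list the Python 3.7 lookup succeeds for armv7l while the Python 3.8 lookup does not, which is the situation where pip falls back to compiling from source.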

The aarch64 build is not finishing in time. Not with qemu, not running natively. (see previous builds on travis - and yes, I've extended the wait time).

Probably due to the above mentioned issue about no wheels being available, this time for aarch64. I'm not sure what this arm64-graviton is, but I suspect it won't change much.

I'd say the best way to go about this is to wait until wheels are available for aarch64, see https://blog.piwheels.org/raspberry-pi-os-64-bit-aarch64/ and https://github.com/piwheels/piwheels/issues/220. All that is still relatively new.

Even with the armv7 image, we already have a lot of users covered.

MarkSchmitt commented 3 years ago

That arm64-graviton is an AWS-based, custom-built arm64 system. It's supposed to be quite fast ... if that's the case, we might be able to finish the build in time for aarch64. I'll give it a try, it might be usable :shrug:

There is also one more thing I'd like to finish before creating a pull request .. I'd like to select the correct arm32v7 image for node:15 and python:3.7-slim automatically. I'm replacing the image manually with sed, because docker cannot be told to pull images for the wrong architecture (there is this buildx system, which is supposed to fix all this .. but I never got that working on linux properly, I think it's still in alpha state). in any case .. with "docker manifest inspect node:15" we can pull the json of the manifest and then parse that to get the proper sha256 for the arch and the image we want ... just need to spend a bit of time with jq (a json parser) to get the syntax right and see where we can get jq from during build :)
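The digest lookup described here can also be sketched without jq. This is a rough illustration, assuming the JSON printed by "docker manifest inspect" is a manifest list whose entries carry a "digest" and a "platform" object with "os", "architecture" and an optional "variant". The SAMPLE document and its digests are placeholders, not real values for node:15, and select_digest is a hypothetical helper name.

```python
import json
from typing import Optional


def select_digest(
    manifest_list: dict,
    architecture: str,
    variant: Optional[str] = None,
    os_name: str = "linux",
) -> Optional[str]:
    """Return the digest of the first entry matching os/architecture/variant."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") != os_name:
            continue
        if platform.get("architecture") != architecture:
            continue
        # Only compare the variant when the caller asked for a specific one.
        if variant is not None and platform.get("variant") != variant:
            continue
        return entry.get("digest")
    return None


# Placeholder manifest list; the digests are dummy values for illustration.
SAMPLE = json.loads(
    """
    {
      "schemaVersion": 2,
      "manifests": [
        {"digest": "sha256:%s",
         "platform": {"architecture": "amd64", "os": "linux"}},
        {"digest": "sha256:%s",
         "platform": {"architecture": "arm", "os": "linux", "variant": "v7"}},
        {"digest": "sha256:%s",
         "platform": {"architecture": "arm64", "os": "linux", "variant": "v8"}}
      ]
    }
    """
    % ("a" * 64, "b" * 64, "c" * 64)
)

print(select_digest(SAMPLE, "arm", "v7"))
```

In a build script, the SAMPLE dictionary would be replaced by the parsed output of "docker manifest inspect node:15", and the returned digest appended to the image name as node@sha256:... in the FROM line.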

obbardc commented 3 years ago

fwiw github actions can build docker containers on new commits or tags without using external CI services; see https://github.com/bubuntux/nordvpn/blob/master/.github/workflows/deploy.yml

also buildah/makisu can build containers without needing docker

jonaswinkler commented 3 years ago

That looks really clean. Thanks for the heads-up.

MarkSchmitt commented 3 years ago

So, I managed to have it build for all three major linux docker archs: x86_64, aarch64 and arm32v7. Only the latter with qemu, the rest natively: https://travis-ci.com/github/MarkSchmitt/paperless/builds/210366873

That graviton2 arm64 platform seems to be really fast. It was just a bit tricky to use it ... it required additional parameters that were rather badly documented.

I've also extended the build so it'll automatically select the proper arm32v7 image when using qemu.

The github actions way seems interesting, but I think there're a couple of problems. I don't think it can be done easily ... right now it will definitely not work until precompiled aarch64 wheels are available. Also not using docker might cause some issues with the build ... I remember that we had some transition problems when we switched build systems (from using docker directly to some docker builder).

My recommendation is this:

  1. we integrate the travis-ci build (which we know works) - I can get a pull request ready for that soonish (today I would hope, there's not much work left, but I don't have much free time today)
  2. see if we can get github actions to do the same job (possibly postpone this until the wheels for aarch64 are ready)
  3. when github actions are proven to work, replace travis-ci

For users of the image, a switch from travis to github actions would not be visible, so I think we could start with that solution and increment on that.

jonaswinkler commented 3 years ago

Travis-ci.com has some new credit system for the 'Free' plan, and these credits get used up rather quickly.

Need to check if they have something for OSS projects.

ghaberek commented 3 years ago

Travis-ci.com has some new credit system for the 'Free' plan, and these credits get used up rather quickly.

Need to check if they have something for OSS projects.

The travis-ci.com plans page states this:

Free for Open Source Free for open source. We love the Open Source Community, and to show how much we love it, upon validated request placed with our Support Team you may receive free OSS credits for your public builds.

So it seems that that original 10,000 credit limit on the free plan can be increased, and you just need to ask for them.

MarkSchmitt commented 3 years ago

Oh well, fuck me. Here's a good read about what happened to travis-ci. https://www.jeffgeerling.com/blog/2020/travis-cis-new-pricing-plan-threw-wrench-my-open-source-works

I hadn't realized they were going into monetization, it didn't look like that in the beta program I was in. That's really a bummer!

If not ... github actions + perhaps self hosted runner on a native arm64 doing the arm64 build. I might be able to sponsor one, if that can be acceptable and trusted .. if my dark tendencies take over, I could inject malicious stuff. hmpf.

SaraSmiseth commented 3 years ago

Well docker buildx works fine with github actions. I use it for my builds in one of my projects and use the generated images on a raspberry pi.

See this for an example how I build with buildx and github actions.

MarkSchmitt commented 3 years ago

cool, I didn't know docker buildx was working on github actions. the issue with slow aarch64 builds due to missing precompiled wheels remains. But perhaps that's not as big of a deal ... github actions might have longer timeouts (360 minutes is the default, see https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes ) and perhaps beefier x86 boxes underneath. microsoft wants to push this ... well, until they decide to monetize it.

so .. let's try this. and if it's still too slow .. well.. then maybe we need to build our own precompiled wheel repository ...

jonaswinkler commented 3 years ago

then maybe we need to build our own precompiled wheel repository ...

PiWheels is so popular, I'd say that they will eventually start doing aarch64 wheels as well.

MarkSchmitt commented 3 years ago

@SaraSmiseth cool. that worked out of the box! I did remove the separate build step, because it looked like not everything was properly cached and during the push step, it was rebuilding parts of it. is there any particular reason, why you would first build it without push and then later with?

Results of a github action build can be found here: https://hub.docker.com/r/moztr/paperless-ng/tags?page=1&ordering=last_updated And here's the pipeline for it: https://github.com/MarkSchmitt/paperless/runs/1633466350?check_suite_focus=true

It took about 1h40m, not sure how stable that is, needs more test runs.

I'll try to get a condition in for only building on master, dev and ng-* branches/tags..
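The condition itself is small. Here is a minimal sketch of the ref filter, assuming the workflow sees full Git refs such as refs/heads/dev or refs/tags/ng-... ; the pattern list and the function name are assumptions made for this illustration, not the project's actual configuration.

```python
from fnmatch import fnmatchcase

# Assumed set of refs that should trigger an image build.
BUILD_PATTERNS = (
    "refs/heads/master",
    "refs/heads/dev",
    "refs/heads/ng-*",
    "refs/tags/ng-*",
)


def should_build_image(ref: str) -> bool:
    """True if the Git ref is one of the branches/tags we build images for."""
    # fnmatchcase keeps the comparison case-sensitive on every platform.
    return any(fnmatchcase(ref, pattern) for pattern in BUILD_PATTERNS)


for ref in ("refs/heads/master", "refs/tags/ng-0.9.9", "refs/heads/feature/ui"):
    print(ref, should_build_image(ref))
```

Note that "*" in fnmatch also matches "/", so ng-* would accept a ref like refs/heads/ng-foo/bar; that is probably harmless here but worth knowing.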

SaraSmiseth commented 3 years ago

Nice. There is no particular reason to first build and then push later. I just followed the example in the documentation. In my case everything is properly cached and the second step really just pushes the image to docker hub.

jonaswinkler commented 3 years ago

I just gave the armv7 image a test spin and it seems to be working.

MarkSchmitt commented 3 years ago

I tried it on aarch64, looking good. I'm having some trouble with pytest though. Haven't had time to understand yet, what travis is doing differently. One of the errors I see is, that pytest does not accept the "-n auto" parameter. See https://github.com/MarkSchmitt/paperless/runs/1636277695?check_suite_focus=true I also had to manually install pytest with pip. I think that was setup automatically on travis, so I guess some plugin is missing.

jonaswinkler commented 3 years ago

Use pipenv lock --dev to also include the dependencies for development. Without this, it only installs the things that paperless requires to run, and the test tooling isn't part of that.

-n is for executing tests in parallel. This is pretty good, since some of the tests of the consumption folder watcher are pretty long.
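The -n option comes from the pytest-xdist plugin, which is why a plain pytest install rejects it. A small sketch of a guard, assuming one wants the test command to degrade to a serial run when the plugin is absent (pytest_args is a hypothetical helper, not part of the repository):

```python
import importlib.util
from typing import List, Optional


def pytest_args(parallel: Optional[bool] = None) -> List[str]:
    """Build the pytest command line, adding "-n auto" only if usable.

    parallel=None means: detect whether pytest-xdist (module "xdist")
    is importable in the current environment.
    """
    if parallel is None:
        parallel = importlib.util.find_spec("xdist") is not None
    args = ["pytest"]
    if parallel:
        # "-n auto" lets pytest-xdist pick one worker per available CPU.
        args += ["-n", "auto"]
    return args


print(pytest_args(parallel=True))
print(pytest_args(parallel=False))
```

In CI it is arguably better to install the development dependencies and fail loudly than to fall back silently, so this guard is mostly useful for local runs.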

You could also do a pipenv install --system --dev --ignore-pipfile to skip the intermediate requirements.txt.

MarkSchmitt commented 3 years ago

Thx jonas, didn't have much time, so this took a bit .. I think I managed to get the tests working with a python build matrix for 3.6, 3.7 and 3.8. https://github.com/MarkSchmitt/paperless/runs/1659197973?check_suite_focus=true

I just need to figure out, how to run the tests first and the docker build step afterwards. :)

jonaswinkler commented 3 years ago

maybe this?

https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idneeds

MarkSchmitt commented 3 years ago

maybe this?

https://docs.github.com/en/free-pro-team@latest/actions/reference/workflow-syntax-for-github-actions#jobsjob_idneeds

yeah, that looks good https://github.com/MarkSchmitt/paperless/actions/runs/468139954
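The needs keyword turns the workflow into a dependency graph: a job starts only after every job it lists has succeeded. As a rough model of that ordering (not of GitHub's scheduler), the sketch below sorts a made-up job table with the standard library; the job names are assumptions, and graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library since Python 3.9
from typing import Dict, List, Set

# Hypothetical jobs: each key maps to the set of jobs it "needs".
JOBS: Dict[str, Set[str]] = {
    "tests": set(),
    "frontend": set(),
    "docker-build": {"tests", "frontend"},
}


def run_order(jobs: Dict[str, Set[str]]) -> List[str]:
    """Return one valid execution order in which prerequisites come first."""
    return list(TopologicalSorter(jobs).static_order())


order = run_order(JOBS)
print(order)
```

Jobs without a dependency between them (tests and frontend here) may appear in either order, which mirrors the fact that they can run in parallel.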

jonaswinkler commented 3 years ago

Awesome. Can I get an updated PR? I'd like to build the next release with that.

Also, since the majority of the build time is installing python dependencies, I'll look into using something like this, so that the hard stuff is cached between builds. Need to reorganize the Dockerfile for that, so that updates to dependencies other than numpy and friends don't cause this layer to update.

MarkSchmitt commented 3 years ago

Awesome. Can I get an updated PR? I'd like to build the next release with that.

working on it, I still have trouble installing the sphinx dependencies properly. if I do it manually with pip, it works: https://github.com/MarkSchmitt/paperless/runs/1661127479?check_suite_focus=true#step:4:1

but using pipenv install like we currently do in travis doesn't seem to .. I'm not sure why: https://github.com/MarkSchmitt/paperless/runs/1661191801?check_suite_focus=true#step:7:7

we also need to exclude the docker build steps from all non-releasy branches - but I assume you want to have the other stuff, tests, documentation and frontend running on all other commits.

jonaswinkler commented 3 years ago

Actually, docs are built by readthedocs on push, so this isn't really necessary, I guess. And they report failures as well.

jonaswinkler commented 3 years ago

@MarkSchmitt Mind if I take your current progress and try to make it work?

MarkSchmitt commented 3 years ago

@jonaswinkler not at all!

jonaswinkler commented 3 years ago

Getting there.

tido- commented 3 years ago

I am a bit puzzled, over at https://hub.docker.com/r/jonaswinkler/paperless-ng/ I only find OS linux/amd64 (264 MB). However, reading this thread I came along: "However, the resulting archives build fine and reasonably fast on RPi."

In conclusion:

Is this correct as of today?

jonaswinkler commented 3 years ago

I'm working on it.

Do armhf binaries run on aarch64? Similar to how i386 binaries are able to run on amd64 hardware?

tido- commented 3 years ago

I remember doing a 'bare metal' install of papermerge on my Rock Pi 4B. I had to compile this and that, but with a good heat sink it was possible. And yes, I had to install some additional debs on Debian. However, if you know which ones, you just enter the command from the documentation and let it run. So, to begin with, I guess a nice "how to" would be fine for aarch64 - better than nothing. Could we offer those on Google Drive pre-compiled for the time being?

Do armhf binaries run on aarch64? I don't know. I found some results from 2016 which stated that it works.

I could test it on RPi2 and I have some other ARM boards with better SoC's.

jonaswinkler commented 3 years ago

I'll do a test run of these github actions over night, maybe we'll have some aarch64 binaries tomorrow.

jonaswinkler commented 3 years ago

The github actions workflow is now on the dev branch and happily testing and building images.

Thanks to everyone involved for providing input for this! Thanks especially to @MarkSchmitt, even though much of his work was in vain due to Travis CI setting some ridiculous restrictions on open source projects.