Please help me test the new multiarch docker images

jonaswinkler commented 3 years ago

I've got the CI pipeline (see #151) pretty much ready, and it has successfully built docker images for amd64, armhf and aarch64.

The image is available on Docker Hub. For anyone interested, the workflow that produced these images is here: https://github.com/jonaswinkler/paperless-ng/actions/runs/476333808.

I don't have aarch64 hardware and would love to hear from people who do if this works. Feedback on the arm/v7 image is also welcome.

These images are based on the latest dev branch, which is identical to the current release plus a couple of bug fixes. As with all pre-release builds, though, I wouldn't advise running them against your actual database.

These images can be used with any of the docker-compose files in the docker/hub/ folder. Just replace the version, and pull.
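For anyone who prefers to script the "replace the version" step, here is a minimal sketch. It assumes the compose file references the image as `jonaswinkler/paperless-ng:<tag>`; that repository name and the `test-tag` value are assumptions for illustration, so substitute whatever the compose file and Docker Hub actually show.

```python
import re


def set_image_tag(compose_text: str, repository: str, new_tag: str) -> str:
    """Return compose_text with the tag of every `image: <repository>` line replaced."""
    pattern = re.compile(
        # Match only spaces/tabs around the value so line breaks are never consumed.
        r"^([ \t]*image:[ \t]*" + re.escape(repository) + r")(?::\S+)?[ \t]*$",
        re.MULTILINE,
    )
    return pattern.sub(lambda m: f"{m.group(1)}:{new_tag}", compose_text)


# Hypothetical compose snippet; the real files live in the docker/hub/ folder.
sample = (
    "services:\n"
    "  webserver:\n"
    "    image: jonaswinkler/paperless-ng:latest\n"
    "    restart: always\n"
)
updated = set_image_tag(sample, "jonaswinkler/paperless-ng", "test-tag")
print(updated)
```

After writing the result back to the compose file, `docker-compose pull` followed by `docker-compose up -d` fetches and starts the new image.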

Things I'd like to see tested:

Consume digital PDF documents with embedded text
Consume scanned PDF documents without embedded text
Consume JPG documents
Add some "Auto" matching metadata to documents and check whether the "Train the classifier" scheduled task executes successfully. You can schedule it to run immediately by going into the admin, editing that scheduled task, clicking Today / Now further down, and saving. To make sure that it's working, you should eventually see something like this in the logs with the filter set to DEBUG:
Try to find some documents with the full text search.

Thank you!

jonaswinkler commented 3 years ago

One thing I've already spotted with OCRmyPDF:

WARNING 2021-01-11 12:49:04,913 tesseract [tesseract] took too long to OCR - skipping

mvdkleijn commented 3 years ago

Nice! :-)

raspberry pi 4
ubuntu 20.04 for arm / raspberry pi
docker 19.03.14 CE
docker compose 1.25.0
documents stored on ssd

It starts, no obvious problems. Logging in works fine.

My workflow is using the paperless android app mostly. On occasion I get a pdf by email that I add, but not often.

An initial attempt to scan a document using the app results in a Python PIL related error on paperless-ng.

cannot import name '_imagingcms' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/__init__.py)

jonaswinkler commented 3 years ago

Thank you. Is that the 32bit or 64bit variant of ubuntu?

jonaswinkler commented 3 years ago

See https://github.com/python-pillow/Pillow/issues/5202
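To check whether a given image is affected by the import error reported above, something like the following can be run with the Python interpreter inside the container. This is only a sketch: `can_import` is a hypothetical helper name, and the check rests on `_imagingcms` being the compiled extension that backs `PIL.ImageCms`.

```python
import importlib


def can_import(module_name: str) -> bool:
    """Return True if the module imports cleanly, False if the import fails."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        # Also covers ModuleNotFoundError, e.g. when Pillow is not installed at all.
        return False
    return True


# False here would match the "cannot import name '_imagingcms'" error above.
print(can_import("PIL._imagingcms"))
```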

mvdkleijn commented 3 years ago

> Thank you. Is that the 32bit or 64bit variant of ubuntu?

64-bit

jonaswinkler commented 3 years ago

@mvdkleijn new build is up on the hub, does that resolve the issue?

niarbx commented 3 years ago

Hi Jonas,

I tested the new ARM64 image on a Raspberry Pi 4 (4 GB) running the latest 64-bit Raspbian (now called Raspberry Pi OS).

Consumed several scanned PDFs
Consumed digitally created PDFs
Created tags, correspondents and document types
Consumed JPGs with and without text

I encountered the following:

While consuming a PDF which consists of a big image, the following message was logged:
- ERROR Error while consuming document PDFWithImage.pdf: cannot import name '_imagingcms' from 'PIL' (/usr/local/lib/python3.7/site-packages/PIL/__init__.py)
An image with text threw the following warning: WARNING Error while getting DPI from image
- This doesn't seem to be a problem, the image was OCRed correctly
The classifier also works; training and auto-matching worked (although training didn't give accurate results because the training data set was too small).

By the way, I've been using an arm64 image in "production" for about a week now. I extended the Dockerfile with another stage that downloads the sources (so I don't have to check out the sources every time a new release is ready) and used python:3.9-slim as the base image. Works without any errors so far.

Best regards, Tobi

mvdkleijn commented 3 years ago

@jonaswinkler The latest image consumes the document just fine. I do get the same DPI warning that @niarbx got, but other than that it looks fine.

jonaswinkler commented 3 years ago

> An Image with Text threw the following Warning: WARNING Error while getting DPI from image

That shouldn't be a warning, I'll lower the severity. Some images have DPI information in their metadata, and paperless uses that. That's important for PDF generation (how big should the pages be?). If none is available, paperless will produce A4-sized PDF documents.
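To illustrate why the DPI matters for page size, here is a rough sketch of the arithmetic. It is not paperless's actual code: `page_size_points` is a hypothetical helper, and it assumes PDF page sizes are expressed in points (72 per inch) with an A4 fallback of 210 mm x 297 mm.

```python
from typing import Optional, Tuple

POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4

# A4 is 210 mm x 297 mm, converted here to PDF points.
A4_POINTS = (
    210 / MM_PER_INCH * POINTS_PER_INCH,
    297 / MM_PER_INCH * POINTS_PER_INCH,
)


def page_size_points(
    width_px: int, height_px: int, dpi: Optional[float] = None
) -> Tuple[float, float]:
    """Return (width, height) in points, falling back to A4 when no DPI is known."""
    if dpi is None or dpi <= 0:
        return A4_POINTS
    return (
        width_px / dpi * POINTS_PER_INCH,
        height_px / dpi * POINTS_PER_INCH,
    )


# A 2480 x 3508 px scan at 300 DPI comes out very close to A4.
print(page_size_points(2480, 3508, dpi=300))
# Without DPI metadata the sketch falls back to A4.
print(page_size_points(2480, 3508))
```

Without the DPI value, the pixel dimensions alone say nothing about the physical size, which is why a default is needed.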

> While consuming a PDF which consists of a big image, the following message was logged

Should be fixed in the image from a couple minutes ago

> Classifier also works, training and auto-matching worked (while training didn't have accurate results because of too small training data).

Thank you, good to know.

mvdkleijn commented 3 years ago

Consumes PDFs with and without embedded text just fine
Consumes JPGs with text just fine (did not test text-less JPGs)
Classifier, training and auto-matching works fine. Training accuracy was fairly high since I had about 30 similar documents for it to train on.
Tags are fine
Correspondents are fine
Document types are fine
User creation is fine (though permission management is somewhat non-trivial)

mannp commented 3 years ago

Testing the docker build with Unraid and consumed a couple of files with no problem.

I threw a selection of scanned PDFs at it (15) and I've lost the GUI; it's not reachable. There are no obvious errors in the log; in fact, it appears to be still consuming.

Only just found your NG version of paperless today so will take a better look at my config tomorrow to see if it needs tuning.

Cool NG version btw :)

jonaswinkler commented 3 years ago

What platform? It might take up all resources while consuming (this takes a long time on Pi), and the web server might not get enough cpu time to provide a response in time.

Consider the options TASK_WORKERS and THREADS_PER_WORKER (https://paperless-ng.readthedocs.io/en/latest/configuration.html#software-tweaks). The Pi 3 and Pi 4 have quad-core CPUs, so setting WORKERS=2 and THREADS=1 will always leave some resources available for other tasks.
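As a quick sanity check on those two values, the sketch below assumes the worst-case load is roughly workers times threads per worker. `leaves_headroom` is a hypothetical helper, not part of paperless; as far as I know, the linked docs list the settings with a PAPERLESS_ prefix.

```python
def leaves_headroom(cpu_cores: int, task_workers: int, threads_per_worker: int) -> bool:
    """Return True if the worst-case number of busy threads stays below the core count."""
    if min(cpu_cores, task_workers, threads_per_worker) < 1:
        raise ValueError("all arguments must be positive integers")
    return task_workers * threads_per_worker < cpu_cores


# The quad-core Pi example above: 2 workers x 1 thread.
print(leaves_headroom(4, 2, 1))  # True: two cores stay free
print(leaves_headroom(4, 2, 2))  # False: all four cores can be busy
```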

mannp commented 3 years ago

> What platform? It might take up all resources while consuming (this takes a long time on Pi), and the web server might not get enough cpu time to provide a response in time.
>
> Consider the options TASK_WORKERS and THREADS_PER_WORKER (https://paperless-ng.readthedocs.io/en/latest/configuration.html#software-tweaks). The Pi 3 and Pi 4 have quad-core CPUs, so setting WORKERS=2 and THREADS=1 will always leave some resources available for other tasks.
>
> See also https://paperless-ng.readthedocs.io/en/latest/setup.html#considerations-for-less-powerful-devices

Thanks for the info. The Unraid machine is a Xeon with 32 GB of memory running multiple Docker containers.

jonaswinkler commented 3 years ago

Uhm, yeah. That should not have any issues running this.

sisao commented 3 years ago

It's running on armv7 (Banana Pi M2U)

OS: Armbian (Ubuntu 20.04.1 LTS)
Kernel: Linux dms 5.9.14-sunxi #20.11.3 SMP Fri Dec 11 20:31:12 CET 2020 armv7l armv7l armv7l GNU/Linux
Docker: 19.03.12
docker-compose: 1.27.4

No errors so far. Consuming an email with an attachment works, consuming scanned PDFs works, full text search works, and training of the classifier starts and works.

jonaswinkler commented 3 years ago

Alright, thank you very much. Multi arch images are coming soon.

Please help me test the new multiarch docker images #322