kar-pev opened 10 months ago
Hi @kar-pev,
You'll need to make a complete test program and a sample PDF before I can reproduce your problem.
I tried here with this benchmark:
#!/usr/bin/python3

import sys
import pyvips

image = pyvips.Image.pdfload(sys.argv[1])
for i in range(image.get("n-pages")):
    # load each page to fill a 2048 x 2048 box, keeping the first three bands
    page = pyvips.Image.thumbnail(sys.argv[1] + f"[page={i}]", 2048)[:3]
    page = page.gravity("centre", 2048, 2048, extend="white")
    data = page.write_to_buffer(".png")
    print(f"rendered page {i} as {len(data)} bytes of PNG")
Don't use thumbnail_image, it's only for emergencies. Instead, load directly with thumbnail: it'll calculate a DPI for you and render at the correct size immediately.
With this PDF I see:
$ /usr/bin/time -f %M:%e ./try297.py ~/pics/nipguide.pdf
rendered page 0 as 60533 bytes of PNG
rendered page 1 as 17417 bytes of PNG
rendered page 2 as 170833 bytes of PNG
rendered page 3 as 208890 bytes of PNG
rendered page 4 as 162163 bytes of PNG
rendered page 5 as 17625 bytes of PNG
rendered page 6 as 70013 bytes of PNG
rendered page 7 as 17891 bytes of PNG
rendered page 8 as 200018 bytes of PNG
rendered page 9 as 27530 bytes of PNG
rendered page 10 as 881459 bytes of PNG
rendered page 11 as 857038 bytes of PNG
rendered page 12 as 658160 bytes of PNG
rendered page 13 as 726185 bytes of PNG
rendered page 14 as 851090 bytes of PNG
rendered page 15 as 494613 bytes of PNG
rendered page 16 as 496943 bytes of PNG
rendered page 17 as 471977 bytes of PNG
rendered page 18 as 387558 bytes of PNG
rendered page 19 as 581401 bytes of PNG
rendered page 20 as 433491 bytes of PNG
rendered page 21 as 499961 bytes of PNG
rendered page 22 as 557588 bytes of PNG
rendered page 23 as 30114 bytes of PNG
rendered page 24 as 413044 bytes of PNG
rendered page 25 as 548221 bytes of PNG
rendered page 26 as 618224 bytes of PNG
rendered page 27 as 470931 bytes of PNG
rendered page 28 as 503389 bytes of PNG
rendered page 29 as 534815 bytes of PNG
rendered page 30 as 410243 bytes of PNG
rendered page 31 as 112163 bytes of PNG
rendered page 32 as 384436 bytes of PNG
rendered page 33 as 443909 bytes of PNG
rendered page 34 as 490428 bytes of PNG
rendered page 35 as 450758 bytes of PNG
rendered page 36 as 450702 bytes of PNG
rendered page 37 as 316399 bytes of PNG
rendered page 38 as 354259 bytes of PNG
rendered page 39 as 387184 bytes of PNG
rendered page 40 as 254847 bytes of PNG
rendered page 41 as 426819 bytes of PNG
rendered page 42 as 186205 bytes of PNG
rendered page 43 as 400066 bytes of PNG
rendered page 44 as 380871 bytes of PNG
rendered page 45 as 388221 bytes of PNG
rendered page 46 as 277809 bytes of PNG
rendered page 47 as 399227 bytes of PNG
rendered page 48 as 212261 bytes of PNG
rendered page 49 as 366544 bytes of PNG
rendered page 50 as 467029 bytes of PNG
rendered page 51 as 518713 bytes of PNG
rendered page 52 as 420412 bytes of PNG
rendered page 53 as 206056 bytes of PNG
rendered page 54 as 86764 bytes of PNG
rendered page 55 as 27297 bytes of PNG
rendered page 56 as 379708 bytes of PNG
rendered page 57 as 164827 bytes of PNG
528840:8.94
So 58 pages in 9s, with a peak memory use of about 530MB.
Rather than saving to a buffer and then uploading, you can upload directly to S3 with page.write_to_target(); there's some sample code in examples/. It might save a little time and memory.
Thx, maybe the problem was with not using pdfload too. I'll try it myself and write feedback soon
pdfload won't make any difference, I was just trying to be clear. If you use new_from_file it'll work for any multi-page format, eg. GIF etc.
Actually, about thumbnail: as I see in the docs it's static thumbnail(filename, ...). It takes a filename, so I'd need the image on disk and have to provide its path, whereas with thumbnail_image I could use an object. Is there a way to use an object with regular thumbnail?
There's thumbnail_buffer() to thumbnail image data held as a string-like object, if that's what you mean.
Don't resize, just load at the correct size with thumbnail, then do any padding with gravity. You'll get better quality, lower memory use, and it'll be quicker too.
I've tried it and it works, max memory usage is program size + pdf size, which seems awesome. Probably the issue was exactly with trying to save each page as an object in memory. And speed increased too, even without target write. Thanks a lot
Great!
(tangent, but it wasn't a memory leak -- that implies a memory reference has been lost due to a bug -- you were just seeing unexpectedly high memuse due to your program design)
Sorry, I hadn't tried it earlier, but I've run your code in a container and saw 2+ GB of memory use. I think it could be a docker daemon problem, but I'm not sure about that, and I absolutely don't know how to fix it from pyvips. Memuse grows exactly with the write_to_buffer func; without it the memuse value is fine for the task
Maybe your container isn't using poppler to load PDFs, but is falling back to ImageMagick? But that's just a guess, you need to share complete examples I can try before I can help.
docker-compose.yaml file:
services:
  servise_name:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: name
    image: name
Dockerfile:
FROM python:3.9-slim

RUN apt update && apt install -y --no-install-recommends \
    libmupdf-dev \
    libfreetype6-dev \
    libjpeg-dev \
    libglib2.0-0 \
    libgl1-mesa-glx \
    libpq-dev \
    libvips-dev \
    poppler-utils \
    gcc
WORKDIR /app
COPY . .
RUN python3 -m pip install --no-cache-dir -r requirements.txt
CMD ["python3", "main.py"]
Having only pyvips in requirements will be enough for this example
main.py file:
import pyvips

def main():
    image = pyvips.Image.pdfload("file.pdf")
    for i in range(image.get("n-pages")):
        # load to fill a 2048 x 2048 box
        page = pyvips.Image.thumbnail("file.pdf" + f"[page={i}]", 2048)[:3]
        page = page.gravity("centre", 2048, 2048, extend="white")
        data = page.write_to_buffer(".png")
        print(f"rendered page {i} as {len(data)} bytes of PNG")

if __name__ == "__main__":
    main()
That's all I'm using to get 2 GB+ of memuse (plus the pdf file)
There's a lot of stuff you don't need in that dockerfile, I'd just have:
FROM python:3.9-slim

RUN apt-get update \
    && apt-get install -y \
        build-essential \
        pkg-config
RUN apt-get -y install --no-install-recommends libvips-dev
RUN pip install pyvips
I made a test dir:
https://github.com/jcupitt/docker-builds/tree/master/pyvips-python3.9
If I run:
docker build -t pyvips-python3.9 .
docker run -it --rm -v $PWD:/data pyvips-python3.9 ./main.py nipguide.pdf
I see a fairly steady c. 400mb in top.
I'm copying your code (with CMD ["python", "main.py"] at the end of the Dockerfile) and getting 1.5 GB+. Could you please try with my file?
Yes, I see about 1.5g with that file too. It has huge image overlays on every page, so I think that's to be expected. It's just a very heavy PDF.
But the images + pdf use much less memory, so why is the rest of the used memory still reserved and not freed? It seems like some cache that hasn't been cleared, because memuse stays at ~1 GB after the process
I would guess it's memory fragmentation from handling those huge overlays.
How are you measuring memory use? RES in top is probably the most useful.
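As a quick sketch, you can also read the resident set size non-interactively with ps (finding the PID by name, as below, assumes the process was started as main.py):

```shell
# RES (resident set size) in kilobytes for the current shell;
# for the python process you'd substitute its PID,
# e.g. ps -o rss= -p "$(pgrep -f main.py)"
ps -o rss= -p $$
```

Note this is resident memory as the kernel sees it, so it includes memory the allocator has kept around but not returned to the OS.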
I'm using docker stats. Could I clear or free some of this allocated memory? I'm trying to reduce memory limits for container
I guess you could use the CLI instead:
#!/bin/bash

pdf=$1
n_pages=$(vipsheader -f n-pages $pdf)
for ((i = 0; i < n_pages; i++)); do
    echo "processing page $i ..."
    vipsthumbnail $pdf[page=$i] --size 2048 -o t1.v
    vips extract_band t1.v t2.v 0 --n 3
    vips gravity t2.v page-$i.png centre 2048 2048 --extend white
done
rm t1.v t2.v
It's a bit slower though, and you'll see a lot of block IO.
Another option would be to use a malloc that avoids fragmentation, like jemalloc.
But that's harder to set up.
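For reference, a sketch of the jemalloc route in a Debian-based image; the package name and .so path below are assumptions, they vary by distro and architecture:

```dockerfile
# install jemalloc (Debian package; the path below is for amd64)
RUN apt-get update && apt-get install -y --no-install-recommends libjemalloc2

# preload jemalloc for everything run in the container
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```

With LD_PRELOAD set, malloc/free calls in python and libvips go through jemalloc with no code changes.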
Ok, thanks, I'll try one of those
I'm trying to convert a PDF to PNG, do some stuff with the image and then upload the result to S3, which requires saving the image in a buffer or as a file. Everything works fine except the saving part. During that part, memory usage starts to grow too much (like 1.5 GB). I used
pyvips.cache_set_max(0) and pyvips.cache_set_max_mem(0)
but nothing helps. Experimentally I understood that it's all caused by the conversion from pdf to image (maybe some metadata or smth like this).
Main func:
resizing func:
saving func:
The pdf is about 120 MB and consists of 132 pages. The byte string size is less than 1 MB (about 0.5). Result images are 2048x2048
docker container: pyvips==2.2.2 libvips-dev/stable 8.14.1-3+deb12u1 amd64
libvips42/stable 8.14.1-3+deb12u1 amd64
PS I also tried to use write_to_file instead of a buffer and then upload the file, but the result was approximately the same