libvips / pyvips

python binding for libvips using cffi

Memory leaking with pdf #450

Open kar-pev opened 10 months ago

kar-pev commented 10 months ago

I'm trying to convert a PDF to PNG, do some work with the image, and then upload the result to S3, which requires saving the image to a buffer or a file. Everything works fine except the saving part: during it, memory usage grows far too much (around 1.5 GB). I used pyvips.cache_set_max(0) and pyvips.cache_set_max_mem(0), but nothing helps. Experimentally I found that it's all caused by the conversion from PDF to image (maybe some metadata or something like that). Main func:

def image_conversion(
        self, first_page=0, last_page=None, dpi=600, expected_size=(2048, 2048)
    ):
        """
        Convert pdf page to image and save it
        """
        for page in range(self.first_page, self.last_page + 1):
            try:
                # convert pdf page to image and cut the alpha band
                img = pyvips.Image.new_from_file(
                    self.filename, page=page,
                    memory=True, dpi=dpi, fail=True
                )[:3]
            except Exception as ex:
                warning(f"Error on page {page}: {ex}")
                continue
            else:
                info(f"page {page} successfully converted")

            img, padding_w, padding_h = resize_with_padding(img, expected_size)

            # some useful stuff with saving values in class variables

            save_image_to_s3(img, os.path.join(
                self.data_dir, "images", f"{page}.png"), name=str(page)
            )
        save_json_to_s3(self.content, os.path.join(self.data_dir, "metafile.json"))
        self._remove_file()

resizing func:

from pyvips.enums import Extend

def resize_with_padding(img, expected_size):
    img = img.thumbnail_image(expected_size[0], height=expected_size[1])
    delta_width = expected_size[0] - img.width
    delta_height = expected_size[1] - img.height
    pad_width = delta_width // 2
    pad_height = delta_height // 2
    return (
        img.embed(
            pad_width,  # x
            pad_height,  # y
            expected_size[0],  # width
            expected_size[1],  # height
            extend=Extend.WHITE,
        ),
        pad_width,
        pad_height,
    )

saving func:

def save_image_to_s3(image, file_name, name=None):
    """
    Save image to S3
    """
    data = image.write_to_buffer(".png", bitdepth=8, strip=True, compression=9)
    s3object = s3.Object(get_var("S3_BUCKET"), file_name)
    s3object.put(Body=data, ContentType="image/png")

The PDF is about 120 MB and consists of 132 pages. The byte string size per page is less than 1 MB (about 0.5 MB). The result image size is 2048x2048.

docker container: pyvips==2.2.2 libvips-dev/stable 8.14.1-3+deb12u1 amd64
libvips42/stable 8.14.1-3+deb12u1 amd64

PS: I also tried using write_to_file instead of a buffer and then uploading the file, but the result was approximately the same.

jcupitt commented 10 months ago

Hi @kar-pev,

You'll need to make a complete test program and a sample PDF before I can reproduce your problem.

I tried here with this benchmark:

#!/usr/bin/python3

import sys
import pyvips

image = pyvips.Image.pdfload(sys.argv[1])
for i in range(image.get("n-pages")):
    # load to fill a 2048 x 2048 box
    page = pyvips.Image.thumbnail(sys.argv[1] + f"[page={i}]", 2048)[:3]
    page = page.gravity("centre", 2048, 2048, extend="white")
    data = page.write_to_buffer(".png")
    print(f"rendered as {len(data)} of PNG")

Don't use thumbnail_image, it's only for emergencies; instead, load directly with thumbnail. It'll calculate a DPI for you and render at the correct size immediately.

With this PDF I see:

$ /usr/bin/time -f %M:%e ./try297.py ~/pics/nipguide.pdf 
rendered page 0 as 60533 bytes of PNG
rendered page 1 as 17417 bytes of PNG
rendered page 2 as 170833 bytes of PNG
rendered page 3 as 208890 bytes of PNG
rendered page 4 as 162163 bytes of PNG
rendered page 5 as 17625 bytes of PNG
rendered page 6 as 70013 bytes of PNG
rendered page 7 as 17891 bytes of PNG
rendered page 8 as 200018 bytes of PNG
rendered page 9 as 27530 bytes of PNG
rendered page 10 as 881459 bytes of PNG
rendered page 11 as 857038 bytes of PNG
rendered page 12 as 658160 bytes of PNG
rendered page 13 as 726185 bytes of PNG
rendered page 14 as 851090 bytes of PNG
rendered page 15 as 494613 bytes of PNG
rendered page 16 as 496943 bytes of PNG
rendered page 17 as 471977 bytes of PNG
rendered page 18 as 387558 bytes of PNG
rendered page 19 as 581401 bytes of PNG
rendered page 20 as 433491 bytes of PNG
rendered page 21 as 499961 bytes of PNG
rendered page 22 as 557588 bytes of PNG
rendered page 23 as 30114 bytes of PNG
rendered page 24 as 413044 bytes of PNG
rendered page 25 as 548221 bytes of PNG
rendered page 26 as 618224 bytes of PNG
rendered page 27 as 470931 bytes of PNG
rendered page 28 as 503389 bytes of PNG
rendered page 29 as 534815 bytes of PNG
rendered page 30 as 410243 bytes of PNG
rendered page 31 as 112163 bytes of PNG
rendered page 32 as 384436 bytes of PNG
rendered page 33 as 443909 bytes of PNG
rendered page 34 as 490428 bytes of PNG
rendered page 35 as 450758 bytes of PNG
rendered page 36 as 450702 bytes of PNG
rendered page 37 as 316399 bytes of PNG
rendered page 38 as 354259 bytes of PNG
rendered page 39 as 387184 bytes of PNG
rendered page 40 as 254847 bytes of PNG
rendered page 41 as 426819 bytes of PNG
rendered page 42 as 186205 bytes of PNG
rendered page 43 as 400066 bytes of PNG
rendered page 44 as 380871 bytes of PNG
rendered page 45 as 388221 bytes of PNG
rendered page 46 as 277809 bytes of PNG
rendered page 47 as 399227 bytes of PNG
rendered page 48 as 212261 bytes of PNG
rendered page 49 as 366544 bytes of PNG
rendered page 50 as 467029 bytes of PNG
rendered page 51 as 518713 bytes of PNG
rendered page 52 as 420412 bytes of PNG
rendered page 53 as 206056 bytes of PNG
rendered page 54 as 86764 bytes of PNG
rendered page 55 as 27297 bytes of PNG
rendered page 56 as 379708 bytes of PNG
rendered page 57 as 164827 bytes of PNG
528840:8.94

So 57 pages in 9s, with a peak memory use of around 530 MB.

Rather than saving to a buffer and then uploading, you can upload directly to S3 with page.write_to_target(); there's some sample code in examples/. It might save a little time and memory.
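For instance, a minimal sketch of the target plumbing (the boto3 s3 resource, bucket and key are assumptions to match the earlier snippets, not code from this thread):

import pyvips

def upload_page_to_s3(page, bucket, key):
    # collect the encoded PNG chunks as libvips streams them out
    chunks = []

    def write_handler(chunk):
        chunks.append(bytes(chunk))
        return len(chunk)  # report how many bytes we consumed

    target = pyvips.TargetCustom()
    target.on_write(write_handler)
    page.write_to_target(target, ".png")

    # s3 here is the boto3 resource from the snippets above
    s3.Object(bucket, key).put(Body=b"".join(chunks), ContentType="image/png")

For genuinely constant memory you'd feed each chunk into an S3 multipart upload instead of joining them at the end; the sketch above just shows the target wiring.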

kar-pev commented 10 months ago

Thanks, maybe the problem was with not using pdfload too. I'll try it myself and write feedback soon.

jcupitt commented 10 months ago

pdfload won't make any difference, I was just trying to be clear. If you use new_from_file it'll work for any multi-page format, e.g. GIF etc.

kar-pev commented 10 months ago

Actually, about thumbnail: as I see in the docs, it's static thumbnail(filename, ...). It takes a filename, so I'd need the image as a file on disk and provide its path, whereas with thumbnail_image I could use an object. Is there a way to use an object with regular thumbnail?

jcupitt commented 10 months ago

There's thumbnail_buffer() to thumbnail image data held as a string-like object, if that's what you mean.
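A minimal sketch, assuming the PDF bytes are already in memory (the page=3 option string is just illustrative):

import pyvips

# thumbnail a PDF held in memory as bytes, no file on disk needed
with open("file.pdf", "rb") as f:
    data = f.read()

page = pyvips.Image.thumbnail_buffer(data, 2048, option_string="page=3")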

jcupitt commented 10 months ago

Don't resize, just load at the correct size with thumbnail, then do any padding with gravity. You'll get better quality, lower memory use, and it'll be quicker too.
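So the earlier resize_with_padding could collapse into something like this (a sketch; the helper name is mine):

import pyvips

def load_page_padded(filename, page, size=2048):
    # thumbnail computes the pdf render dpi itself, so the page arrives
    # at the right size instead of being rendered large and shrunk
    img = pyvips.Image.thumbnail(f"{filename}[page={page}]", size, height=size)[:3]
    pad_width = (size - img.width) // 2
    pad_height = (size - img.height) // 2
    return img.gravity("centre", size, size, extend="white"), pad_width, pad_height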

kar-pev commented 10 months ago

I've tried it and it works, max memory usage is program size + pdf size, which seems awesome. Probably the issue was exactly with trying to save each page as an object in memory. And speed increased too, even without target write. Thanks a lot.

jcupitt commented 10 months ago

Great!

(tangent, but it wasn't a memory leak -- that implies a memory reference has been lost due to a bug -- you were just seeing unexpectedly high memuse due to your program design)

kar-pev commented 10 months ago

Sorry, I hadn't tried it earlier, but I've run your code in a container and got 2+ GB of memory use. I think it could be a Docker daemon problem, but I'm not sure about that, and I don't know how to fix it with pyvips. Memory use grows exactly with the write_to_buffer call, but without it the memory use is fine for the task.

jcupitt commented 10 months ago

Maybe your container isn't using poppler to load PDFs, but is falling back to imagemagick? But that's just a guess, you need to share complete examples I can try before I can help.
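One quick way to check which loader actually ran inside the container (the vips-loader image field is set by recent libvips; treat this as a sketch):

import pyvips

image = pyvips.Image.new_from_file("file.pdf")
# expect "pdfload" for poppler/pdfium, "magickload" for the
# imagemagick fallback
print(image.get("vips-loader"))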

kar-pev commented 10 months ago

docker-compose.yaml file:

services:
  service_name:
    restart: always
    build:
      context: .
      dockerfile: Dockerfile
    container_name: name
    image: name

Dockerfile:

FROM python:3.9-slim

RUN apt update && apt install -y --no-install-recommends \
    libmupdf-dev \
    libfreetype6-dev \
    libjpeg-dev \
    libglib2.0-0 \
    libgl1-mesa-glx \
    libpq-dev \
    libvips-dev \
    poppler-utils \
    gcc

WORKDIR /app

COPY . .

RUN python3 -m pip install --no-cache-dir -r requirements.txt

CMD ["python3", "main.py"]

It will be enough to have only pyvips in requirements.txt for this example.

main.py file:

import pyvips

def main():
    image = pyvips.Image.pdfload("file.pdf")
    for i in range(image.get("n-pages")):
        # load to fill a 2048 x 2048 box
        page = pyvips.Image.thumbnail("file.pdf" + f"[page={i}]", 2048)[:3]
        page = page.gravity("centre", 2048, 2048, extend="white")
        data = page.write_to_buffer(".png")
        print(f"rendered page {i} as {len(data)} bytes of PNG")

if __name__ == "__main__":
    main()

It's all I'm using to get 2G+ of memuse (+ pdf file)

jcupitt commented 10 months ago

There's a lot of stuff you don't need in that dockerfile, I'd just have:

FROM python:3.9-slim

RUN apt-get update \
  && apt-get install -y \
        build-essential \
        pkg-config 

RUN apt-get -y install --no-install-recommends libvips-dev

RUN pip install pyvips 

I made a test dir:

https://github.com/jcupitt/docker-builds/tree/master/pyvips-python3.9

If I run:

docker build -t pyvips-python3.9 .
docker run -it --rm -v $PWD:/data pyvips-python3.9 ./main.py nipguide.pdf

I see a fairly steady c. 400 MB in top.

kar-pev commented 10 months ago

I'm copying your code (with CMD ["python", "main.py"] at the end of the Dockerfile) and getting 1.5 GB+. Could you please try with my file?

jcupitt commented 10 months ago

Yes, I see about 1.5g with that file too. It has huge image overlays on every page, so I think that's to be expected. It's just a very heavy PDF.

kar-pev commented 10 months ago

But the images + PDF use much less memory, so why is the rest of the used memory reserved and not freed? It seems like some cache that hasn't been cleared, because memory use stays at ~1 GB after the process finishes.

jcupitt commented 10 months ago

I would guess it's memory fragmentation from handling those huge overlays.

jcupitt commented 10 months ago

How are you measuring memory use? RES in top is probably the most useful.

kar-pev commented 10 months ago

I'm using docker stats. Could I clear or free some of this allocated memory? I'm trying to reduce the memory limits for the container.

jcupitt commented 10 months ago

I guess you could use the cli instead:

#!/bin/bash

pdf=$1
n_pages=$(vipsheader -f n-pages $pdf)

for ((i=0; i < n_pages; i++)); do 
  echo processing page $i ...
  vipsthumbnail $pdf[page=$i] --size 2048 -o t1.v
  vips extract_band t1.v t2.v 0 --n 3
  vips gravity t2.v page-$i.png centre 2048 2048 --extend white
done

rm t1.v t2.v

It's a bit slower though, and you'll see a lot of block IO.

jcupitt commented 10 months ago

Another option would be to use a malloc that avoids fragmentation, like jemalloc.

https://jemalloc.net/

But that's harder to set up.
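A sketch of one way to wire it into the Dockerfile above (the package name and library path are Debian amd64 specifics, so check them for your base image):

RUN apt-get install -y --no-install-recommends libjemalloc2
ENV LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2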

kar-pev commented 10 months ago

Ok, thanks, I'll try one of those