klippa-app / go-pdfium

Easy to use PDF library using Go and PDFium
MIT License

improve doc: how to choose the right parameters for multithreading #70

Closed: KaymeKaydex closed this issue 6 months ago

KaymeKaydex commented 1 year ago

How do I choose the right parameters for multithreading, and what improvement will this give?

        MinIdle:  2, // Makes sure that at least x workers are always available
        MaxIdle:  4, // Makes sure that at most x idle workers are kept around
        MaxTotal: 5, // The maximum total number of workers, idle or in use

I understand their meaning now, but I don't understand how to choose them correctly for optimal use.

jerbob92 commented 1 year ago

It depends on your use case and on the amount of resources you want to give pdfium, because it can and will eat them up.

I always set both the min and the max to the total number of CPU cores to make sure I don't overload the CPU: depending on what you're doing (we are rendering PDFs), every worker can (and will) take 100% of one core.

Workers also consume memory, depending on the size of the PDF.

Since starting a worker comes with startup time, you may want to keep a minimum number of workers available so requests don't have to wait for startup, and a higher max to absorb a sudden peak in load.

But it all depends on the use case, so if you have more info I can give you better advice.
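For example, pinning the pool to the core count would look something like this (a minimal sketch following the multi-threaded setup from the README; adjust the worker path to your own project):

    package main

    import (
        "runtime"

        "github.com/klippa-app/go-pdfium"
        "github.com/klippa-app/go-pdfium/multi_threaded"
    )

    var pool pdfium.Pool

    func init() {
        cores := runtime.NumCPU()
        pool = multi_threaded.Init(multi_threaded.Config{
            // Min and max pinned to the core count: workers are always
            // warm, and rendering can never claim more cores than exist.
            MinIdle:  cores,
            MaxIdle:  cores,
            MaxTotal: cores,
            Command: multi_threaded.Command{
                BinPath: "go",
                Args:    []string{"run", "pdfium/worker/main.go"}, // example worker path
            },
        })
    }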

KaymeKaydex commented 1 year ago

> It depends on your use case and on the amount of resources you want to give pdfium, because it can and will eat them up. […]

Will there be a difference between running a cluster of single-threaded instances and running one instance in multi-threaded mode, given that it has more resources? I'm currently running in k8s, and at an average of 100 rpm in single-threaded mode it uses about 1 percent. So the question is: will multi-threaded mode be more efficient?

jerbob92 commented 1 year ago

It may be. What are you doing with pdfium?

If you want to know whether it would be more efficient, you have to track how long your requests wait for an instance to become available from the pool (i.e., how long the GetInstance() call takes).
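A small wrapper can expose that wait time, for example (a sketch; log it or feed it into whatever metrics system you use):

    import (
        "log"
        "time"

        "github.com/klippa-app/go-pdfium"
    )

    // getInstanceTimed wraps pool.GetInstance and logs how long the
    // request waited for a worker to become available.
    func getInstanceTimed(pool pdfium.Pool) (pdfium.Pdfium, error) {
        start := time.Now()
        instance, err := pool.GetInstance(time.Second * 30)
        // If this wait grows under load, the pool itself is the
        // bottleneck and more workers (or more replicas) would help.
        log.Printf("waited %s for a pdfium instance", time.Since(start))
        return instance, err
    }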

KaymeKaydex commented 1 year ago

I only render the first page into an image.

jerbob92 commented 1 year ago

I think rendering is the most CPU/memory-intensive operation in pdfium. Check whether the requests are waiting for an instance to become available, and how long they are waiting.

Your process will be able to handle more rpm in multi-threaded mode, but I would limit the number of workers based on the amount of CPU/memory available on the node.

KaymeKaydex commented 1 year ago

> I think rendering is the most CPU/memory-intensive operation in pdfium. […]

Right now, at ~80 rpm, the response time is 100–2000 ms in single-threaded mode. I will try multithreading and compare the results.

jerbob92 commented 1 year ago

Rendering time depends heavily on the objects in the PDF and the chosen resolution/DPI. I think measuring how long GetInstance() takes will give you more insight.

KaymeKaydex commented 1 year ago

Another question came up: should I call Close() after each processed PDF file?

jerbob92 commented 1 year ago

Yes, you have to close the instance to make it available to the pool again. You can do multiple operations on the same instance if your use case allows for it.

KaymeKaydex commented 1 year ago

In my case I need to process a lot of incoming PDF files. When I use the following approach for every request:

    instance, err := pool.GetInstance(time.Second * 30)

    // Always close the document, this will release its resources.
    defer func(instance pdfium.Pdfium, request *requests.FPDF_CloseDocument) {
        _, _ = instance.FPDF_CloseDocument(request)
        instance.Close()
    }(instance, &requests.FPDF_CloseDocument{
        Document: doc.Document,
    })

it started working less stably, and I started getting errors like Timeout waiting for idle object.

my settings:

    min_idle: 15
    max_idle: 30
    max_total: 30

If I don't do instance.Close(), it looks like a memory leak on the charts. But I'm no expert here; apparently you only need to close the PDF file for the worker :)

jerbob92 commented 1 year ago

document.Close() will release all resources of the opened document, which is helpful if you're doing a lot of operations on the same instance in one go. instance.Close() will also close all opened documents, so if you only handle one document per pool.GetInstance() you can decide to just close the instance.

If you don't do instance.Close(), the pool will create new instances when you call pool.GetInstance() until it reaches max_total; that's why it might look like a memory leak.

If you get Timeout waiting for idle object, then either instances aren't being returned to the pool, or your pool is exhausted and requests have to wait longer than the timeout for an instance to become available.
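A per-request flow for the one-document-per-GetInstance case could look like this (a sketch; renderFirstPage and the 30-second pool timeout are illustrative, the pdfium/requests imports are assumed):

    func renderFirstPage(pool pdfium.Pool, fileData []byte) error {
        instance, err := pool.GetInstance(time.Second * 30)
        if err != nil {
            return err
        }
        // Returns the worker to the pool and closes any documents
        // that are still open on this instance.
        defer instance.Close()

        doc, err := instance.OpenDocument(&requests.OpenDocument{
            File: &fileData,
        })
        if err != nil {
            return err
        }
        // Optional when the instance is closed right after, but frees
        // the document early if you do more work on this instance.
        defer instance.FPDF_CloseDocument(&requests.FPDF_CloseDocument{
            Document: doc.Document,
        })

        // ... render the first page here ...

        return nil
    }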

KaymeKaydex commented 1 year ago

So it turns out that with a large number of PDF files to process, it's necessary to set up something like 100 workers? Does the pool make a worker available again by itself as soon as it is released?

jerbob92 commented 1 year ago

I would not do that; it will cause too much CPU load and the process will probably get killed because of memory shortage. I would not run more workers than the number of CPU cores you have if most of a request's time is spent in pdfium. Rather, look into horizontally scaling your application if you need more than that.

And yes, if you call instance.Close() it will automatically be returned to the pool.

KaymeKaydex commented 1 year ago

> And yes, if you call instance.Close() it will automatically be returned to the pool.

One more question: will instance.Kill() return the worker to the pool too?

KaymeKaydex commented 1 year ago

So, I have worked out an approximate formula for my use case. In the coming week I will add examples and a description of how to properly allocate resources for multi-threading.

KaymeKaydex commented 1 year ago

I use this library in a web server, with the following configuration:

config:
  workers:
    min_idle: 60
    max_idle: 80
    max_total: 100

with this k8s configuration:

request_memory: "5Gi"
request_cpu: 1.5
limit_memory: "6Gi"
limit_cpu: 2.0

and horizontal scaling with replicas: 10.

After a worker processes a PDF into an image, I close the worker. Despite this, under heavy load I still see the service degrade; I temporarily solved this by configuring k8s QoS. I also noticed trouble with the garbage collector, which forced me to use sync.Pool for the byte buffers I put the image into (see the sketch below). When using a small number of workers, the pool becomes unusable with the error Timeout waiting for idle object; this usually happens about once a week. My assumption is that there is still a problem hiding from me, but I'm continuing my research.
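The buffer reuse looks roughly like this (a simplified sketch, not my exact code; assumes the bytes, net/http, and sync imports):

    var bufPool = sync.Pool{
        New: func() any { return new(bytes.Buffer) },
    }

    func writeImage(w http.ResponseWriter, render func(*bytes.Buffer) error) error {
        // Reuse buffers between requests so each rendered image doesn't
        // allocate a fresh slice and pressure the garbage collector.
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()
        defer bufPool.Put(buf)

        if err := render(buf); err != nil {
            return err
        }
        _, err := buf.WriteTo(w)
        return err
    }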

KaymeKaydex commented 1 year ago

So far my assumption is that the trouble arises in the RPC calls through the hashicorp library. I'm also concerned that each new worker is a whole separate process; perhaps it's worth using the Linux fork mechanism for possible acceleration and better utilization of resources?

jerbob92 commented 1 year ago

We use this in production too on Kubernetes and process millions of PDFs with it.

KaymeKaydex commented 1 year ago
> I think that a 6GB memory limit is not enough for 100 workers.

Yes, I think so too, so we came to a solution with k8s QoS

> If you're getting timeouts, could it be that you're processing PDFs that hang in pdfium (and thus never return)? We have some logic on our side to catch such cases and kill the process when it takes too long.

Yes, I also close the worker, only not on timeouts but after each file is processed:

        err = instance.Close()
        if err != nil {
            log.Error().Err(err).Msg("cant close instance")
            m.IncInstanceCloseErr(err)

            err = instance.Kill()
            if err != nil {
                log.Ctx(ctx).Error().Err(err).Msg("cant kill instance; smth very bad happen")
            }
        }

> Can you tell me why the RPC would be the issue here? I did not see any issues on that side of the implementation (yet).

Regarding RPC, my assumption is based solely on a profile I captured of the service, where there were two main problems:

  1. A byte buffer is allocated for each request, and the garbage collector goes crazy because of it. I used sync.Pool and noticed significant improvements.
  2. The buffer on the RPC call.

> Can you tell me why fork would work better here? And how could that be implemented in a process like this to utilize resources better?

I'll try to come back with a PR a little later. At least the way the process is used now somewhat resembles the Apache web server model; maybe that's a mistake in my implementation.

> Can you tell me how you're rendering the images? If you use RenderToFile, you can win some time because it doesn't have to send the full raw image data over RPC between the processes.

Sounds like a very cool idea, I'll try to implement it in my web server. Thank you very much :)

In general, according to my observations, about 120 workers are needed for 60 rps. If it's not secret information, could you tell me in which configuration and at what rps you use the library? And is it right to kill the worker after each processed PDF file? Or should I always do Close(), and only do Kill() for a specific error checked via errors.Is()?

KaymeKaydex commented 1 year ago

And now I'm also checking with errors.As for the idle timeout from the pool, and in that case I recreate it.

jerbob92 commented 1 year ago

> Yes, I think so too, so we came to a solution with k8s QoS

But that's not really a solution; Kubernetes will just kill the pdfium processes if you go over the limit.

> Yes, I also close the worker, only not on timeouts but after each file is processed

That's not what I mean: in some cases pdfium can hang on a PDF and never return, so there is no "after each file is processed", and the worker will never be returned to the pool, unless you monitor it from a separate goroutine and kill the worker when it takes too long. That's not something we can solve in the library itself, but it would be possible to build this monitoring into the library.

Is there a way for you to see if this is what happened?

> In general, according to my observations, about 120 workers are needed for 60 rps. If it's not secret information, could you tell me in which configuration and at what rps you use the library? And is it right to kill the worker after each processed PDF file? Or should I always do Close(), and only do Kill() for a specific error checked via errors.Is()?

We have CPU-based autoscaling on our pods, and our pods are configured to have a maximum of 4 pdfium workers; our QoS is a memory request of 4GB and a limit of 8GB. How much memory is used really depends on the type of PDF and the chosen render size.

You should only do Close() in normal cases. The pool manager already restarts the worker itself in case of internal worker errors.

We have wrapped our GetInstance() with the following code:

func GetInstance(killTimeout time.Duration) (pdfium.Pdfium, func(), error) {
    pdfiumInstance, err := pdfiumPool.GetInstance(time.Second * 30)
    if err != nil {
        return nil, nil, err
    }

    // closed tells the watchdog goroutine that the instance was
    // returned normally, so the goroutine doesn't leak after Stop().
    closed := make(chan struct{})
    var closeOnce sync.Once

    // Time after which the process should be killed.
    killTimer := time.NewTimer(killTimeout)
    go func() {
        select {
        case <-killTimer.C:
            // Still not closed after killTimeout: kill the worker.
            closeOnce.Do(func() {
                pdfiumInstance.Kill()
            })
        case <-closed:
        }
    }()

    closeInstance := func() {
        killTimer.Stop()
        // closeOnce guarantees Close() and Kill() can't both run.
        closeOnce.Do(func() {
            close(closed)
            pdfiumInstance.Close()
        })
    }

    return pdfiumInstance, closeInstance, nil
}

This will kill the worker process if it's still not closed after killTimeout.
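A caller then uses it like this (illustrative; the one-minute timeout is just an example):

    instance, closeInstance, err := GetInstance(time.Minute)
    if err != nil {
        return err
    }
    // Normal path: return the worker to the pool. If the render hangs
    // inside pdfium, the timer above kills the process instead.
    defer closeInstance()

    // ... do the pdfium work with instance ...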

I have created a small webserver example here: https://gist.github.com/jerbob92/b3d94530944ec5e69f71a7dacbb6e695. It renders the first page of a PDF at the given DPI. With this example, rendering a simple PDF at 200 DPI takes about 75ms; with 4 workers, that means it can do about 50 rps with only 1 instance. Real-world numbers depend completely on the content of the PDFs, of course, but this should give you an idea.

KaymeKaydex commented 6 months ago

Thank you very much for the help. I also took a look at your repositories, which helped me get the web server up and running. I think the issue can be closed.

In this configuration everything works perfectly, except for pdfium occasionally hanging.