Tested for concurrency?

dtoso-skymesh commented 9 years ago

In the past, running multiple copies of wkhtmltopdf concurrently had issues: there were threading problems, and some named pipes were created at the same filesystem path across multiple wkhtmltopdf processes. (I found out after an ugly incident in which customers received each other's PDFs in a parallelized batch run.)

wkhtmltopdf seems to have had many version bumps since then, but nothing I've read from the commits screams out that this issue has been fixed.

In openlabs/docker-wkhtmltopdf-aas the gunicorn WSGI daemon seems to fork on request, so if the concurrency issue still exists in wkhtmltopdf then this service exports the problem to the service's users.

In my use case, I needed to replace the version of wkhtmltopdf shipped with openlabs/docker-wkhtmltopdf-aas with a statically linked copy of wkhtmltopdf 0.10.0rc2, because the PDF output from identical HTML had changed over the years due to WebKit HTML rendering fixes. (I have legacy HTML that would be a massive PITA to change.)

As I know that at least my version of wkhtmltopdf (0.10.0rc2) has concurrency issues, I'm treating docker as an isolation mechanism rather than simply a deployment helper. I have 20 identical containers running with a home-made HTTP load-balancing proxy sitting in front of them. It hands off (unmodified) requests to available containers and makes subsequent requests wait until workers become available (by simply blocking on the HTTP response).

sharoonthomas commented 9 years ago

Testing the returned content in a PDF is a PITA. Any ideas on how a test with concurrency could be done?

dtoso-skymesh commented 9 years ago

I wrote a Perl script (call it 'single.pl') that:

generates a random ID and MD5s it,
prints the <MD5> without a newline,
takes known, simple HTML and substitutes the <MD5> into that HTML,
makes a JSON-mode HTTP POST request to the service,
compares the output to the expected <MD5> using pdftotext from poppler-utils.

The comparison is done through this pipeline:

pdftotext - - | grep <MD5>
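For anyone who would rather not write Perl, here is a minimal Python sketch of the same single-request check. It is not the original single.pl. The template and function names are made up for illustration, and the `contents` field holding base64-encoded HTML is my assumption about the service's JSON format, so check it against the project README before relying on it.

```python
import base64
import hashlib
import json
import subprocess
import uuid

# Hypothetical template: any known, simple HTML with one placeholder will do.
TEMPLATE = "<html><body><p>token: {token}</p></body></html>"


def make_token() -> str:
    """Hash a random ID with MD5 to get a 32-character hex marker."""
    return hashlib.md5(uuid.uuid4().bytes).hexdigest()


def build_payload(token: str) -> bytes:
    """Build the JSON request body (the 'contents' field name is an assumption)."""
    html = TEMPLATE.format(token=token)
    body = {"contents": base64.b64encode(html.encode("utf-8")).decode("ascii")}
    return json.dumps(body).encode("utf-8")


def pdf_to_text(pdf: bytes) -> str:
    """Same as `pdftotext - -`: PDF on stdin, extracted text on stdout."""
    done = subprocess.run(["pdftotext", "-", "-"], input=pdf,
                          stdout=subprocess.PIPE, check=True)
    return done.stdout.decode("utf-8", errors="replace")


def token_matches(text: str, token: str) -> bool:
    """Same as `grep <MD5>`: is the expected marker in the extracted text?"""
    return token in text
```

A run would POST `build_payload(token)` to the service with a `Content-Type: application/json` header, pass the response body to `pdf_to_text`, and then call `token_matches`. A PDF that contains some other request's token is exactly the cross-talk being tested for.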

Then I wrote another Perl script (call it 'bench.pl') that forks 5 children, where each child executes single.pl 20 times with a randomised Time::HiRes::usleep between requests. I log the command line and the result of the grep to a file and then grep that file for mismatches.
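A rough Python equivalent of that driver is sketched below. It uses a thread pool instead of forked children, which is a substitution on my part and not what bench.pl does. `run_once` is a hypothetical callable standing in for one single.pl run; it is assumed to return a `(token, ok)` tuple.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def worker(run_once, runs, max_sleep):
    """Call run_once() `runs` times with a random pause after each call."""
    results = []
    for _ in range(runs):
        results.append(run_once())  # expected shape: (token, ok)
        time.sleep(random.uniform(0, max_sleep))
    return results


def bench(run_once, workers=5, runs=20, max_sleep=0.05):
    """Run `workers` concurrent loops and collect every (token, ok) result."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker, run_once, runs, max_sleep)
                   for _ in range(workers)]
        return [result for future in futures for result in future.result()]


def mismatches(results):
    """Return the tokens whose PDF did not contain the expected marker."""
    return [token for token, ok in results if not ok]
```

With the defaults this issues 100 requests in total. Any non-empty list from `mismatches(bench(run_once))` means at least one response did not contain its own token.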

sharoonthomas commented 9 years ago

@dtoso-skymesh :+1: thank you

alicpr commented 3 years ago

We are going to use this at enterprise scale, handling 100 req/s on each server. Does the issue still exist? Is there any alternative solution available?

dtoso-skymesh commented 3 years ago

@alicpr not sure if @sharoonthomas has fixed this issue, but I worked around it by running many docker containers, each running wkhtmltopdf-aas. The solution was to send only one request at a time to each container.

If you've got a fast enough machine (or machines), you could just launch the containers on demand (docker run) from, say, a CGI script.

On our hardware that wasn't fast enough, so I came up with an HTTP-proxy-based solution. Basically it does the following:

at startup & periodically: uses the docker socket protocol to ask for a list of available docker containers running wkhtmltopdf-aas, along with their NAT'd IP address and port.
uses select(2) to respond to large numbers of requesting HTTP clients -- they see the proxy as a blocking server
maintains a mapping of active docker containers to client sockets, blocking an HTTP client if none is available
when a container becomes available (new container or previously completed/aborted request) the request is forwarded in a non-blocking manner to the mapped docker container's IP address and port.
responses to container HTTP requests are forwarded back to the clients in a non-blocking fashion
when a client-response is completed, the container mapping is removed to service requests for other clients
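The core rule in the steps above (one in-flight request per container, with clients blocking until a container is free) can be sketched in a few lines of Python. This is not the proxy described here: it swaps the select(2) event loop for a thread-safe queue, leaves out container discovery, and `ContainerPool`, `http_post` and the backend URLs are illustrative names of my own.

```python
import queue
import urllib.request


class ContainerPool:
    """Hands each backend to at most one request at a time."""

    def __init__(self, backends):
        self._idle = queue.Queue()
        for url in backends:
            self._idle.put(url)

    def dispatch(self, send, body, timeout=None):
        """Block until a backend is idle, send the request, then free it."""
        # Waits while every container is busy; raises queue.Empty on timeout.
        backend = self._idle.get(timeout=timeout)
        try:
            return send(backend, body)
        finally:
            # The backend is returned to the pool on errors too.
            self._idle.put(backend)


def http_post(backend, body):
    """Forward an unmodified JSON body to one container; return its response."""
    req = urllib.request.Request(backend, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Each client-handling thread would call `pool.dispatch(http_post, request_body)`. A real deployment would also need the periodic discovery step, so that new or dead containers are added to or dropped from the pool.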

I've found the limiting factor to be the server hardware.

openlabs / docker-wkhtmltopdf-aas

Tested for concurrency? #11