Hi @se0siris,
Our C base library MuPDF is "threading-agnostic" (their wording), i.e. it does not itself provide multi-threading capabilities. Nonetheless, the MuPDF documentation contains hints / examples on how to support multi-threading - which we haven't bothered to dive into yet. But your inquiry does motivate me to give it a try ;-)
In the meantime, some information you may find interesting or even useful:
Methods doc.getPagePixmap() (= page.getPixmap()) are actually aggregates of low-level stuff.
The iteration over all document pages (mat being a matrix like fitz.Matrix(0.2, 0.2) to create thumbnail pictures of 20% original size):
for page in doc:
    pix = page.getPixmap(matrix=mat, alpha=0)
is just a short version of
for page in doc:
    # create DisplayList ------------------------------------------------
    r = page.rect                            # (+) the page mediabox
    dl = fitz.DisplayList(r)                 # create a DisplayList
    page.run(fitz.Device(dl), fitz.Identity) # run page through the DisplayList
    page = None                              # page no longer needed
    # generate Pixmap ---------------------------------------------------
    r.transform(mat)                         # (+) get thumbnail size of mediabox
    ir = r.irect                             # (+) integer rectangle version
    pix = fitz.Pixmap(fitz.csRGB, ir, 0)     # (+) alloc pixmap (no transparency)
    pix.clearWith(255)                       # clear its memory to "white"
    dl.run(fitz.Device(pix, None), mat, r)   # run DisplayList to fill the pixmap
Running the above through all of the Adobe manual (1,310 pages) took just a little over 5 seconds on my machine. Here are the details:
total time: 5.02797 sec
DisplayList time: 1.92725 sec
render time: 2.22468 sec
pixmap time 82.5766% of total
DisplayList vs rendering: 86.6305%
This means that by separating DisplayList and Pixmap creation in a clever way, an almost 50% reduction of overall runtime might be achievable.
BTW: if all document pages have the same dimensions, then statements marked with (+) only need to be executed once - giving some minor (2%), but easily achievable improvements.
One option for attacking mass thumbnail creation could be: put all document access in a separate (Python) task, which performs the open and then starts generating display lists (let's say, limited by some count). Whenever the main task wants the next thumbnail, the corresponding display list is rendered, the pixmap returned and the display list deleted from the stack.
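A rough sketch of what that could look like (hypothetical and untested; it reuses the low-level DisplayList code from above, with queue / threading from the standard library):

import queue
import threading

import fitz

mat = fitz.Matrix(0.2, 0.2)

def producer(doc, dl_queue):
    # all document access stays in this one task:
    # parse each page into a display list and queue it up
    for page in doc:
        r = page.rect
        dl = fitz.DisplayList(r)
        page.run(fitz.Device(dl), fitz.Identity)
        dl_queue.put((dl, r))              # blocks once 'maxsize' lists are waiting

doc = fitz.open("some.pdf")
page_count = len(doc)
dl_queue = queue.Queue(maxsize=20)         # "limited by some count"
threading.Thread(target=producer, args=(doc, dl_queue), daemon=True).start()

for _ in range(page_count):                # main task: render whenever needed
    dl, r = dl_queue.get()
    r.transform(mat)                       # thumbnail-sized rectangle
    pix = fitz.Pixmap(fitz.csRGB, r.irect, 0)
    pix.clearWith(255)
    dl.run(fitz.Device(pix, None), mat, r)
    # the display list goes out of scope here, i.e. is dropped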
Let me think this through.
Please do provide more details on your requirements if / when available.
As mentioned in my previous post, creating pixmaps from pages involves mainly two steps per page: (1) creating a so-called "display list", (2) the actual rendering.
Display lists are a notion internal to MuPDF. They are more or less just the result of parsing a document page and provide a unified starting point for rendering and text extraction. Once a DL exists, the original page definition in the document never needs to be parsed again.
From my findings with several (PDF) example documents, the times required to perform these two steps are more or less of the same order of magnitude.
Whenever performance is an issue, a fairly easy action to undertake is creating display lists for (all) pages and then performing rendering / text extraction on these display lists only. To reduce coding effort, I have created new methods that represent these two steps: page.getDisplayList() and dl.getPixmap(). Here is one of my test scripts:
dl_tab = []  # stores display lists

# create the DisplayLists for all pages
for page in doc:
    dl = page.getDisplayList()
    dl_tab.append(dl)

# render page images
for pno, dl in enumerate(dl_tab):
    pix = dl.getPixmap(mat)
    # pix.writePNG("img-%i.png" % pno)  # example use of the pixmaps
These two iterations need about equal elapsed time per document (unless pixmaps are additionally saved as image files - commented out here).
So, creating and storing display lists in variable dl_tab could be performed as part of opening the document. Any rendering / text extraction afterwards would just use one of these display lists (their index obviously equals their page number) and thus need only 50% of the time of a full page.getPixmap().
The dl_tab store of all pages of course leads to increased memory requirements. This should be around 200 MB for a file as large as the Adobe manual (1,310 pages), or less.
MuPDF does allow multithreading - to a certain, limited extent.
Functions directly accessing the same document must run in the same thread in which the document was opened. In our case, this pertains to page.getDisplayList() - these calls cannot be distributed across multiple threads.
Rendering and text extraction only access display lists (not the document), and therefore are multithreading candidates.
This means that only 50% of the overall processing time is available for multithread optimization - if we disregard the time to store the pixmaps in image files. Assuming 4 (available!) CPU cores, these 50% would ideally be brought down to 50/4 %; a more realistic figure is savings of 2/3.
Let me clarify this using the Adobe manual example. This might result in the following total elapsed time:
2.5 + 2.5 * 0.33 + 4.0 * 0.33 = 4.65 sec
(2.5 sec of display list creation stays single-threaded, while 2.5 sec of rendering and 4.0 sec of image file output are each cut to about one third), as opposed to 9 sec without multithreading - a saving of 48% elapsed time.
pthread.h and its associated shared libraries exist on Posix systems, but not on Windows, and we don't know about Mac OSX. In the end, there seems to be no way to implement something that lets us keep a common source for all platforms we currently support. Not to mention testing everything on every platform ...

An alternative is starting parallel processes, each invoking MuPDF's command line utility mutool draw -o img-%d.png -r 14.4 your.pdf m-n via Python's subprocess.Popen (each with a suitably selected page range m-n). The parameter -r controls the thumbnail size: 14.4 dpi is 20% of 72 dpi (= 100%). This results in independent OS subprocesses running fully (not just partly) in parallel. I have again tested the Adobe manual with parallel batches, each with at most 100 pages, which gave me a 2/3 overall runtime reduction (3.1 sec vs. 9.1 sec). This is an example script invoking the utility:
import fitz
import time, sys
import subprocess

fname = sys.argv[1]
doc = fitz.open(fname)
pageCount = len(doc)
doc.close()

pranges = []
a = 0
while a < pageCount:
    pranges.append("%i-%i" % (a, a + 100))
    a += 101

cmd = "mutool draw -o img-%%d.png %s %s"
t0 = time.clock()
tasks = []
for r in pranges:
    t = subprocess.Popen(cmd % (fname, r), shell=True)
    tasks.append(t)
for t in tasks:
    t.wait()
t1 = time.clock()
t_time = t1 - t0
print("total time: %g sec" % t_time)
print("-".rjust(80, "-"))
@rk700 - I would value any comments you may have on this. Do you have any idea how to implement multithreading on Mac OS? Does it support the pthreads "standard"? I have seen the open source project "pthreads on win32" for pthreads on Windows, but I haven't used it, and I do not know at all whether it would be source-portable to the other platforms that support pthreads, etc.
AFAIK, OSX uses its own thread model. And as you mentioned, pthread is not supported on Windows, and some open source implementations may be unstable.
I'd like to use some other standard like C++11 threads, though I'm not sure if it is available on Windows.
Fully agree. VC++ (contained in Visual Studio 2015 and up) does support many C++11 features, and threading is among them! But before we do anything, we need to be sure about Mac OSX, too.
If we support threading some day, we need to offer our users an option whether they want to include threading or not. Otherwise we would force everyone to have a compiler supporting C++11, right?
Clang 3.3 and later implement C++11, so it's fine on OSX.
And yes, we'll have to allow users to compile PyMuPDF either with multi-threading or without (maybe via macros).
Hey guys,
Thanks for the replies on this - it's interesting to see the inner workings described at a lower level!
Reading back my original post I see that I could have been clearer and given more details. I'm generating thumbnails from multiple single-page PDF files that have already been extracted from a multi-page PDF. They're stored as single pages mainly to make the initial loading of the "document" faster in my application over slow network connections. Using your example of the Adobe PDF reference doc, the multi-page PDF is 30.9 MB whereas the first page of the split document is only 13 KB. The total size of the split document is actually larger than the original multi-page version, but when flitting between multiple documents it's worth it for this use, especially given that most of the time a user won't be viewing every page of a document anyway.
When a user selects a document in my application, the filepaths for each page are pulled from a database and added to a Queue. I then fire up a thread for each CPU core, where helper functions for each supported file type are called to load the image, convert it to a QImage thumbnail (for displaying in the Qt based GUI) and emit that thumbnail back to the main thread for display in a scrollable list.
Given that the thumbnail creation is broken down into a separate function for each supported image type, it's easy enough for me to swap out the current version that uses Poppler with one that uses PyMuPDF. Here's a simplified version of what each thread is doing to process a PDF file:
def load_pdf(filepath):
    doc = fitz.Document(filepath)
    if not doc:
        image_object = QImage()
    else:
        matrix = fitz.Matrix(1, 1).preScale(0.1, 0.1)
        pix = doc.getPagePixmap(0, matrix=matrix, alpha=False)
        image_object = QImage(pix.samples, pix.width, pix.height,
                              pix.width * pix.n, QImage.Format_RGB888)
    doc.close()
    return image_object
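For context, here is roughly how the worker threads described above drive load_pdf() (a simplified sketch; the real application emits Qt signals instead of collecting results in a list):

import os
import queue
import threading

filepaths = ["page-001.pdf", "page-002.pdf"]   # really pulled from the database

job_queue = queue.Queue()
for path in filepaths:
    job_queue.put(path)

def worker(job_queue, results):
    # keep pulling filepaths until the queue is drained
    while True:
        try:
            filepath = job_queue.get_nowait()
        except queue.Empty:
            return
        results.append(load_pdf(filepath))

results = []
threads = [threading.Thread(target=worker, args=(job_queue, results))
           for _ in range(os.cpu_count())]
for t in threads:
    t.start()
for t in threads:
    t.join()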
I ran 3 tests with this code processing the Adobe document using a single thread, then again using 4 threads:
PyMuPDF (single thread)
Thumbnails loaded in 17.31800 seconds
Thumbnails loaded in 17.82800 seconds
Thumbnails loaded in 17.49500 seconds
PyMuPDF (4 threads)
Thumbnails loaded in 17.26200 seconds
Thumbnails loaded in 17.40600 seconds
Thumbnails loaded in 17.25200 seconds
The results are pretty comparable and show that there's no performance benefit to be gained from using multiple threads here.
The function for Poppler is pretty much the same, but there's a significant speed boost when using threads:
def load_pdf(filepath):
    doc = Poppler.Document.load(filepath)
    if not doc:
        image_object = QImage()
    else:
        page = doc.page(0)
        image_object = page.renderToImage(40, 40)  # DPI = 40
    return image_object
Poppler (single thread)
Thumbnails loaded in 21.03600 seconds
Thumbnails loaded in 21.07000 seconds
Thumbnails loaded in 21.66500 seconds
Poppler (4 threads)
Thumbnails loaded in 6.46600 seconds
Thumbnails loaded in 6.76100 seconds
Thumbnails loaded in 6.78300 seconds
In addition to the speed improvement, my application's GUI remains functional and usable while generating the thumbnails, allowing me to show a progress bar and update other GUI areas, whereas PyMuPDF locks the GUI until all thumbnails are returned.
My (limited) understanding of this is that Python does not support "true" multithreading and uses the GIL to block one thread while another is running. The exception being that when calling a C library the GIL can be released (within the C code), allowing control to return to Python while the C code carries on with its thing. I'm guessing this is something the Poppler library is doing, which would also explain why my application's GUI isn't locked while generating the thumbnails.
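A quick way to test that hypothesis is to time the same rendering call serially and in threads - if the GIL is held throughout the C call, both figures come out about the same (a sketch under that assumption; test.pdf is a placeholder):

import threading
import time

import fitz

def render(path):
    doc = fitz.open(path)
    pix = doc.getPagePixmap(0, matrix=fitz.Matrix(0.1, 0.1), alpha=False)
    doc.close()

t0 = time.clock()
for _ in range(8):
    render("test.pdf")
serial = time.clock() - t0

t0 = time.clock()
threads = [threading.Thread(target=render, args=("test.pdf",))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.clock() - t0

# near-identical numbers mean the GIL was never released
print("serial: %g sec, threaded: %g sec" % (serial, threaded))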
My hope is that it would be possible to simply add a "release GIL" wrapper around any parts of the C code that could make use of Python threads without much effort, but this is something I know nothing about. There seems to be something in the generated SWIG code starting at the point linked below, but I don't know whether that means it's being used or not.
I'm guessing that if this were possible, it would still require PyMuPDF to be compiled with multithreading enabled, which looks to be the main hurdle discussed in your comments above - although things are looking positive!
I'm sorry for not being much use with the helping - I realise I could check out the code and attempt to build things myself, but I have zero knowledge of C related things so there would be a lot to learn first! I figured if I were to take the route of playing around with things, it would be worth at least asking if it's doable.
Never mind - don't be shy asking / requesting stuff. The worst thing happening could be a "no" :-))
But don't worry, I'm a retired guy living in a beautiful country with lots of time ... If I say "no" then only because it's either beyond my capabilities or beyond what can be done with MuPDF.
First of all thank you for the background. Still making sure I understand every detail, though.
Apparently,
If this is correct so far, I will need to check whether MuPDF can be made to support this.
Just a little bit of background at this point:
What we actually seem to need for your case:
Another aspect is, what (if anything) must be done to get SWIG supporting this ...
A side question: You wrote you are splitting up PDFs into 1-pagers. Why don't you create PNG thumbnail images right away as part of the process? And then have Qt read those PNGs?
Another thing I came across: Poppler obviously has its own special / direct interface to Qt, QImage Poppler::Page::renderToImage, as you mentioned.
I suspect this interface is responsible for the UI-non-blocking nature of the rendering. I.e. even if I find a way to multithread as described above: how / where do I pass the rendered pixmaps to?
@rk700 - any helpful experience with Qt or Poppler?
@JorjMcKie Not really. Sorry.
But I agree that the global context may be what prevents multi-threading. Currently it is a global variable that gets initialized when the module is loaded; we might somehow make it possible for threads to create their own contexts.
And SWIG itself supports creating multi-threaded interfaces by simply adding the option -threads. I've tried regenerating the fitz_wrap.c file, and now every callback into the MuPDF library is wrapped with macros:
@@ -9537,7 +9660,11 @@ SWIGINTERN PyObject *_wrap_Document__getPageObjNumber(PyObject *SWIGUNUSEDPARM(s
   }
   arg2 = (int)(val2);
   {
-    result = (PyObject *)fz_document_s__getPageObjNumber(arg1,arg2);
+    {
+      SWIG_PYTHON_THREAD_BEGIN_ALLOW;
+      result = (PyObject *)fz_document_s__getPageObjNumber(arg1,arg2);
+      SWIG_PYTHON_THREAD_END_ALLOW;
+    }
     if(!result)
     {
I guess the above macros would release the GIL. Here's what I found for reference: https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock
Yet it still has to be tested on all platforms.
MuPDF supports rendering pages in multiple threads. Here's an example, though it only shows how to render multiple pages of a single document: https://github.com/muennich/mupdf/blob/master/docs/examples/multi-threaded.c
We can see that the global context is cloned in each thread by calling fz_clone_context(). But in PyMuPDF the global context is invisible to users, and there's no way to choose which context to use. A simple workaround would be to let users create contexts all by themselves, but lots of interfaces would have to change then.
@rk700 - thanks for the hint. I have tried the -threads option right away, and it does work on Windows, too.
In one of my test scripts I am using Python threading to create pixmaps - and this script now crashes (because of missing MuPDF thread support). So this was a success ^^.
Maybe @se0siris, you can try the same and simply see what happens, i.e. what type of crash you are getting.
In the meantime I will try to generate a threading version of PyMuPDF ...
@rk700 After looking around for a while, I now also believe that multiple global contexts ("GCs") should be the solution.
I do not know how expensive fz_clone_context() is in terms of runtime. But it obviously means copying the whole existing context into a new area, except the exception stack, which is specific to a thread. In contrast, using separate GCs would allow for much more flexibility.
So my current thinking is not to go for MuPDF threading, but for multiple GCs instead.
I am not certain about the best approach yet. What do you think about this? The interesting part is: how should the user / programmer be able to switch between GCs?
- Via a parameter context = n in all classes and methods?
- Via a method setGC(n)? This would store GC pointer table entry n in gctx, and all further MuPDF calls would be using this GC.

We would also have to prevent different tasks from using the same GC in some way - because we need the -threads option, this type of error is possible. We could use MuPDF threading to prevent this, using the GC as a mutex resource. The GC structure contains an id and also a user field that could be used for this.
Maybe the easiest way to solve this: restrict creating additional GCs to new Python threads. The pseudocode running in a thread would be:
gcid = fitz.newGC() # allocate a new global context
fitz.setGC(gcid) # use it under the variable `gctx`
doc = fitz.open(...) # normal thread ...
pix = doc.getPagePixmap(...) # ... specific processing
doc.close() # close document
image_object = .... # create value to be returned
fitz.delGC(gcid) # release GC and set `gctx = GC0`
return image_object # return result
Hm - the above will probably not be feasible, because gctx will be overwritten by parallel tasks ...
I'm afraid we have to change all code to no longer use gctx, but instead look up the relevant GC, whose id must be passed in on every call: instead of fitz.open(filename) we would need fitz.open(filename, gc=gcid), etc.
Hm.
@rk700 I have tried a few things in the meantime, but I wasn't successful along the aforementioned model. I see the following requirements for any potential solution:
- Each thread should perform: clone ctx / open pdf / get pixmap / close pdf / drop ctx clone / return pixmap.
- The context should be bound to fitz.Document. Once the document is created, it should store its context (as an object attribute) and pass it along to all dependent objects' creations (pages, links, ...).
- Exceptions are objects (RECT, MATRIX, etc.) which are context-independent.

I am stuck finding a good way to store the current ctx address. One way could be a new attribute Document.context. This can be either an integer (problematic) or a PyCapsule object. Every method of Document and every dependent object would then extract the context pointer from its Document.context and use it in place of the current variable gctx.
Whatever I try, there seems to be no way to make ~fz_document_s() accept parameters deriving from Document.context.
Can you think of another way? What about using a global table (let's say limited to 10 entries) of contexts - which would then need some thread-safe [!!!] maintenance functions for filling, selecting, emptying, deleting ...?
How about using TLS (Thread Local Storage)? I know that TLS can be used for storing global variables per thread, though I haven't used it yet.
I'll take a deeper look at it too.
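Expressed in Python terms, TLS behaves like threading.local(): each thread sees its own copy of an attribute, which is what TLS would do for the C-level gctx (an illustrative sketch only - the real change would live in the generated C code):

import threading

tls = threading.local()

def create_context():
    return object()  # stand-in for the real fz_new_context() call

def get_context():
    # each thread lazily creates, then reuses, its own context
    if not hasattr(tls, "ctx"):
        tls.ctx = create_context()
    return tls.ctx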
And @JorjMcKie, could you please show me the Python script for your multi-threading tests?
I read MuPDF's multi-threading example again, and I think we may have to call fz_clone_context() instead of creating a totally new context for each thread. Some important resources, such as the store and the glyph cache, can be shared among threads by using the cloned context.
And in order to minimize changes to our code, we could add some wrapper functions looking like the following:
def renderPage(pageNumbers=[0, 1, 2], threads=4):
    pass
Inside it, the C multi-threading code would be called.
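Usage from Python would then stay as simple as, e.g. (hypothetical):

pixmaps = renderPage(pageNumbers=[0, 1, 2], threads=4)  # one pixmap per page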
@rk700
Here is one of my multithreading tests. It renders page 0 of all PDFs contained in a directory (to simulate what @se0siris is doing):
test.txt
This script works ok without the SWIG -threads parameter, but of course has the same runtime as a similar script without threading (about 5 sec for 150 PDFs on my machine).
The interpreter crashes, however, when running it with a PyMuPDF generated using -threads.
If I understand you right, then PyMuPDF should stay single threaded from a Python perspective. Indeed, the code change for making the global context a parameter is big! And, for single file access, no speed improvement would result.
We should however offer a selected few high-speed functions for e.g. page rendering. These functions would be using MuPDF's C-level multi-threading.
From a Python perspective, renderPage would simply be a method returning 4 pixmaps ... from a single document. The function would be coded much like the MuPDF example.
For simultaneous rendering of several files, another function would be needed, which on the C level would contain open, render, close in each of its threads.
I think I made a very serious mistake from the beginning. A context should be bound to a document instead of being used as a global variable.
As you mentioned above, maybe we'll have to postpone the initialization of context until a document is created.
I just tried using TLS for the context, which would minimize changes to the current code. But the test result is not as expected: I put gctx in TLS and initialized it when creating a document, then added the -threads option to SWIG. Time for 268 documents is 3.29221 seconds.
I'm a little confused and doubt whether the context is the bottleneck.
Interesting discussion! I think we have to thank @se0siris for the inspiration his issue is generating ^^!
@rk700 - I would like to defend your original design decision: there are very few pieces of MuPDF that do not use a context - the full area of geometry (points, rectangles, matrices) is an example. But even working with pixmaps requires a context, whether reading from or writing to image files - without touching any document.
Joining (PDF) documents is an example why not all documents can have their own context: these functions only work with a context common to both.
My above example about "threaded" processing of documents (@se0siris' requirement) would imply opening a document using a cloned context. I am not even sure whether that can work at all ...
I do not want to overstate the obvious, but all threading comes with an overhead. And as for the Python-specific threading overhead: I would assume it is bigger than that of C-level threading.
But the OS threading overhead is also quite remarkable: I used subprocess.Popen to render Adobe's manual (1,310 pages) in independent processes. My machine has 8 cores, and I could see that 6 of them were in use for the rendering scripts. But instead of dividing the overall rendering runtime by 6, the resulting division was by about 3: instead of 9 seconds the whole thing needed about 3. This was independent of whether I used my own script for rendering or mutool draw.
Of course there was overhead not to be attributed to threading itself: setting up the (Py-) MuPDF environment and opening the document 1310 times ...
But still: the smaller the piece of work of a thread, the higher the threading overhead impact.
I have done a number of further tests in the meantime, using my set of 152 PDFs - all of them scientific magazines, graphics-oriented, with sizes ranging from 8 MB to 34 MB each (14 MB average).
The main objective was to get an understanding of the threading overhead and of the time required to do a fitz.open() for non-trivial PDF files.
Note that the threading measurements were done with a non-threading PyMuPDF. This means: if we ever support the global context as a parameter in PyMuPDF, the threading overhead will go up (e.g. because of global context allocation and de-allocation)!
fitz.open() times were measured with both Python 3.6 and Python 2.7. The serial script:

import fitz
import time
import os

pdfdir = "pdf"  # the PDF directory (placeholder path)
pdfs = os.listdir(pdfdir)
t0 = time.clock()
cnt = 0
for i, pdf in enumerate(pdfs):
    if not pdf.endswith(".pdf"):
        continue
    cnt += 1
    pth = os.path.join(pdfdir, pdf)
    doc = fitz.open(pth)
t1 = time.clock()
print("time needed for %i documents: %g" % (cnt, t1 - t0))
print("average time %g" % ((t1 - t0) / cnt))
The threaded script:

import fitz
import time
import os
import threading

pdfdir = "pdf"  # the PDF directory (placeholder path)
pdfs = os.listdir(pdfdir)

def worker(fname):
    doc = fitz.open(fname)

t0 = time.clock()
cnt = 0
threads = []
for i, pdf in enumerate(pdfs):
    if not pdf.endswith(".pdf"):
        continue
    cnt += 1
    pth = os.path.join(pdfdir, pdf)
    t = threading.Thread(target=worker, args=[pth])
    t.start()
    threads.append(t)
for t in threads:
    t.join()
t1 = time.clock()
print("time needed for %i documents: %g" % (cnt, t1 - t0))
print("average time %g" % ((t1 - t0) / cnt))
Time has gone by and no new ideas have popped up. I still cannot see use cases where significant throughput improvements are even probable, not to mention the considerable effort such a change would entail. So I take the liberty of closing the issue and labelling it "won't fix".
@se0siris Can you show some code on how to render a PDF with PyMuPDF in PyQt?
I have a PyQt4 application that is currently using Poppler (via python-poppler-qt4) and have ported it over to use PyMuPDF as a test. PDF loading and rendering is noticeably faster with PyMuPDF for the most part, but when generating thumbnails inside threads it seems only one thread is able to run at a time. With Poppler they work concurrently and the thumbnail generation in my application is faster as a result.
I'm wondering if it's possible to release the GIL, specifically when calling Document.getPagePixmap(), so that batch operations using threads can receive a speed boost?