pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0

Releasing the GIL in PyMuPDF operations #97

Closed se0siris closed 6 years ago

se0siris commented 7 years ago

I have a PyQt4 application that is currently using Poppler (via python-poppler-qt4) and have ported it over to use PyMuPDF as a test. PDF loading and rendering is noticeably faster with PyMuPDF for the most part, but when generating thumbnails inside threads it seems only one thread is able to run at a time. With Poppler they work concurrently and the thumbnail generation in my application is faster as a result.

I'm wondering if it's possible to release the GIL, specifically when calling Document.getPagePixmap(), so that batch operations using threads can receive a speed boost?

JorjMcKie commented 7 years ago

Hi @se0siris,

our C base library MuPDF is "threading-agnostic" (their wording), i.e. it does not itself provide multi-threading capabilities. Nonetheless, the MuPDF documentation contains hints and examples on how to support multi-threading - which we haven't yet taken the time to dive into. But your inquiry does motivate me to give it a try ;-)

In the meantime, some information you may find interesting or even useful:

The method doc.getPagePixmap() (equivalent to page.getPixmap()) is actually an aggregate of lower-level operations.

The iteration over all document pages (mat being a matrix like fitz.Matrix(0.2, 0.2) to create thumbnail pictures of 20% original size):

for page in doc:
    pix = page.getPixmap(matrix = mat, alpha = 0)

is just a short version of

for page in doc:
    # create DisplayList -----------------------------------------------
    r = page.rect                               # (+) the page mediabox
    dl = fitz.DisplayList(r)                    # create a DisplayList
    page.run(fitz.Device(dl), fitz.Identity)    # run page thru DisplayList
    page = None                                 # page no longer needed
    # generate Pixmap -----------------------------------------------
    r.transform(mat)                            # (+) get thumbnail size of mediabox
    ir = r.irect                                # (+) integer rectangle version
    pix = fitz.Pixmap(fitz.csRGB, ir, 0)        # (+) alloc pixmap (no transparency)
    pix.clearWith(255)                          # clear its memory to "white"
    dl.run(fitz.Device(pix, None), mat, r)      # run DisplayList to fill the pixmap

Running the above through all of the Adobe manual (1,310 pages) took just a little over 5 seconds on my machine. Here are the details:

total time: 5.02797 sec
DisplayList time: 1.92725 sec
render time: 2.22468 sec
pixmap time 82.5766% of total
DisplayList vs rendering: 86.6305%

This means that by separating DisplayList and Pixmap creation in a clever way, an almost 50% reduction of overall runtime might be achievable.

BTW: if all document pages have the same dimensions, the statements marked with (+) only need to be executed once - a minor (2%) but easily achievable improvement.

One option for attacking mass thumbnail creation could be: put all document access in a separate (Python) task, which performs the open and then starts generating display lists (say, limited by some count). Whenever the main task wants the next thumbnail, the corresponding display list is rendered, the pixmap returned and the display list deleted from the stack.
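The producer / consumer pattern just described can be sketched as follows. The MuPDF calls are replaced by stand-in functions (the names `make_display_list` and `render_display_list` are hypothetical; the real code would use `page.getDisplayList()` and pixmap rendering):

```python
import queue
import threading

# Stand-ins for the real MuPDF calls; hypothetical names, for illustration only.
def make_display_list(page_number):
    return ("display-list", page_number)

def render_display_list(dl):
    return ("pixmap", dl[1])

def produce_display_lists(page_count, out_queue):
    # All document access stays in this single producer task; the bounded
    # queue caps how many display lists are held at any one time.
    for pno in range(page_count):
        out_queue.put(make_display_list(pno))  # blocks when the queue is full
    out_queue.put(None)                        # sentinel: no more pages

dl_queue = queue.Queue(maxsize=8)              # "limited by some count"
producer = threading.Thread(target=produce_display_lists, args=(5, dl_queue))
producer.start()

thumbnails = []
while True:
    dl = dl_queue.get()
    if dl is None:
        break
    thumbnails.append(render_display_list(dl))  # render on demand, discard dl
producer.join()
```

This is only a sketch of the control flow, not of any actual PyMuPDF API.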

Let me think this through.

Please do provide more details on your requirements if / when available.

JorjMcKie commented 7 years ago

Findings

As mentioned in my previous post, creating pixmaps from pages involves mainly 2 steps per page, (1) creating a so-called "display list", (2) the actual rendering.

Display lists are a notion internal to MuPDF. They are more or less just the result of parsing a document page and provide a unified starting point for rendering and text extraction. Once a DL exists, the original page definition in the document never needs to be parsed again.

From my findings with several (PDF) example documents, the times required to perform these 2 steps are more or less of the same order of magnitude.

Whenever performance is an issue, a fairly easy action to take is creating display lists for (all) pages and then performing rendering / text extraction on these display lists only. To reduce coding effort, I have created new methods that represent these two steps: page.getDisplayList() and dl.getPixmap(). Here is one of my test scripts:

dl_tab = []                       # stores display lists
for page in doc:
    # create the DisplayLists for all pages -----------------------------------------
    dl = page.getDisplayList()
    dl_tab.append(dl)

for pno, dl in enumerate(dl_tab):
    # render page image --------------------------------------------------------
    pix = dl.getPixmap(mat)
    #pix.writePNG("img-%i.png" % pno)  # example use of the pixmaps

These two iterations each need about the same elapsed time per document (unless pixmaps are additionally saved as image files - commented out here).

So, creating and storing display lists in variable dl_tab could be performed as part of opening the document. Any rendering / text extraction afterwards would just use one of these display lists (their index obviously equals their page number) and thus need only 50% of the time of a full page.getPixmap() .

The dl_tab store of all pages of course leads to increased memory requirements. This should be around 200 MB, or less, for a file as large as the Adobe manual (1,310 pages).

What about multithreading?

Savings potential

MuPDF does allow multithreading - to a certain, limited extent.

  1. Functions directly accessing the same document must run in the same thread where the document has been opened. In our case, this pertains to page.getDisplayList() - these cannot be distributed across multiple threads.

  2. Rendering and text extraction only access display lists (not the document), and therefore are multithreading candidates.

This means that only 50% of the overall processing time is available for multithreading optimization - if we disregard the time to store the pixmaps in image files. Assuming 4 (available!) CPU cores, these 50% would ideally be brought down to 50/4 %; a more realistic figure is a 2/3 saving.

Let me clarify this by using the Adobe manual example.

This might result in the following total elapsed time:

2.5 + 2.5 * 0.33 + 4.0 * 0.33 = 4.65 sec

as opposed to 9 sec without multithreading - a saving of 48% elapsed time.
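The estimate above can be reproduced with a few lines of arithmetic. The 2.5 / 2.5 / 4.0 second split is taken from the discussion (the 4.0 s component presumably being image-file output); the factor 0.33 is the "realistic" residue with 4 cores (ideal would be 0.25):

```python
# Only rendering and image writing are parallelizable; display-list
# creation (document access) must stay in one thread.
t_display_list = 2.5    # serial part, seconds
t_render = 2.5          # parallelizable
t_write = 4.0           # parallelizable
parallel_factor = 0.33  # realistic residue with 4 cores (ideal: 1/4)

t_threaded = t_display_list + (t_render + t_write) * parallel_factor
t_serial = t_display_list + t_render + t_write   # 9.0 s without threading
saving = 1 - t_threaded / t_serial
print("threaded: %.2f s, saving: %.0f%%" % (t_threaded, saving * 100))
```

This is just Amdahl's-law style reasoning: the serial display-list phase puts a floor of 2.5 s under any threaded variant.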

MuPDF multithreading prerequisites

This is an example script invoking the utility:

import fitz
import time, sys
import subprocess

fname = sys.argv[1]
doc = fitz.open(fname)
pageCount = len(doc)
doc.close()

# build page ranges of 101 pages each, e.g. "0-100", "101-201", ...
pranges = []
a = 0
while a < pageCount:
    pranges.append("%i-%i" % (a, a + 100))
    a += 101

cmd = "mutool draw -o img-%%d.png %s %s"
t0 = time.clock()
tasks = []
for r in pranges:
    # one "mutool draw" process per page range, running concurrently
    t = subprocess.Popen(cmd % (fname, r), shell=True)
    tasks.append(t)
for t in tasks:
    t.wait()
t1 = time.clock()
t_time = t1 - t0
print("total time: %g sec" % t_time)
print("-".rjust(80, "-"))
JorjMcKie commented 7 years ago

@rk700 - I would value any comments you may have on this. Do you have any idea how to implement multithreading on Mac OS? Does it support the pthreads "standard"? I have seen the open source project "pthreads-win32" for pthreads on Windows, but I haven't used it and I do not know at all whether it would be source-portable to other platforms supporting pthreads, etc.

rk700 commented 7 years ago

AFAIK, OSX uses its own thread model. And as you mentioned, pthreads is not supported natively on Windows, and some open source implementations may be unstable.

I'd like to use some other standard like C++11 threads, though I'm not sure if it is available on Windows.

JorjMcKie commented 7 years ago

Fully agree. VC++ (contained in Visual Studio 2015 and up) does support many C++11 features, and threading is among them! But before we do anything, we need to be sure about Mac OSX, too.

If we support threading some day, we need to offer our users an option whether they want to include threading or not. Otherwise we would force everyone to have a compiler supporting C++11, right?

rk700 commented 7 years ago

Clang 3.3 and later implement c++11, so it's fine on OSX.

And yes, we'll have to allow users to compile PyMuPDF either with multi-threading or without (maybe via macros).

se0siris commented 7 years ago

Hey guys,

Thanks for the replies on this - it's interesting to see the inner workings described at a lower level!

Reading back my original post I see that I could have been clearer and given more details. I'm generating thumbnails from multiple single-page PDF files that have already been extracted from a multi-page PDF. They're stored as single pages mainly to make the initial loading of the "document" faster in my application over slow network connections. Using your example of the Adobe PDF reference doc, the multi-page PDF is 30.9 MB whereas the first page of the split document is only 13 KB. The total size of the split document is actually larger than the original multi-page version, but when flitting between multiple documents it's worth it for this use, especially given that most of the time a user won't be viewing every page of a document anyway.

When a user selects a document in my application the filepaths for each page are pulled from a database and added to a Queue. I then fire up a thread for each CPU core where helper functions for each supported file type are called to load the image, convert to a QImage thumbnail (for displaying in the Qt based GUI) and emit that thumbnail back to the main thread for displaying in a scrollable list.

Given that the thumbnail creation is broken down into a separate function for each supported image type it's easy enough for me to swap out the current version that uses Poppler with one that uses PyMuPDF. Here's a simplified version of what each thread is doing to process a PDF file:

def load_pdf(filepath):
    doc = fitz.Document(filepath)

    if not doc:
        image_object = QImage()
    else:
        matrix = fitz.Matrix(1, 1).preScale(0.1, 0.1)
        pix = doc.getPagePixmap(0, matrix=matrix, alpha=False)
        image_object = QImage(pix.samples, pix.width, pix.height, pix.width * pix.n,
                              QImage.Format_RGB888)
        doc.close()
    return image_object

I ran 3 tests with this code processing the Adobe document using a single thread, then again using 4 threads:

PyMuPDF (single thread)
Thumbnails loaded in 17.31800 seconds
Thumbnails loaded in 17.82800 seconds
Thumbnails loaded in 17.49500 seconds

PyMuPDF (4 threads)
Thumbnails loaded in 17.26200 seconds
Thumbnails loaded in 17.40600 seconds
Thumbnails loaded in 17.25200 seconds

The results are pretty comparable and show that there's no performance benefit to be gained from using multiple threads here.
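For reference, the thread / queue pattern described above looks roughly like this (a sketch only: `load_pdf` is a stand-in for the PyMuPDF / Qt version shown earlier, and the file names are hypothetical):

```python
import os
import queue
import threading

def load_pdf(filepath):
    # Stand-in for the real loader; the real version would open the file
    # with fitz (or Poppler) and build a QImage thumbnail.
    return "thumbnail:" + filepath

def worker(jobs, results):
    while True:
        try:
            path = jobs.get_nowait()
        except queue.Empty:
            return
        # In the Qt application the thumbnail would be emitted back to the
        # main thread via a signal instead of a result queue.
        results.put(load_pdf(path))

jobs, results = queue.Queue(), queue.Queue()
for path in ["page-%03i.pdf" % i for i in range(8)]:  # hypothetical names
    jobs.put(path)

# one thread per CPU core, as described above
threads = [threading.Thread(target=worker, args=(jobs, results))
           for _ in range(os.cpu_count() or 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

thumbnails = []
while not results.empty():
    thumbnails.append(results.get())
thumbnails.sort()
```

With a GIL-holding loader the threads in this pattern simply take turns, which matches the flat timings reported above.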

The function for Poppler is pretty much the same, but there's a significant speed boost when using threads:

def load_pdf(filepath):
    doc = Poppler.Document.load(filepath)

    if not doc:
        image_object = QImage()
    else:
        page = doc.page(0)
        image_object = page.renderToImage(40, 40)  # DPI = 40
    return image_object

Poppler (single thread)
Thumbnails loaded in 21.03600 seconds
Thumbnails loaded in 21.07000 seconds
Thumbnails loaded in 21.66500 seconds

Poppler (4 threads)
Thumbnails loaded in 6.46600 seconds
Thumbnails loaded in 6.76100 seconds
Thumbnails loaded in 6.78300 seconds

In addition to the speed improvement, my application's GUI remains functional and usable while generating the thumbnails, allowing me to show a progress bar and update other GUI areas, whereas PyMuPDF locks the GUI until all thumbnails are returned.

My (limited) understanding of this is that Python does not support "true" multithreading and uses the GIL to block one thread while another is running. The exception being that when calling a C library the GIL can be released (within the C code), allowing control to return to Python while the C code carries on with its thing. I'm guessing this is something the Poppler library is doing, which would also explain why my application's GUI isn't locked while generating the thumbnails.

My hope is that it would be possible to simply add a "release GIL" wrapper around any parts of the C code that could make use of Python threads without much effort, but this is something I know nothing about. There seems to be something in the generated SWIG code starting at the point linked below, but I don't know whether that means it's being used or not.

https://github.com/rk700/PyMuPDF/blob/39cee218dc3c259decadaf40b1f5a55b3682989c/fitz/fitz_wrap.c#L1029

I'm guessing that if this were possible it would still require PyMuPDF to be compiled with multithreading enabled, which looks to be the main hurdle discussed in your comments above - although things are looking positive!

I'm sorry for not being of much help - I realise I could check out the code and attempt to build things myself, but I have zero knowledge of C-related things, so there would be a lot to learn first! I figured if I were to play around with things it would be worth at least asking whether it's doable.

JorjMcKie commented 7 years ago

Never mind - don't be shy asking / requesting stuff. The worst thing happening could be a "no" :-))

But don't worry, I'm a retired guy living in a beautiful country with lots of time ... If I say "no" then only because it's either beyond my capabilities or beyond what can be done with MuPDF.

First of all thank you for the background. Still making sure I understand every detail, though.

Apparently,

If this is correct so far, I will need to check whether MuPDF can be made to support this.

Just a little bit of background at this point:

What we actually seem to need for your case:

Another aspect is, what (if anything) must be done to get SWIG supporting this ...

A side question: you wrote you are splitting up PDFs into 1-pagers. Why don't you create PNG thumbnail images right away as part of that process, and then have Qt read those PNGs?

JorjMcKie commented 7 years ago

Another thing I came across:

Poppler obviously has its own special / direct interface to Qt: QImage Poppler::Page::renderToImage as you mentioned.

I suspect this interface is responsible for the UI-non-blocking nature of the rendering. I.e. even if I find a way to multithread as described above: how / where do I pass the rendered pixmaps to?

@rk700 - any helpful experience with Qt or Poppler?

rk700 commented 7 years ago

@JorjMcKie Not really. Sorry.

But I agree that the global context may be what prevents multi-threading. Currently it is a global variable initialized when the module is loaded; we might somehow make it possible for threads to create their own contexts.

And SWIG itself supports creating multi-threaded interfaces by simply adding the option -threads. I've tried regenerating the fitz_wrap.c file, and now every callback into the MuPDF library is wrapped with macros:

@@ -9537,7 +9660,11 @@ SWIGINTERN PyObject *_wrap_Document__getPageObjNumber(PyObject *SWIGUNUSEDPARM(s
   } 
   arg2 = (int)(val2);
   {
-    result = (PyObject *)fz_document_s__getPageObjNumber(arg1,arg2);
+    {
+      SWIG_PYTHON_THREAD_BEGIN_ALLOW;
+      result = (PyObject *)fz_document_s__getPageObjNumber(arg1,arg2);
+      SWIG_PYTHON_THREAD_END_ALLOW;
+    }
     if(!result)
     {

I guess the above macros would release the GIL. And here's what I found for reference: https://docs.python.org/3/c-api/init.html#thread-state-and-the-global-interpreter-lock

It still has to be tested on all platforms, though.

rk700 commented 7 years ago

MuPDF supports rendering pages in multiple threads. Here's an example, though it only shows how to render multiple pages of a single document: https://github.com/muennich/mupdf/blob/master/docs/examples/multi-threaded.c

We can see that the global context is cloned in each thread by calling fz_clone_context(). But in PyMuPDF, the global context is invisible to users and there's no way to choose which context to use. A simple workaround would be to let users create contexts all by themselves, but lots of interfaces would have to be changed then.

JorjMcKie commented 7 years ago

@rk700 - thanks for the hint. I have tried the -threads option right away and it does work on Windows, too. In one of my test scripts I am using Python threading to create pixmaps - and this script now crashes (because of missing MuPDF thread support). So this was a success ^^.

Maybe @se0siris, you can try the same and simply see what happens, i.e. what type of crash you are getting.

In the meantime I will try to generate a threading version of PyMuPDF ...

JorjMcKie commented 7 years ago

@rk700 After looking around for a while, I now also believe that multiple global contexts ("GCs") should be the solution.

  1. MuPDF threading only solves a small set of parallelization requirements, as has been mentioned before: working directly with a document has to happen in one single (and hence the main) thread. So I do not see many more benefits of this alternative than rendering images from display lists that must have been created beforehand (in the main thread). @rk700: Can you think of anything else?
  2. I am not sure what the impact of fz_clone_context() is in terms of runtime. But it obviously means copying the whole existing context into a new area, except for the exception stack, which is specific to a thread.
  3. Requirements like those of @se0siris cannot be satisfied with this option.

In contrast, using separate GCs would allow for much more flexibility:

  1. practically equivalent to starting several copies of PyMuPDF,
  2. @se0siris' requirements would be satisfied,
  3. MuPDF-threading could still be implemented in each GC (if it ever becomes necessary).

So my current thinking is not to go for MuPDF threading, but for multiple GCs instead.

I am not certain about the best approach yet. What do you think about this:

The interesting part is: how should the user / programmer be able to switch between GCs?

We would also have to prevent different tasks from using the same GC in some way - because we need the -threads option, this type of error is possible. We could use MuPDF threading to prevent this, using the GC as a mutex resource. The GC structure contains an id and also a user field that could be used for this.

Maybe the easiest way to solve this: restrict creating additional GCs to new Python threads. The pseudocode running a thread would be:

gcid = fitz.newGC()            # allocate a new global context
fitz.setGC(gcid)               # use it under the variable `gctx`
doc = fitz.open(...)           # normal thread ...
pix = doc.getPagePixmap(...)   # ... specific processing
doc.close()                    # close document
image_object = ....            # create value to be returned
fitz.delGC(gcid)               # release GC and set `gctx = GC0`
return image_object            # return result
JorjMcKie commented 7 years ago

Hm - the above will probably not be feasible, because "gctx" will be overwritten by parallel tasks ... I'm afraid we have to change all code to no longer use "gctx" but instead look up the relevant GC, whose id must be passed in with every call: instead of fitz.open(filename) we would need fitz.open(filename, gc = gcid), etc. Hm.

JorjMcKie commented 7 years ago

@rk700 I have tried a few things in the meantime, but I wasn't successful along the aforementioned model. I see the following requirements for any potential solution:

I am stuck finding a good way to store the current ctx address. One way could be to use a new attribute Document.context. This could be either an integer (problematic) or a PyCapsule object. Every method of Document and every dependent object would then extract the context pointer from its Document.context and use it in place of the current variable gctx.

Whatever I try, there seems to be no way to make ~fz_document_s() accept parameters deriving from Document.context.

Can you think of another way? What about using a global table (let's say limited to 10 entries) of contexts (which would then need some threading-safe [!!!] maintenance functions for filling, selecting, emptying, deleting ...)?

rk700 commented 7 years ago

How about using TLS (Thread Local Storage)? I know that TLS can be used to store per-thread global variables, though I haven't used it yet.

I'll take a deeper look at it too.
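For illustration, the Python-level analogue of TLS is threading.local(): each thread sees only its own attribute values, which is the same idea as storing a per-thread MuPDF context in TLS (the `ctx` attribute here is just a stand-in string, not a real context):

```python
import threading

# threading.local() gives every thread an independent attribute namespace --
# the same idea as keeping one MuPDF context per thread in TLS.
tls = threading.local()
seen = {}

def worker(name):
    tls.ctx = "context-for-" + name   # each thread sets "its own" context
    seen[name] = tls.ctx              # other threads never see this value

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the threads finish, the main thread has never set `tls.ctx`, so the attribute does not even exist there.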

rk700 commented 7 years ago

And @JorjMcKie could you please show me the Python script for the multi-thread testing?

rk700 commented 7 years ago

I read MuPDF's multi-thread example again, and I think we may have to call fz_clone_context() instead of creating a totally new context for each thread. Some important resources such as the store and the glyph cache can be shared among threads by using the cloned context.

And in order to minimize changes to our code, we could add some wrapper functions which look like the following:

def renderPage(pageNumbers=[0,1,2], threads=4):
    pass

in which the C multi-threading code would be called.

https://www.mupdf.com/docs/overview
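From the Python side, such a wrapper might behave like the sketch below. Everything here is hypothetical (the function name, the stand-in `_render_one`); the envisaged real implementation would do the fan-out in C, with each thread using fz_clone_context() as in MuPDF's multi-threaded.c example:

```python
from concurrent.futures import ThreadPoolExecutor

def _render_one(page_number):
    # Stand-in: the real version would render the page from a display list
    # inside a C thread holding a cloned MuPDF context.
    return ("pixmap", page_number)

def renderPages(page_numbers=(0, 1, 2), threads=4):
    # Fan the page numbers out over a pool and return the pixmaps in
    # input order, so callers can match results to pages by index.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(_render_one, page_numbers))

pixmaps = renderPages((0, 1, 2, 3))
```

The point of the wrapper is that the caller stays single-threaded: it passes page numbers in and gets pixmaps back, with all concurrency hidden inside.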

JorjMcKie commented 7 years ago

@rk700 Here is one of my multithreading tests. It renders page 0 of all PDFs contained in a directory (to simulate what @se0siris is doing): test.txt This script works OK without the SWIG -threads parameter, but of course has the same runtime as a similar script without threading (about 5 sec for 150 PDFs on my machine). The interpreter crashes, however, when running it with a PyMuPDF generated using -threads.

JorjMcKie commented 7 years ago

If I understand you correctly, PyMuPDF should stay single-threaded from a Python perspective. Indeed, the code change for making the global context a parameter would be big! And, for single-file access, no speed improvement would result.

We should however offer a select few high-speed functions for e.g. page rendering. These functions would use MuPDF's C-level multi-threading. From a Python perspective, renderPage would simply be a method returning 4 pixmaps ... from a single document. The function would be coded much like the MuPDF example.

For simultaneous rendering of several files, another function would be needed, which on the C-level would contain open, render, close in each of its threads.

rk700 commented 7 years ago

I think I made a very serious mistake from the beginning. A context should be bound to a document instead of being used as a global variable.

As you mentioned above, maybe we'll have to postpone the initialization of context until a document is created.

rk700 commented 7 years ago

I just tried using TLS for the context, which would minimize changes to the current code. But the test results are not as expected.

  1. Use the original code, and time for 268 documents is 2.48191 seconds.
  2. Put gctx in TLS and initialize it when creating a document. Then add the -threads option to SWIG. Time for 268 documents is 3.29221 seconds.
  3. Time for a single document is 0.010867 seconds.

I'm a little confused and doubt whether the context is really the bottleneck.

JorjMcKie commented 7 years ago

Interesting discussion! I think we have to thank @se0siris for the inspiration his issue is generating ^^!

@rk700 - I would like to defend your original design decision: there are very few pieces of MuPDF that do not use a context - the full area of geometry (points, rectangles, matrices) is an example. But even working with pixmaps requires a context, whether reading from or writing to image files - without touching any document.

Joining (PDF) documents is an example of why not all documents can have their own context: these functions only work with a context common to both.

My above example about "threaded" processing of documents (@se0siris' requirement) would imply opening a document using a cloned context. I am not even sure whether that can work at all ...

JorjMcKie commented 7 years ago

I do not want to overstate the obvious, but all threading comes with overhead. And I would assume the Python-specific threading overhead is bigger than that of C-level threading.

But the OS threading overhead is also quite remarkable: I used subprocess.Popen to render the Adobe manual's 1,310 pages in independent processes. My machine has 8 cores, and I could see that 6 of them were in use for the rendering scripts. But instead of dividing the overall rendering runtime by 6, it was only divided by about 3: instead of 9 seconds the whole thing needed about 3. This was independent of whether I used my own script for rendering or mutool draw.

Of course there was overhead not attributable to threading itself: setting up the (Py-)MuPDF environment and opening the document 1,310 times ...

But still: the smaller the piece of work of a thread, the higher the threading overhead impact.
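The observation above can be quantified in a couple of lines (the 9 s / 3 s / 6 cores figures are the ones reported above):

```python
# 6 cores were busy, but elapsed time only dropped from ~9 s to ~3 s.
serial_time = 9.0
parallel_time = 3.0
cores_used = 6

speedup = serial_time / parallel_time   # 3x instead of the ideal 6x
efficiency = speedup / cores_used       # fraction of ideal linear scaling
print("speedup: %.1fx, efficiency: %.0f%%" % (speedup, efficiency * 100))
```

An efficiency of 50% is consistent with substantial per-process setup cost (opening the document, initializing MuPDF) on top of the actual rendering work.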

JorjMcKie commented 7 years ago

I have done a number of further tests in the meantime. I was using my set of 152 PDFs - all scientific magazines with sizes ranging from 8 MB to 34 MB each (14 MB average), graphics-oriented. The main objective was to get an understanding of the threading overhead and of the time required by fitz.open() for non-trivial PDF files.

Note that the threading measurements were done with a non-threading PyMuPDF. This means:

If we ever support the global context as a parameter in PyMuPDF, the threading overhead will go up (e.g. because of global context allocation and de-allocation)!

Findings

fitz.open() times Python 3.6

fitz.open() times Python 2.7

Scripts Used

Non-Threading

import fitz
import time
import os

pdfdir = "."                  # directory containing the PDFs (adjust as needed)
pdfs = os.listdir(pdfdir)
t0 = time.clock()
cnt = 0
for i, pdf in enumerate(pdfs):
    if not pdf.endswith(".pdf"):
        continue
    cnt += 1
    pth = os.path.join(pdfdir, pdf)
    doc = fitz.open(pth)

t1 = time.clock()
print("time needed for %i documents: %g" % (cnt, t1-t0))
print("average time %g" % ((t1-t0)/cnt))

Threading

import fitz
import time
import os
import threading

pdfdir = "."                  # directory containing the PDFs (adjust as needed)
pdfs = os.listdir(pdfdir)

def worker(fname):
    doc = fitz.open(fname)

t0 = time.clock()
cnt = 0
threads = []
for i, pdf in enumerate(pdfs):
    if not pdf.endswith(".pdf"):
        continue
    cnt += 1
    pth = os.path.join(pdfdir, pdf)
    t = threading.Thread(target = worker, args = [pth])
    t.start()
    threads.append(t)

for t in threads:
    t.join()
t1 = time.clock()
print("time needed for %i documents: %g" % (cnt, t1-t0))
print("average time %g" % ((t1-t0)/cnt))
JorjMcKie commented 6 years ago

Time has gone by and no new ideas have popped up. I still cannot see use cases where significant throughput improvements are even probable, not to mention the significant effort that such a change would entail. So I take the liberty of closing this issue and labeling it "won't fix".

redstoneleo commented 6 years ago

@se0siris Can you show some code on how to render a PDF using PyMuPDF in PyQt?