izderadicka / pdfparser

Python binding to libpoppler with focus on text extraction

Crashes in Google Colab, am I using it correctly? #30

Closed by xjdeng 2 years ago

xjdeng commented 3 years ago

Hi,

I'm trying to find a fast way to extract text from PDFs. pdfminer is unacceptably slow for my use case, so I'm giving pdfparser a try. I set up a demo on Google Colab:

!sudo apt-get update
!sudo apt-get install -y libpoppler-private-dev libpoppler-cpp-dev
!pip install cython
!pip install git+https://github.com/izderadicka/pdfparser

import pdfparser.poppler as pdf
import requests, tempfile

def read_pdf(url):
    tempdir = tempfile.TemporaryDirectory()
    temppath = tempdir.name + "/tmp.pdf"
    res = requests.get(url)
    res.raise_for_status()
    with open(temppath,'wb') as f:
        for chunk in res.iter_content(100000):
            f.write(chunk)
    with open(temppath,'rb') as f:
        d=pdf.Document(f.read())

        print('No of pages', d.no_of_pages)
        for p in d:
            print('Page', p.page_no, 'size =', p.size)
            for f in p:
                print(' '*1,'Flow')
                for b in f:
                    print(' '*2,'Block', 'bbox=', b.bbox.as_tuple())
                    for l in b:
                        print(' '*3, l.text.encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)'% l.bbox.as_tuple())
                        #assert l.char_fonts.comp_ratio < 1.0
                        for i in range(len(l.text)):
                            print(l.text[i].encode('UTF-8'), '(%0.2f, %0.2f, %0.2f, %0.2f)'% l.char_bboxes[i].as_tuple(),
                                  l.char_fonts[i].name, l.char_fonts[i].size, l.char_fonts[i].color)
                        print()
        tempdir.cleanup()

read_pdf("https://arxiv.org/pdf/2104.00672.pdf")

However, the last line crashes with no explanation.

Am I using it correctly?

izderadicka commented 3 years ago

That is probably an OS page fault (a crash in native code). Can you check with gdb where it crashes? It is most likely caused by the installed version of libpoppler; see the README for instructions on using pdfparser with a locally compiled version of libpoppler.
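
Before attaching gdb, it can help to confirm that the failure really is a native crash and not a Python exception. The sketch below is one possible way to do that using only the standard library: it runs a snippet in a separate interpreter with `faulthandler` enabled, so a crash does not take the notebook kernel down with it. The helper names `run_isolated` and `describe` are hypothetical and are not part of pdfparser.

```python
import signal
import subprocess
import sys


def run_isolated(code, timeout=120.0):
    """Run `code` in a fresh interpreter with faulthandler enabled.

    If native code crashes, faulthandler writes the Python traceback of the
    crashing thread to stderr before the child process dies.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result):
    """Summarise how the child process ended.

    On POSIX, subprocess reports a negative return code when the child
    was terminated by a signal (for example -11 for SIGSEGV).
    """
    if result.returncode < 0:
        return "killed by " + signal.Signals(-result.returncode).name
    return "exited with code %d" % result.returncode


if __name__ == "__main__":
    # Replace this placeholder with the code that triggers the crash.
    result = run_isolated("print('ok')")
    print(describe(result))
    print(result.stderr)
```

If `describe` reports a signal such as SIGSEGV, the traceback in `result.stderr` shows which Python-level call was active at the time of the crash. That narrows down where to look with gdb, but it does not replace a native backtrace.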