metachris / pdfx

Extract text, metadata and references (pdf, url, doi, arxiv) from PDF. Optionally download all referenced PDFs.
http://www.metachris.com/pdfx
Apache License 2.0

TypeError: '<' not supported between instances of 'tuple' and 'int' #32

Closed sarora closed 3 years ago

sarora commented 5 years ago

I get an error when passing a URL to the PDFx constructor.

Here is the traceback:

  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/__init__.py", line 127, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/backends.py", line 167, in __init__
    doc = PDFDocument(parser, password=password, caching=True)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 558, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 782, in read_xref_from
    xref.load(parser)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 235, in load
    (_, stream) = parser.nextobject()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 582, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 508, in nexttoken
    self.fillbuf()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 232, in fillbuf
    if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'

sarora commented 5 years ago

Similar bug here https://github.com/pdfminer/pdfminer.six/issues/89
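For readers unfamiliar with the message: Python 3 refuses to order a tuple against an int, so a comparison like the one on the last traceback line raises as soon as the left operand is a tuple. The snippet below is only a minimal sketch of that behaviour. The function name and the sample values are made up for illustration, and it says nothing about why pdfminer ends up with a tuple there.

```python
def compare_charpos(charpos, buf):
    """Run the same kind of comparison as in the traceback.

    Returns the comparison result, or the TypeError message if the
    operands cannot be ordered. The names are borrowed from the
    traceback purely for illustration.
    """
    try:
        return charpos < len(buf)
    except TypeError as exc:
        return str(exc)


# int vs int: an ordinary comparison
print(compare_charpos(0, b"%PDF"))

# tuple vs int: Python 3 raises TypeError, whose message is returned here
print(compare_charpos((0, 0), b"%PDF"))
```

The second call reproduces the wording of the error reported in this issue.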

sarora commented 5 years ago

When I apply the fix suggested in the issue above, a new error appears:

TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword'

  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/__init__.py", line 127, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/backends.py", line 167, in __init__
    doc = PDFDocument(parser, password=password, caching=True)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 558, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 782, in read_xref_from
    xref.load(parser)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 235, in load
    (_, stream) = parser.nextobject()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 624, in nextobject
    self.do_keyword(pos, token)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfparser.py", line 77, in do_keyword
    (objid, genno) = (int(objid), int(genno))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'PSKeyword'

sarora commented 5 years ago

Along with this one:

 line 127, in __init__
    self.reader = PDFMinerBackend(self.stream)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfx/backends.py", line 167, in __init__
    doc = PDFDocument(parser, password=password, caching=True)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 558, in __init__
    self.read_xref_from(parser, pos, self.xrefs)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 782, in read_xref_from
    xref.load(parser)
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/pdfdocument.py", line 235, in load
    (_, stream) = parser.nextobject()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 582, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 508, in nexttoken
    self.fillbuf()
  File "/home/siddhartha/anaconda3/envs/straikit/lib/python3.6/site-packages/pdfminer/psparser.py", line 238, in fillbuf
    raise PSEOF('Unexpected EOF')

Crescentz commented 5 years ago

  File "/home/zy/miniconda3/envs/py36pc2t/lib/python3.6/site-packages/pdfminer/pdfinterp.py", line 248, in fillbuf
    if self.charpos < len(self.buf):
TypeError: '<' not supported between instances of 'tuple' and 'int'

I also have this problem.

pietermarsman commented 5 years ago

This is fixed in pdfminer.six by https://github.com/pdfminer/pdfminer.six/pull/134

metachris commented 3 years ago

Fixed in v1.4.1, thanks!
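To check whether an environment already has the release mentioned above, something like the following could be used. It is a rough sketch: `version_tuple` and `has_fix` are hypothetical helpers written for this example, not part of pdfx, and the parsing only handles plain dotted version numbers. It assumes Python 3.8+ for `importlib.metadata`.

```python
import re
from importlib import metadata  # standard library, Python 3.8+

FIXED_IN = "1.4.1"  # release named in the comment above


def version_tuple(version):
    """Turn '1.4.1' into (1, 4, 1); suffixes such as 'rc1' are ignored."""
    return tuple(int(n) for n in re.findall(r"\d+", version)[:3])


def has_fix(installed, fixed=FIXED_IN):
    """True if the installed version is at least the fixed release."""
    return version_tuple(installed) >= version_tuple(fixed)


try:
    installed = metadata.version("pdfx")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("pdfx is not installed in this environment")
else:
    print(f"pdfx {installed}: fix included = {has_fix(installed)}")
```

If the check reports an older version, upgrading pdfx to 1.4.1 or later should pull in the fix.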