inline image parsing fails when stream data contains "EI\n"

dhdaines commented 2 months ago

When an inline image uses the ASCII85Decode filter, its data can (and frequently does) contain the sequence "EI" at the end of a line. This confuses the PDF parser, which takes that "EI" as the end of the image and then tries to parse the remaining image data as if it were the rest of the containing content stream, which it most definitely is not. That can lead to a variety of weird symptoms. I only noticed because the attached PDF has this problem (the symptom here is that the parser comes across the sequence "\611", which is a very invalid octal escape):
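As a small illustration of why this happens, the sketch below uses Python's standard `base64` module rather than anything from pdfminer.six. The five characters "EIEIE" are an arbitrary choice for the demonstration. "E" and "I" are both ordinary ASCII85 digits, so "EI" can show up anywhere in encoded data, whereas "~" lies outside the ASCII85 alphabet and so can only appear in the terminator.

```python
import base64

# Any 5 ASCII85 digits decode to 4 bytes; "EIEIE" is picked purely for illustration.
payload = base64.a85decode(b"EIEIE")

# adobe=True frames the output with "<~" and "~>". PDF streams carry only the
# trailing "~>", but the encoded body is the same.
encoded = base64.a85encode(payload, adobe=True)

contains_ei = b"EI" in encoded
ends_with_eod = encoded.endswith(b"~>")
print(encoded, contains_ei, ends_with_eod)
```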

222-2008-zonage-annexe-c-carte-25b-innond.pdf

The PDF spec is not tremendously helpful, but once you realize that ASCII85Decode data must end with "~>", it's obvious that we should be looking for that, and not "EI" followed by whitespace, when this encoding is used. This should be as simple as checking whether the image dictionary specifies the /A85 filter (the inline-image abbreviation of ASCII85Decode) and, if so, passing "~>" instead of "EI" to PDFContentParser.get_inline_data. I'll make a PR.
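A minimal sketch of the idea, assuming the filter names have already been read from the image dictionary. `split_inline_image` is a hypothetical standalone helper written for this illustration; it is not the pdfminer.six implementation and does not call PDFContentParser.get_inline_data.

```python
import re

# "EI" followed by whitespace or end of input: the naive end-of-image test.
_NAIVE_EI = re.compile(rb"EI(?=\s|$)")
# Optional whitespace, then "EI": what should follow the ASCII85 "~>" marker.
_EI_AFTER_EOD = re.compile(rb"\s*EI(?=\s|$)")


def split_inline_image(buf, filters):
    """Split buf into (image data, rest of the content stream).

    Hypothetical helper: buf starts right after the ID operator and
    filters is the list of filter names from the image dictionary.
    """
    first = filters[0] if filters else None
    if first in ("A85", "ASCII85Decode"):
        # ASCII85 data ends with "~>", so search for that instead of "EI".
        eod = buf.find(b"~>")
        if eod == -1:
            raise ValueError("no ~> end-of-data marker found")
        data_end = eod + 2
        m = _EI_AFTER_EOD.match(buf, data_end)
        if m is None:
            raise ValueError("expected EI after ~>")
        return buf[:data_end], buf[m.end():]
    # Any other filter: fall back to the first "EI" followed by whitespace.
    m = _NAIVE_EI.search(buf)
    if m is None:
        raise ValueError("no EI operator found")
    return buf[:m.start()], buf[m.end():]


# Made-up stream: ASCII85 text with "EI" at the end of its first line.
stream = b"9jqo^EI\nBlbD-~>\nEI\nQ\n"
good_data, good_rest = split_inline_image(stream, ["A85"])
bad_data, bad_rest = split_inline_image(stream, [])
```

With the filter taken into account, the image data runs up to and including "~>" and only the trailing `Q` operator is left for the content-stream parser. Without it, the data is cut off at the first "EI" and the leftover ASCII85 text is handed to the parser as if it were operators.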

pietermarsman commented 2 months ago

I can reproduce this with:

$ python tools/pdf2txt.py 222-2008-zonage-annexe-c-carte-25b-innond.pdf
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 318, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 312, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 133, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 601, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 518, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 467, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)
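
The last two lines of the traceback can be reproduced on their own, assuming the three octal digits collected by the parser are "611" as mentioned in the original report. This is a standalone snippet rather than pdfminer.six code:

```python
# "\611" is read as three octal digits, which is larger than any single byte.
value = int("611", 8)

message = None
try:
    bytes((value,))
except ValueError as exc:
    message = str(exc)

print(value, message)
```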

dhdaines commented 2 months ago

Yes. As you note in the PR, this applies when /A85 is the first filter in the list, since the first filter is the one applied to the raw stream bytes (it is very unlikely that it wouldn't be the first filter, since that makes no sense, but you never know with PDFs...)

pdfminer / pdfminer.six

inline image parsing fails when stream data contains "EI\n" #1008