Closed dhdaines closed 2 months ago
I can reproduce this with:
$ python tools/pdf2txt.py 222-2008-zonage-annexe-c-carte-25b-innond.pdf
Traceback (most recent call last):
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 318, in <module>
    sys.exit(main())
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 312, in main
    outfp = extract_text(**vars(parsed_args))
  File "/home/pieter/projects/pdfminer-upstream/tools/pdf2txt.py", line 63, in extract_text
    pdfminer.high_level.extract_text_to_fp(fp, **locals())
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/high_level.py", line 133, in extract_text_to_fp
    interpreter.process_page(page)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 997, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1016, in render_contents
    self.execute(list_value(streams))
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/pdfinterp.py", line 1027, in execute
    (_, obj) = parser.nextobject()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 601, in nextobject
    (pos, token) = self.nexttoken()
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 518, in nexttoken
    self.charpos = self._parse1(self.buf, self.charpos)
  File "/home/pieter/projects/pdfminer-upstream/pdfminer/psparser.py", line 467, in _parse_string_1
    self._curtoken += bytes((int(self.oct, 8),))
ValueError: bytes must be in range(0, 256)
Yes. As you note in the PR, this happens when /A85 is the first filter in the list, since that is the encoding the raw stream data is actually in. It is very unlikely that it wouldn't be the first filter, since that makes no sense, but you never know with PDFs...
When an inline image uses the ASCII85Decode filter, its data can (and frequently does) contain the sequence "EI" at the end of a line. This confuses the PDF parser and can lead to a variety of weird symptoms, since it will attempt to parse the remaining image data as the rest of the containing content stream, which it most definitely is not. I just happened to notice this because the attached PDF has this problem; the symptom in this case is that the parser comes across the sequence "\611", which is a very invalid octal escape: 222-2008-zonage-annexe-c-carte-25b-innond.pdf
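The symptom can be reproduced in isolation. This is only a small sketch of the arithmetic behind the last frame of the traceback, not pdfminer code; the variable names are made up for illustration:

```python
# The parser collected the octal digits "611" after a backslash.
octal_digits = "611"

# 6*64 + 1*8 + 1 = 393, which does not fit in a single byte (0..255).
value = int(octal_digits, 8)

try:
    # Same construction as the failing line in psparser.py.
    bytes((value,))
    error = None
except ValueError as exc:
    error = str(exc)

print(value, error)
```

Any three-digit octal escape above "\377" overflows a byte in the same way, which is why stray image data misread as a string literal can trigger this error.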
The PDF spec is not tremendously helpful, but once you realize that an ASCII85Decode stream must end with "~>", it's obvious that we should be looking for that, and not "EI" followed by whitespace, in the case where this encoding is used. This should be as simple as checking if /A85 is in the image dictionary and then passing "~>" instead of "EI" to PDFContentParser.get_inline_data. I'll make a PR.
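To illustrate the idea, here is a rough standalone sketch. It does not use pdfminer at all: the function inline_image_data, its parameters, and the sample bytes are all hypothetical, and the real change would go through PDFContentParser.get_inline_data instead.

```python
import re
from typing import Optional, Sequence

# Names under which the ASCII85 filter may appear (abbreviated and full).
A85_NAMES = {"A85", "ASCII85Decode"}

# Naive terminator: "EI" followed by whitespace.
EI_THEN_SPACE = re.compile(rb"EI(?=\s)")


def inline_image_data(buf: bytes, start: int,
                      filters: Sequence[str]) -> Optional[bytes]:
    """Hypothetical helper: return the inline image data starting at `start`.

    Returns None if no terminator is found.
    """
    if filters and filters[0] in A85_NAMES:
        # ASCII85 data always ends with "~>", so search for that instead.
        end = buf.find(b"~>", start)
        return None if end < 0 else buf[start:end + 2]
    # Otherwise fall back to looking for "EI" followed by whitespace.
    m = EI_THEN_SPACE.search(buf, start)
    return None if m is None else buf[start:m.start()]


# Made-up data with "EI" at the end of a line inside the encoded image.
stream = b'Gb"0EI\nmore~>\nEI\nQ'

naive = inline_image_data(stream, 0, [])       # stops at the first "EI"
fixed = inline_image_data(stream, 0, ["A85"])  # stops at "~>"
print(naive, fixed)
```

With the naive search, the image is cut off at the embedded "EI" and everything after it would be parsed as content stream operators. Searching for "~>" keeps the whole encoded image together.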