I have been troubleshooting a significant performance issue using PDFMiner to extract text from certain utility bills. While investigating, I attempted to use cProfile on pdf2txt.py to see what was going on:
python -m cProfile ./tools/pdf2txt.py any-pdf.pdf
Results:
Any pdf extracted text
Traceback (most recent call last):
  File "C:\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\lib\cProfile.py", line 190, in <module>
    main()
  File "C:\Python39\lib\cProfile.py", line 179, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "C:\Python39\lib\cProfile.py", line 19, in runctx
    return _pyprofile._Utils(Profile).runctx(statement, globals, locals,
  File "C:\Python39\lib\profile.py", line 66, in runctx
    self._show(prof, filename, sort)
  File "C:\Python39\lib\profile.py", line 72, in _show
    prof.print_stats(sort)
  File "C:\Python39\lib\cProfile.py", line 42, in print_stats
    pstats.Stats(self).strip_dirs().sort_stats(sort).print_stats()
  File "C:\Python39\lib\pstats.py", line 422, in print_stats
    print(indent, self.total_calls, "function calls", end=' ', file=self.stream)
ValueError: I/O operation on closed file.
I went looking in pdf2txt.py for a reason this might happen, and it was fairly obvious. In extract_text() line 53:
    if outfile == "-":
        outfp: AnyIO = sys.stdout
        if sys.stdout.encoding is not None:
            codec = "utf-8"
    else:
        outfp = open(outfile, "wb")
So at this point outfp is a reference to STDOUT (sys.stdout). The extract_text() function then returns outfp to the caller. The caller is the function main(), on line 305, which goes on to call outfp.close() on the returned stream.
In my case, that outfp.close() call closes the STDOUT stream. cProfile then attempts to write the profile results to the same stream, and the ValueError is raised because the stream is already closed.
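The sequence can be reproduced without PDFMiner at all. The following is a minimal sketch, not the pdf2txt.py code: the function name reproduce_closed_stdout_error is hypothetical, and an io.StringIO stands in for the real STDOUT so that running it does not close the actual console stream.

```python
import contextlib
import io
import sys


def reproduce_closed_stdout_error() -> str:
    """Mimic the failure: one piece of code closes stdout, a later one writes to it.

    Hypothetical helper for illustration only; returns the error message,
    or "no error" if the write unexpectedly succeeds.
    """
    fake_stdout = io.StringIO()  # stand-in for the real sys.stdout
    with contextlib.redirect_stdout(fake_stdout):
        outfp = sys.stdout  # same assignment as the outfile == "-" branch
        outfp.write("Any pdf extracted text\n")
        outfp.close()  # the caller closes the handle it got back
        try:
            # A later writer (the profiler's report, in the real case)
            # still believes stdout is usable.
            print("function calls", file=sys.stdout)
        except ValueError as exc:
            return str(exc)
    return "no error"


if __name__ == "__main__":
    print(reproduce_closed_stdout_error())
```

Any code that runs after the script and still expects to print to STDOUT will hit the same error, so this is not specific to cProfile.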
As a workaround, I modified the function so that it no longer closes STDOUT.
This allowed me to run the profiler without issue.
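A guard of roughly the following shape is one way to get that behaviour. It is a minimal sketch under assumptions rather than the exact change: close_unless_stdout is a hypothetical helper name and is not part of PDFMiner.

```python
import io
import sys
from typing import IO, Any


def close_unless_stdout(outfp: IO[Any]) -> bool:
    """Close outfp unless it is the interpreter's stdout.

    Hypothetical helper for illustration. Returns True if the stream was
    closed, False if it was left open because it is sys.stdout.
    """
    if outfp is sys.stdout:
        # Leave stdout open for whoever writes next, but push out
        # anything that is still buffered.
        outfp.flush()
        return False
    outfp.close()
    return True


if __name__ == "__main__":
    buf = io.StringIO()
    print(close_unless_stdout(buf), buf.closed)
    print(close_unless_stdout(sys.stdout))
```

The identity check (is) is deliberate: it only skips the close when the handle really is the current sys.stdout object, so files opened via the else branch are still closed as before.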