pdfminer / pdfminer.six

Community maintained fork of pdfminer - we fathom PDF
https://pdfminersix.readthedocs.io
MIT License

Using cProfile with pdf2txt.py raises ValueError before writing profile results. #717

Open dwittal opened 2 years ago

dwittal commented 2 years ago

I have been troubleshooting a significant performance issue using PDFMiner to extract text from certain utility bills. While investigating, I attempted to use cProfile on pdf2txt.py to see what was going on:

python -m cProfile ./tools/pdf2txt.py any-pdf.pdf

Results:

Any pdf extracted text
Traceback (most recent call last):
  File "C:\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Python39\lib\cProfile.py", line 190, in <module>
    main()
  File "C:\Python39\lib\cProfile.py", line 179, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "C:\Python39\lib\cProfile.py", line 19, in runctx
    return _pyprofile._Utils(Profile).runctx(statement, globals, locals,
  File "C:\Python39\lib\profile.py", line 66, in runctx
    self._show(prof, filename, sort)
  File "C:\Python39\lib\profile.py", line 72, in _show
    prof.print_stats(sort)
  File "C:\Python39\lib\cProfile.py", line 42, in print_stats
    pstats.Stats(self).strip_dirs().sort_stats(sort).print_stats()
  File "C:\Python39\lib\pstats.py", line 422, in print_stats
    print(indent, self.total_calls, "function calls", end=' ', file=self.stream)
ValueError: I/O operation on closed file.

I looked in pdf2txt.py for the cause and found it quickly. In extract_text(), at line 53:

    if outfile == "-":
        outfp: AnyIO = sys.stdout
        if sys.stdout.encoding is not None:
            codec = "utf-8"
    else:
        outfp = open(outfile, "wb")

So when outfile is "-", outfp is a reference to sys.stdout. The extract_text() function then returns outfp to its caller, main(), at line 305:

def main(args: Optional[List[str]] = None) -> int:
    parsed_args = parse_args(args)
    outfp = extract_text(**vars(parsed_args))
    outfp.close()
    return 0

In my case, outfp.close() closes sys.stdout. Once the script finishes, cProfile tries to print the profile results to that same stream, and the write raises ValueError because the stream is already closed.
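The failure can be reproduced without pdfminer at all. The following is a minimal sketch, not code from pdf2txt.py: extract_text_stub() and main_closing() are hypothetical stand-ins for extract_text() and main(), and an io.StringIO object replaces the real standard output so the example does not close the terminal stream.

```python
import io
import sys
from contextlib import redirect_stdout


def extract_text_stub():
    """Hypothetical stand-in for extract_text() called with outfile == "-"."""
    outfp = sys.stdout
    outfp.write("Any pdf extracted text\n")
    return outfp


def main_closing() -> int:
    """Mirrors the shape of main(): close whatever stream came back."""
    outfp = extract_text_stub()
    outfp.close()  # this closes sys.stdout itself
    return 0


def reproduce() -> str:
    """Return the error message raised by printing after main_closing()."""
    fake_stdout = io.StringIO()  # assumption: stands in for the real stdout
    with redirect_stdout(fake_stdout):
        main_closing()
        try:
            # pstats reports via print(); here it targets the closed stream.
            print("function calls")
        except ValueError as exc:
            return str(exc)
    return "no error"


if __name__ == "__main__":
    sys.stderr.write(reproduce() + "\n")
```

The script reports on stderr so that the result is visible even when stdout has been replaced or closed.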

As a workaround, I modified the function to:

def main(args: Optional[List[str]] = None) -> int:
    parsed_args = parse_args(args)
    outfp = extract_text(**vars(parsed_args))
    if outfp is not sys.stdout:
        outfp.close()
    return 0

This allowed me to run the profiler without issue.
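For anyone preparing the PR, the guard could be factored into a small helper. This is only a sketch under my own assumptions: close_unless_stdout() is a hypothetical name, not an existing pdfminer.six function, and it uses an identity check (is) rather than != because the question is whether the two names refer to the same stream object.

```python
import io
import sys
from contextlib import redirect_stdout
from typing import IO, Any


def close_unless_stdout(outfp: IO[Any]) -> bool:
    """Close outfp unless it is sys.stdout. Return True if it was closed."""
    if outfp is sys.stdout:
        # Leave the shared stream open so later writers (such as the
        # cProfile report) can still use it.
        return False
    outfp.close()
    return True


if __name__ == "__main__":
    demo = io.StringIO()  # assumption: stands in for the real stdout
    with redirect_stdout(demo):
        close_unless_stdout(sys.stdout)
        print("profile report")  # still works, the stream was left open
    sys.stderr.write(demo.getvalue())
```

With a helper like this, main() would call close_unless_stdout(outfp) in place of outfp.close(). Files opened by extract_text() are still closed as before.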

pietermarsman commented 2 years ago

Agreed. PR with the suggested change is welcome here.