mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

PermissionError WinError 32, cannot access file and ParseError invalid token #1778

Open zkfazal opened 1 day ago

zkfazal commented 1 day ago

Bug description

I double-checked the existing issues and could not find any related to the bug I ran into while copying the sample code from the Jupyter notebook that generates PDF/A files from docTR output (the link can be found here).

My main issue is with the "All Merged Into One PDF/A file" section, whose code is as follows:

# returns: list of tuple where the first element is the (bytes) xml string and the second is the ElementTree
xml_outputs = result.export_as_xml()

# you can also merge multiple pdfs into one

merger = PdfMerger()

with TemporaryDirectory() as tmpdir:
    for i, (xml, img) in enumerate(zip(xml_outputs, docs)):
        # write the images temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"),"w") as f :
            f.write(xml_outputs[i][0].decode())
        # Init hOCR transformer
        hocr = HocrTransform(hocr_filename=os.path.join(tmpdir, f"{i}.xml"), dpi=300)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(f'{tmpdir}/{i}.pdf')
    # Save as combined pdf
    merger.write(f'docTR-PDF.pdf')

I slightly modified the code to use with TemporaryDirectory(dir=os.getcwd()) and extracted the hocr_filename into its own variable, but these changes do not affect the problem I'm having, which is the following error: "PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Me\PycharmProjects\doctr-jupyter\tmpdbajkfk8\0.pdf'"

I am currently running this in a Jupyter notebook in PyCharm. The docTR setup from the initial demo (the code in your other sample notebook that sets up the basic model) works fine, and it can parse the text of the PDF I'm using, but once it reaches the "Init hOCR transformer" line when i=3 (so for the fourth page), it throws this error.

The PDF I'm using is a syllabus I got from a friend (attached as Syllabus.pdf). This one at least gets halfway. If I use a different PDF (attached as somatosensory.pdf, a sample PDF I obtained online), I get a different error: ParseError: not well-formed (invalid token): line 1, column 38224

somatosensory.pdf Syllabus.pdf

Code snippet to reproduce the bug

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
filename = 'somatosensory.pdf'
doc = DocumentFile.from_pdf(filename)
print(f"Number of pages: {len(doc)}")
# Can also do from_url and from_image
# doc = DocumentFile.from_pdf('C:/Users/Me/Downloads/Syllabus.pdf')
result = model(doc)
result.show()
# Now merge into one PDF/A file

# returns: list of tuple where the first element is the (bytes) xml string and the second is the ElementTree
xml_outputs = result.export_as_xml()

# you can also merge multiple pdfs into one

merger = PdfMerger()

with TemporaryDirectory(dir=os.getcwd()) as tmpdir:
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        # write the images temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"),"w") as f :
            f.write(xml_outputs[i][0].decode())
        # Init hOCR transformer
        the_hocr_filename=os.path.join(tmpdir, f"{i}.xml")
        hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=300)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(f'{tmpdir}/{i}.pdf')
    # Save as combined pdf
    merger.write(f'output-PDFA.pdf')

Error traceback

Traceback (most recent call last):

  File ~\miniconda3\Lib\xml\etree\ElementTree.py:1706 in feed
    self.parser.Parse(data, False)

ExpatError: not well-formed (invalid token): line 1, column 38224

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File ~\miniconda3\Lib\site-packages\IPython\core\interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Cell In[25], line 20
    hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=300)

  File ~\miniconda3\Lib\site-packages\ocrmypdf\hocrtransform\_hocr.py:110 in __init__
    self.hocr = ElementTree.parse(os.fspath(hocr_filename))

  File ~\miniconda3\Lib\site-packages\defusedxml\common.py:100 in parse
    return _parse(source, parser)

  File ~\miniconda3\Lib\xml\etree\ElementTree.py:1204 in parse
    tree.parse(source, parser)

  File ~\miniconda3\Lib\xml\etree\ElementTree.py:572 in parse
    parser.feed(data)

  File ~\miniconda3\Lib\xml\etree\ElementTree.py:1708 in feed
    self._raiseerror(v)

  File ~\miniconda3\Lib\xml\etree\ElementTree.py:1615 in _raiseerror
    raise err

  File <string>
ParseError: not well-formed (invalid token): line 1, column 38224
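
One possible explanation for this ParseError, offered as an assumption rather than a confirmed diagnosis: on Windows, open(path, "w") without an `encoding` argument uses the locale's default encoding (often cp1252), while the exported hOCR bytes are UTF-8. A non-ASCII character would then be written as a byte sequence that is not valid UTF-8, and the XML parser would reject the file. The sketch below reproduces that mismatch with the standard library only. The sample XML string and the file names are hypothetical stand-ins for one page of `export_as_xml()` output, and cp1252 is passed explicitly to simulate the Windows default.

```python
import os
import xml.etree.ElementTree as ET
from tempfile import TemporaryDirectory

# Hypothetical stand-in for one page of export_as_xml() output:
# UTF-8 bytes that contain a non-ASCII character.
xml_bytes = "<?xml version='1.0' encoding='utf-8'?><p>café</p>".encode("utf-8")


def parses_cleanly(path):
    """Return True if the file at `path` is well-formed XML."""
    try:
        ET.parse(path)
        return True
    except ET.ParseError:
        return False


with TemporaryDirectory() as tmpdir:
    bad_path = os.path.join(tmpdir, "bad.xml")
    good_path = os.path.join(tmpdir, "good.xml")

    # Simulates open(path, "w") on a system whose default encoding is cp1252:
    # "é" is written as the single byte 0xE9, which is not valid UTF-8.
    with open(bad_path, "w", encoding="cp1252") as f:
        f.write(xml_bytes.decode())

    # Writing the original bytes unchanged avoids any re-encoding.
    with open(good_path, "wb") as f:
        f.write(xml_bytes)

    bad_ok = parses_cleanly(bad_path)
    good_ok = parses_cleanly(good_path)

print(bad_ok, good_ok)
```

If this is indeed the cause, writing `xml_outputs[i][0]` in binary mode, or opening the file with `encoding="utf-8"`, should make the same page parse. It would also explain why the script runs cleanly on Linux, where the default encoding is usually UTF-8.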

And for the PermissionError:

 ---------------------------------------------------------------------------
ExpatError                                Traceback (most recent call last)
File ~\miniconda3\Lib\xml\etree\ElementTree.py:1706, in XMLParser.feed(self, data)
   1705 try:
-> 1706     self.parser.Parse(data, False)
   1707 except self._error as v:

ExpatError: not well-formed (invalid token): line 1, column 28815

During handling of the above exception, another exception occurred:

ParseError                                Traceback (most recent call last)
Cell In[27], line 20
     19 the_hocr_filename=os.path.join(tmpdir, f"{i}.xml")
---> 20 hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=300)
     21 # Save as PDF/A

File ~\miniconda3\Lib\site-packages\ocrmypdf\hocrtransform\_hocr.py:110, in HocrTransform.__init__(self, hocr_filename, dpi, debug, fontname, font, debug_render_options)
    109 self.dpi = dpi
--> 110 self.hocr = ElementTree.parse(os.fspath(hocr_filename))
    111 self._fontname = fontname

File ~\miniconda3\Lib\site-packages\defusedxml\common.py:100, in _generate_etree_functions.<locals>.parse(source, parser, forbid_dtd, forbid_entities, forbid_external)
     94     parser = DefusedXMLParser(
     95         target=_TreeBuilder(),
     96         forbid_dtd=forbid_dtd,
     97         forbid_entities=forbid_entities,
     98         forbid_external=forbid_external,
     99     )
--> 100 return _parse(source, parser)

File ~\miniconda3\Lib\xml\etree\ElementTree.py:1204, in parse(source, parser)
   1203 tree = ElementTree()
-> 1204 tree.parse(source, parser)
   1205 return tree

File ~\miniconda3\Lib\xml\etree\ElementTree.py:572, in ElementTree.parse(self, source, parser)
    571 while data := source.read(65536):
--> 572     parser.feed(data)
    573 self._root = parser.close()

File ~\miniconda3\Lib\xml\etree\ElementTree.py:1708, in XMLParser.feed(self, data)
   1707 except self._error as v:
-> 1708     self._raiseerror(v)

File ~\miniconda3\Lib\xml\etree\ElementTree.py:1615, in XMLParser._raiseerror(self, value)
  1614 err.position = value.lineno, value.offset
-> 1615 raise err

ParseError: not well-formed (invalid token): line 1, column 28815

During handling of the above exception, another exception occurred:

PermissionError                           Traceback (most recent call last)
File ~\miniconda3\Lib\shutil.py:633, in _rmtree_unsafe(path, onexc)
    632 try:
--> 633     os.unlink(fullname)
    634 except OSError as err:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\Me\\PycharmProjects\\doctr-jupyter\\tmp9bu3mn8n\\0.pdf'

During handling of the above exception, another exception occurred:

PermissionError                           Traceback (most recent call last)
Cell In[27], line 11
      7 # you can also merge multiple pdfs into one
      9 merger = PdfMerger()
---> 11 with TemporaryDirectory(dir=os.getcwd()) as tmpdir:
     12     for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
     13         # write the images temp
     14         Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))

File ~\miniconda3\Lib\tempfile.py:946, in TemporaryDirectory.__exit__(self, exc, value, tb)
    944 def __exit__(self, exc, value, tb):
    945     if self._delete:
--> 946         self.cleanup()

File ~\miniconda3\Lib\tempfile.py:950, in TemporaryDirectory.cleanup(self)
    948 def cleanup(self):
    949     if self._finalizer.detach() or _os.path.exists(self.name):
--> 950         self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)

File ~\miniconda3\Lib\tempfile.py:930, in TemporaryDirectory._rmtree(cls, name, ignore_errors, repeated)
    927         if not ignore_errors:
    928             raise
--> 930 _shutil.rmtree(name, onexc=onexc)

File ~\miniconda3\Lib\shutil.py:781, in rmtree(path, ignore_errors, onerror, onexc, dir_fd)
    779     # can't continue even if onexc hook returns
    780     return
--> 781 return _rmtree_unsafe(path, onexc)

File ~\miniconda3\Lib\shutil.py:635, in _rmtree_unsafe(path, onexc)
    633             os.unlink(fullname)
    634         except OSError as err:
--> 635             onexc(os.unlink, fullname, err)
    636 try:
    637     os.rmdir(path)

File ~\miniconda3\Lib\tempfile.py:905, in TemporaryDirectory._rmtree.<locals>.onexc(func, path, exc)
    902 _resetperms(path)
    904 try:
--> 905     _os.unlink(path)
    906 except IsADirectoryError:
    907     cls._rmtree(path, ignore_errors=ignore_errors)

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\Me\\PycharmProjects\\doctr-jupyter\\tmp9bu3mn8n\\0.pdf'
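
Reading the chained traceback above, the PermissionError is raised while the TemporaryDirectory is being cleaned up after the ParseError, so it looks like a secondary failure: Windows refuses to delete a file that is still open in the same process. Which object holds 0.pdf open is an assumption here; a likely candidate is the PdfMerger, since 0.pdf was appended on the first iteration. The sketch below shows the general pattern of closing every handle before the directory is removed, even when an exception interrupts the loop. It uses plain file handles as stand-ins for the merger's inputs, and `build_with_failure` is a hypothetical name.

```python
import os
from contextlib import ExitStack
from tempfile import TemporaryDirectory


def build_with_failure():
    """Simulate a failure on a later page while earlier outputs are still open."""
    handles = []
    tmp_path = None
    try:
        with TemporaryDirectory() as tmpdir:
            tmp_path = tmpdir
            # Entered inside the TemporaryDirectory block, so everything
            # registered on the stack is closed before the directory is removed.
            with ExitStack() as stack:
                for i in range(3):
                    path = os.path.join(tmpdir, f"{i}.pdf")
                    with open(path, "wb") as f:
                        f.write(b"placeholder")
                    # Stand-in for a merger that keeps each appended input open.
                    handles.append(stack.enter_context(open(path, "rb")))
                    if i == 2:
                        raise ValueError("simulated parse failure on page 2")
    except ValueError as exc:
        return exc, handles, tmp_path


error, handles, tmp_path = build_with_failure()
print(type(error).__name__, all(h.closed for h in handles), os.path.exists(tmp_path))
```

In the notebook code, the equivalent would presumably be to register `merger.close` with `stack.callback(...)`, or to call `merger.close()` in a `finally` block inside the `with TemporaryDirectory(...)` body. That way the original ParseError surfaces on its own instead of being followed by a cleanup error.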

Environment

Collecting environment information...

DocTR version: 0.10.1a0
TensorFlow version: N/A
PyTorch version: 2.5.1+cpu (torchvision 0.20.1+cpu)
OpenCV version: 4.10.0
OS: Microsoft Windows 10 Pro
Python version: 3.12.7
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): No
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Deep Learning backend

is_tf_available: False
is_torch_available: True

felixdittrich92 commented 1 day ago

Hi @zkfazal :wave:,

Thanks for reporting.

I quickly tested it on my Linux machine without issues:

(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test.py
/home/felix/Desktop/doctr/doctr/models/utils/pytorch.py:59: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(archive_path, map_location="cpu")
Number of pages: 4
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ 
import os
from tempfile import TemporaryDirectory

from PyPDF2 import PdfMerger
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from PIL import Image
from ocrmypdf.hocrtransform import HocrTransform

model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
filename = '/home/felix/Desktop/somatosensory.pdf'
doc = DocumentFile.from_pdf(filename)
print(f"Number of pages: {len(doc)}")
# Can also do from_url and from_image
# doc = DocumentFile.from_pdf('C:/Users/Me/Downloads/Syllabus.pdf')
result = model(doc)
result.show()
# Now merge into one PDF/A file

# returns: list of tuple where the first element is the (bytes) xml string and the second is the ElementTree
xml_outputs = result.export_as_xml()

# you can also merge multiple pdfs into one

merger = PdfMerger()

with TemporaryDirectory(dir=os.getcwd()) as tmpdir:
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        # write the images temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"),"w") as f :
            f.write(xml_outputs[i][0].decode())
        # Init hOCR transformer
        the_hocr_filename=os.path.join(tmpdir, f"{i}.xml")
        hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=300)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(f'{tmpdir}/{i}.pdf')
    # Save as combined pdf
    merger.write(f'output-PDFA.pdf')

output-PDFA.pdf

DocTR version: 0.10.1a0
TensorFlow version: 2.18.0
PyTorch version: 2.5.0 (torchvision 0.20.0)
OpenCV version: 4.10.0
OS: Ubuntu 24.04.1 LTS
Python version: 3.10.14
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): Yes
CUDA runtime version: 12.6.77
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 560.35.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1

I will take a deeper look on Monday :+1:

At first glance, though, this looks to me like an issue with TemporaryDirectory on your Windows machine (hence the PermissionError).

zkfazal commented 1 day ago

Thanks @felixdittrich92 for testing it on your own machine as well. What's weird is that I've used different temporary directories as well, so I don't know why it would get a permission error on multiple temporary directories. It's Windows, so I'm not as familiar with permission and ownership management as compared to Linux. I'll also tinker around with different temporary directories.

felixdittrich92 commented 17 hours ago

Have you checked whether the issue still occurs without using a temp dir?
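
One way to try this, sketched under assumptions: create the working directory with `tempfile.mkdtemp()`, which is not removed automatically, so the intermediate files survive a failure and the bytes near the column reported by the parser can be inspected. The helper `bytes_around` is hypothetical, the sample file is a stand-in for a page's hOCR output, and the code assumes the reported column is a byte offset into line 1.

```python
import shutil
from pathlib import Path
from tempfile import mkdtemp


def bytes_around(path, column, width=20):
    """Return the raw bytes surrounding `column` on the first line of `path`."""
    first_line = Path(path).read_bytes().split(b"\n", 1)[0]
    return first_line[max(column - width, 0): column + width]


# mkdtemp() creates a directory that is NOT removed automatically,
# so intermediate files survive a failure and can be inspected.
debug_dir = Path(mkdtemp(prefix="doctr-debug-"))

# Stand-in for a page's hOCR file that contains a byte which is not valid UTF-8.
sample = debug_dir / "0.xml"
sample.write_bytes(b"<p>" + b"a" * 100 + b"\xe9" + b"</p>")

# Look at the bytes around the position the parser complained about.
context = bytes_around(sample, column=103, width=5)
print(context)

# Remove the directory explicitly once the inspection is done.
shutil.rmtree(debug_dir)
```

Applied to the failing page, calling `bytes_around` with the column from the error message (for example 38224) should show which character the parser rejected.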