Open zkfazal opened 1 day ago
Hi @zkfazal :wave:,
Thanks for reporting.
I quickly tested it on my Linux machine without issues:
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test.py
/home/felix/Desktop/doctr/doctr/models/utils/pytorch.py:59: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state_dict = torch.load(archive_path, map_location="cpu")
Number of pages: 4
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$
import os
from tempfile import TemporaryDirectory
from PyPDF2 import PdfMerger
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
from PIL import Image
from ocrmypdf.hocrtransform import HocrTransform
model = ocr_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
filename = '/home/felix/Desktop/somatosensory.pdf'
doc = DocumentFile.from_pdf(filename)
print(f"Number of pages: {len(doc)}")
# Can also do from_url and from_image
# doc = DocumentFile.from_pdf('C:/Users/Me/Downloads/Syllabus.pdf')
result = model(doc)
result.show()
# Now merge into one PDF/A file
# returns: list of tuple where the first element is the (bytes) xml string and the second is the ElementTree
xml_outputs = result.export_as_xml()
# you can also merge multiple pdfs into one
merger = PdfMerger()
with TemporaryDirectory(dir=os.getcwd()) as tmpdir:
    for i, (xml, img) in enumerate(zip(xml_outputs, doc)):
        # write the images temp
        Image.fromarray(img).save(os.path.join(tmpdir, f"{i}.jpg"))
        # write the xml content temp
        with open(os.path.join(tmpdir, f"{i}.xml"), "w") as f:
            f.write(xml_outputs[i][0].decode())
        # Init hOCR transformer
        the_hocr_filename = os.path.join(tmpdir, f"{i}.xml")
        hocr = HocrTransform(hocr_filename=the_hocr_filename, dpi=300)
        # Save as PDF/A
        hocr.to_pdf(out_filename=os.path.join(tmpdir, f"{i}.pdf"), image_filename=os.path.join(tmpdir, f"{i}.jpg"))
        # Append to merger
        merger.append(f'{tmpdir}/{i}.pdf')
    # Save as combined pdf
    merger.write('output-PDFA.pdf')
DocTR version: 0.10.1a0
TensorFlow version: 2.18.0
PyTorch version: 2.5.0 (torchvision 0.20.0)
OpenCV version: 4.10.0
OS: Ubuntu 24.04.1 LTS
Python version: 3.10.14
Is CUDA available (TensorFlow): Yes
Is CUDA available (PyTorch): Yes
CUDA runtime version: 12.6.77
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 560.35.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.5.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.5.1
I will try to take a deeper look on Monday :+1:
At first glance, though, this looks like an issue with TemporaryDirectory on your Windows machine (--> PermissionError).
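One way such a PermissionError can arise on Windows, sketched under assumptions: a file inside the temporary directory is still held open when `TemporaryDirectory` tries to delete it on exit. Windows refuses to delete open files, whereas Linux allows it, which would be consistent with the script passing on Ubuntu. Whether this is what happens in the notebook is not confirmed. The file contents and the plain `open()` handles below are stand-ins for whatever keeps the per-page PDFs open (possibly the `PdfMerger`, which this sketch does not verify). `contextlib.ExitStack` is used so that every handle is closed before the directory is removed.

```python
import os
from contextlib import ExitStack
from tempfile import TemporaryDirectory


def read_pages_then_cleanup(num_pages: int = 3) -> list[bytes]:
    """Write stand-in page files, read them back, and release every
    handle before the temporary directory is deleted."""
    contents = []
    with TemporaryDirectory() as tmpdir:
        # Hypothetical per-page outputs; the real script writes PDFs here.
        paths = []
        for i in range(num_pages):
            path = os.path.join(tmpdir, f"{i}.pdf")
            with open(path, "wb") as f:
                f.write(f"page {i}".encode())
            paths.append(path)

        # ExitStack closes all handles when its block ends, which is
        # still inside the TemporaryDirectory block.
        with ExitStack() as stack:
            handles = [stack.enter_context(open(p, "rb")) for p in paths]
            contents = [h.read() for h in handles]
        # No handle is open at this point, so cleanup can delete the files.
    return contents


pages = read_pages_then_cleanup()
print(pages)
```

The same ordering idea would apply to any object that opens the per-page files: close it while the `with TemporaryDirectory(...)` block is still active.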
Thanks @felixdittrich92 for testing it on your own machine as well. What's weird is that I've used different temporary directories as well, so I don't know why it would get a permission error on multiple temporary directories. It's Windows, so I'm not as familiar with permission and ownership management as compared to Linux. I'll also tinker around with different temporary directories.
Have you checked whether the issue still occurs without using a temp dir?
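A minimal sketch of what running without a temp dir could look like, assuming a hypothetical `ocr_work` folder in the current working directory. Only the standard library is used, and the byte strings stand in for the per-page JPG, XML, and PDF outputs of the real script. Because nothing deletes the folder automatically, the intermediate files stay available for inspection if a later step fails.

```python
import shutil
from pathlib import Path


def write_page_files(work_dir: Path, pages: list[bytes]) -> list[Path]:
    """Write one stand-in file per page into a regular directory."""
    work_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, payload in enumerate(pages):
        target = work_dir / f"{i}.xml"
        target.write_bytes(payload)  # the handle is closed when this returns
        written.append(target)
    return written


# "ocr_work" is a hypothetical folder name chosen for this sketch.
work_dir = Path("ocr_work")
files = write_page_files(work_dir, [b"<page/>", b"<page/>"])
print([p.name for p in files])

# Remove the folder explicitly once every step has finished.
shutil.rmtree(work_dir)
```

If the PermissionError disappears with this layout, that would point at the automatic cleanup rather than at the OCR or hOCR steps.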
Bug description
I double-checked the existing issues and could not find any related to the bug I ran into while copying the sample code from the Jupyter notebook that generates PDF/A files from docTR output (the link can be found here).
My main issue is with the "All Merged Into One PDF/A file" section, where the code is like so:
I slightly modified the code to have
with TemporaryDirectory(os.getcwd())
and extracted the hocr_filename into its own variable, but these changes do not affect the problem I'm having, which is the following error:
"PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\Users\Me\PycharmProjects\doctr-jupyter\tmpdbajkfk8\0.pdf'"
I am currently running this in a Jupyter notebook in PyCharm. I am able to set up docTR as in the initial demo (the code in your other sample notebook that sets up the basic model) just fine, and it can parse the text of the PDF I'm using, but once it reaches the init hOCR transform line with i=3 (that is, the fourth page), it throws this error.
The PDF I'm using is attached; it is a syllabus I got from a friend. That one at least gets halfway. If I use a different PDF (attached as somatosensory.pdf, a sample PDF I obtained online), I get a different error:
ParseError: not well-formed (invalid token): line 1, column 38224
somatosensory.pdf Syllabus.pdf
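One possible cause of an `invalid token` error like this, offered as an unverified guess: on Windows, `open(path, "w")` without an `encoding` argument uses the locale encoding (often cp1252), so non-ASCII characters in the OCR output may be written in an encoding that disagrees with a UTF-8 XML declaration. The sketch below reproduces that kind of mismatch with the standard library only. The sample XML and the file name are made up for illustration, and cp1252 is passed explicitly to simulate the Windows default.

```python
import os
import xml.etree.ElementTree as ET
from tempfile import TemporaryDirectory

# Made-up sample: a UTF-8 declaration plus one non-ASCII character.
XML_TEXT = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9 menu</p>'


def parses_cleanly(encoding: str) -> bool:
    """Write XML_TEXT with the given encoding and try to parse the file."""
    with TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "page.xml")
        with open(path, "w", encoding=encoding) as f:
            f.write(XML_TEXT)
        try:
            ET.parse(path)
        except ET.ParseError as err:
            # err.position holds the (line, column) of the offending byte.
            print(f"{encoding}: {err}")
            return False
    return True


print(parses_cleanly("cp1252"))  # simulates the Windows locale default
print(parses_cleanly("utf-8"))   # matches the XML declaration
```

If this turns out to be the cause, writing the bytes returned by `export_as_xml()` directly in binary mode, or passing `encoding="utf-8"` to `open()`, would avoid the mismatch.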
Code snippet to reproduce the bug
Error traceback
And for the PermissionError:
Environment
Collecting environment information...
DocTR version: 0.10.1a0
TensorFlow version: N/A
PyTorch version: 2.5.1+cpu (torchvision 0.20.1+cpu)
OpenCV version: 4.10.0
OS: Microsoft Windows 10 Pro
Python version: 3.12.7
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): No
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
Deep Learning backend
is_tf_available: False
is_torch_available: True