Closed justinjosephmkj closed 3 weeks ago
Hi @justinjosephmkj :wave:,
Thanks for the PR :)
But I think that's already solved. Could you please check if upgrading pypdfium2 solves your issue?
https://github.com/pypdfium2-team/pypdfium2/issues/281
CC @mara004
@justinjosephmkj Which version of pypdfium2 are you using?
Is there any caller-side multiprocessing around `DocumentFile.from_pdf()`?
That said, explicitly closing the PdfDocument might be advantageous, to immediately release the file handle, and should inherently prevent the "Cannot close object, library is destroyed" issue, for whatever reason it may be happening here.
However, I think it can be done more easily than in this PR. Just wrap the current `return ...` in:

```python
try:
    return ...
finally:
    pdf.close()
```
Closing a parent object implicitly closes child objects (e.g. page handles) since v4.11, so this should be safe with respect to the order of closing. I will also re-introduce context manager functionality for this reason.[^1]

[^1]: At the time when closing order was unprotected, I decided not to provide the `with`-block because it tempted callers to close the PDF, but not its child objects. (When only using garbage collection, order is ensured automatically by the `weakref.finalize()` API contract.) Anyway, this is no longer an issue.
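For illustration, the safety-net pattern discussed here can be sketched in plain Python. This is a hypothetical `Resource` class of my own, not pypdfium2's actual implementation: `weakref.finalize` acts as the garbage-collection fallback, and an explicit `close()` detaches it so the cleanup runs exactly once.

```python
import weakref

class Resource:
    """Toy stand-in for a native handle (hypothetical, for illustration):
    weakref.finalize is a safety net; explicit close() detaches it."""

    def __init__(self, name, close_log):
        # The finalizer must not reference `self`, or the object would
        # never become garbage. It only captures the cleanup callable.
        self._finalizer = weakref.finalize(self, close_log.append, name)

    def close(self):
        detached = self._finalizer.detach()
        if detached is not None:        # None => cleanup already ran
            obj, func, args, kwargs = detached
            func(*args, **kwargs)       # perform the cleanup ourselves

log = []
r = Resource("doc", log)
r.close()   # explicit close runs the cleanup
r.close()   # idempotent: detach() now returns None
del r       # garbage collection does not close a second time
assert log == ["doc"]
```

Because `detach()` atomically disarms the finalizer, the cleanup cannot run twice even if explicit close and garbage collection race.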
(@felixdittrich92 Actually, I doubt whether updating pypdfium2 will help, because the reporter says they are using doctr v0.8.1, which does linear rendering in the first place, so https://github.com/pypdfium2-team/pypdfium2/issues/281 isn't really applicable, unless the caller is doing multiprocessing.)
> Hi @justinjosephmkj 👋,
> Thanks for the PR :)
> But i think that's already solved could you please check if upgrading pypdfium2 solves your issue ?
>
> CC @mara004
Hi, I am using pypdfium2 version 4.28.0.

@mara004 Version 4.28.0.
```python
pdf = pdfium.PdfDocument(file, password=password, autoclose=True)
try:
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
finally:
    pdf.close()
```
This also solves the issue.
I am not using multiprocessing; I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.
> I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.
Please share this file, if possible – that would be a valuable testing item for us to investigate the issue. I'm afraid I can't debug this unless I'm able to reproduce the issue.
> Version: 4.28.0
You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.
@justinjosephmkj Another question: can you confirm whether this is happening just before the process exits?
@mara004
> > I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.
>
> Please share this file, if possible – that would be a valuable testing item for us to investigate the issue. I'm afraid I can't debug this unless I'm able to reproduce the issue.
Unfortunately I wont be able to share the file, will try to see if I can find a file that might cause this issue from the internet.
> > Version: 4.28.0
>
> You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.
I think the same
> Another question: can you confirm whether this is happening just before the process exits?
Yes, this happens right before the process exits for a single file. I am getting the output that I am looking for, but it ends with this error message. For a batch of files, the code breaks in between: it gives the output for the last opened PDF file, then this error shows up and the code breaks, which stops further execution of the rest of the batch.
> For a batch of files, the code breaks in between but its giving the output for the last opened pdf file and then this error shows up and the code breaks which stops further execution of the rest of the batch.
Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see)?
That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler that destroys pdfium ran while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed Python would call exit handlers only after all garbage objects have been finalized, and I have never had this error yet (nor have most other pypdfium2 users, it seems).

What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.
> That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).
>
> What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.
>
> > For a batch of files, the code breaks in between but its giving the output for the last opened pdf file and then this error shows up and the code breaks which stops further execution of the rest of the batch.
>
> Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see)?
@mara004
I have a script that processes 100s of files one by one. However, the issue that I am encountering occurs randomly with one of these files, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.
The code comes to a halt with this message:

```
Cannot close object, library is destroyed
```
> That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).
>
> What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.
@mara004 I am using a loop to iteratively process multiple files. However, during runtime, the message 'Cannot close object, library is destroyed' appears, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.
Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?
Again, a reproducible example (code & pdf(s)) would be needed to investigate.
@felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?
> Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?
>
> Again, a reproducible example (code & pdf(s)) would be needed to investigate.
>
> @felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?
@mara004
Yes, the closing patch fixed the issue.
FWIW, whether this error is shown or not is controlled by a conditional variable `LIBRARY_AVAILABLE`, which is a mutable bool (`ctypes.c_bool()`).
This is only ever set at the following 3 places: 0, 1, 2.
I don't see how this could go wrong... – unless maybe something has messed with the memory storing the bool?
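To illustrate why a mutable `ctypes.c_bool` is used for such a flag, here is a minimal sketch of my own (the function names and message text are hypothetical, not pypdfium2's actual code): mutating `.value` updates the flag in place, so every module holding a reference observes the change, whereas rebinding a plain `bool` would only change one name.

```python
import ctypes

# Module-level mutable flag (illustrative mini-version of the pattern).
LIBRARY_AVAILABLE = ctypes.c_bool(True)

def _destroy_library():
    # In-place mutation: all holders of a reference to LIBRARY_AVAILABLE
    # see the new value. `LIBRARY_AVAILABLE = False` would merely rebind
    # a local/global name instead.
    LIBRARY_AVAILABLE.value = False

def close_object(name):
    # Guard mirroring the behavior discussed in this thread.
    if not LIBRARY_AVAILABLE.value:
        return f"Cannot close {name}, library is destroyed"
    return f"closed {name}"

assert close_object("page") == "closed page"
_destroy_library()
assert close_object("page") == "Cannot close page, library is destroyed"
```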
Hi @justinjosephmkj @mara004 :wave:,
Soooo
@justinjosephmkj For the scope of this PR I would be fine with @mara004's suggestion:
```python
pdf = pdfium.PdfDocument(file, password=password)
# Workaround ref.: https://github.com/mindee/doctr/issues/1623
try:
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
finally:
    pdf.close()
```
@justinjosephmkj @mara004 I was not able to reproduce any memory leak, but the message appears in the following cases:
case 2:
```
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
^CTraceback (most recent call last):
  File "/home/felix/Desktop/doctr/test_scripts/test_pdf.py", line 36, in <module>
    doc = DocumentFile.from_pdf(path)
  File "/home/felix/Desktop/doctr/doctr/io/reader.py", line 40, in from_pdf
    return read_pdf(file, **kwargs)
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in read_pdf
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in <listcomp>
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/anaconda3/envs/doctr-dev/lib/python3.10/site-packages/pypdfium2/_helpers/page.py", line 428, in render
    pdfium_c.FPDF_RenderPageBitmap(*render_args)
KeyboardInterrupt
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
```
case 1:
```
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:12<00:00, 2.46s/it]
-> Cannot close object, library is destroyed. This may cause a memory leak!
```
script to reproduce (doctr main branch):
```python
import time

from tqdm import tqdm

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
] * 1000

model = (
    ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        det_bs=8,
        reco_bs=1024,
    )
    .cuda()  # remove without GPU
    .half()  # remove without GPU
)

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    result = model(doc)
    json_res = result.export()
    res_len = len(json_res["pages"])
    print(time.time() - start)
    assert pdf_len == res_len, f"pdf_len: {pdf_len}, res_len: {res_len}"
```
Memory usage was ~11 - 13 GB - no peaks (releasing / garbage collection looks correct).
data is attached :) data_share.zip
With the code change the message disappeared; everything else was the same on my end.
Firstly, thanks for looking into this @felixdittrich92. Unfortunately, I didn't manage to get your sample running yet; it fails with some attribute error (see attachment): log.txt (I tried to use tensorflow rather than pytorch because the latter is so big with all its dependencies.)

A slightly more minimalistic example, using only pypdfium2 itself, would be great...
> I was not able to reproduce any mem leak but the message appears if:
> - the last file is finally processed
> - through a keyboard interrupt
As long as it happens only on exit, that should be harmless, as stated already in https://github.com/mindee/doctr/pull/1624#issuecomment-2143576037
However, that sounds like `atexit.register()` does not promise that exit handlers are called after garbage-object finalizers.
Anyway, I would probably just leave things as-is in this regard, because it doesn't seem worth installing a global object weakref cache to walk through before destroying the library...
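The interaction described above can be demonstrated deterministically with a small experiment of my own (CPython-specific; the printed strings are illustrative labels, not pypdfium2 output). `atexit` runs handlers in LIFO order, and `weakref` installs its own exit hook when the first finalizer is created, so a handler registered *after* that point (think: "destroy the library") fires *before* pending finalizers (think: "close document"):

```python
import subprocess
import sys
import textwrap

# Child script: a finalizer is created first, then an atexit handler is
# registered. With CPython's LIFO atexit semantics, the later-registered
# handler runs before the weakref exit hook that flushes live finalizers.
child = textwrap.dedent("""
    import atexit, weakref

    class Obj:
        pass

    o = Obj()  # kept alive until interpreter shutdown
    weakref.finalize(o, print, "finalizer: close object")
    atexit.register(print, "atexit: destroy library")
""")

out = subprocess.run(
    [sys.executable, "-c", child], capture_output=True, text=True
).stdout.splitlines()
print(out)  # expected on CPython: destroy first, then the finalizer
```

If the registration order were reversed (destroy handler first, then the first finalizer), the finalizers would run first, which is presumably why the message is normally absent.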
> Unfortunately, I didn't manage to get your sample running yet - fails with some attribute error
Ah, I just had a mismatched tensorflow version. Doing `pip3 install -e .[tf]` rather than `pip3 install -e .; pip3 install tensorflow` fixed it 😅 Retrying now...
Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...
> Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...
I assume you are running on CPU ? ^^
You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages).
> I assume you are running on CPU ? ^^
Yes.
> You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)
That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....
> > I assume you are running on CPU ? ^^
>
> Yes.
>
> > You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)
>
> That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....
Strange, I tested it again with each single file .. no message .. tested with several bunches .. no message .. tested again with all 5 files .. message appears xD
@mara004 A script to test only the pypdfium2 part:
```python
import time

from tqdm import tqdm

from doctr.io import DocumentFile

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
]

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    print(len(doc))
    print(time.time() - start)
```
The message appears for each page in the documents now... looks to me like with `autoclose=True` the closing isn't awaited / or it's too fast for the pdfium backend -- @mara004 could this be possible?
By using the `ocr_predictor` you have some delay, which seems to be the reason why it doesn't appear all the time.
For example, if I process with this script only "/home/felix/Desktop/602.pdf", which has 4 pages, the message appears 4 times ^^
I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...
@felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend
```python
import pypdfium2.internal as pdfium_i
pdfium_i.DEBUG_AUTOCLOSE.value = True
```
and attach the output of a run that produced the error?
> The message appears for each page in the documents now... looks to me like with autoclose=True the closing isn't awaited / or it's to fast for the pdfium backend -- @mara004 could this be possible ?
`autoclose` unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...
> I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...
>
> @felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend
>
> ```python
> import pypdfium2.internal as pdfium_i
> pdfium_i.DEBUG_AUTOCLOSE.value = True
> ```
>
> and attach the output of a run that produced the error?
>
> > The message appears for each page in the documents now... looks to me like with autoclose=True the closing isn't awaited / or it's to fast for the pdfium backend -- @mara004 could this be possible ?
>
> `autoclose` unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...
Here you go :)
```
Processing: 0%| | 0/5 [00:00<?, ?it/s]19
0.7914893627166748
Processing: 20%|█████████████████████████████████▍ | 1/5 [00:00<00:03, 1.26it/s]3
0.02114415168762207
6
0.036025047302246094
Close (auto) <PdfPage uuid:333e76e7>
Close (auto) <PdfPage uuid:4b3cc566>
Close (auto) <PdfPage uuid:b706c0f4>
Close (auto) <PdfPage uuid:f9093a3d>
Close (auto) <PdfPage uuid:8e615b76>
Close (auto) <PdfPage uuid:4dd695af>
Close (auto) <PdfPage uuid:233a80ea>
Close (auto) <PdfPage uuid:0de4c8e0>
Close (auto) <PdfPage uuid:8394319f>
Close (auto) <PdfPage uuid:d2deeca4>
Close (auto) <PdfPage uuid:01602212>
Close (auto) <PdfPage uuid:4ea048f9>
Close (auto) <PdfPage uuid:78a2e272>
Close (auto) <PdfPage uuid:6d8fe051>
Close (auto) <PdfPage uuid:067c7284>
Close (auto) <PdfPage uuid:51e1682b>
Close (auto) <PdfPage uuid:267dd5af>
Close (auto) <PdfPage uuid:901b5056>
Close (auto) <PdfPage uuid:25733567>
Close (auto) <PdfPage uuid:9437fe80>
Close (auto) <PdfPage uuid:1bc110fd>
Close (auto) <PdfPage uuid:0bd464e8>
Close (auto) <PdfPage uuid:53a3c703>
Close (auto) <PdfPage uuid:e5dab713>
Close (auto) <PdfPage uuid:95aeb5aa>
Close (auto) <PdfPage uuid:4afdcda9>
Close (auto) <PdfPage uuid:fac8b29a>
Close (auto) <PdfPage uuid:771244dc>
Close (auto) <PdfPage uuid:597d6e69>
Close (auto) <PdfPage uuid:8ea275cb>
Close (auto) <PdfPage uuid:688eb2f6>
Close (auto) <PdfPage uuid:4e298037>
Close (auto) <PdfPage uuid:f898e06c>
20
0.13844823837280273
Processing: 80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 4/5 [00:00<00:00, 4.93it/s]16
0.09407329559326172
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 4.62it/s]
Close (auto) <PdfPage uuid:e3c3896b>
Close (auto) <PdfPage uuid:29e9deff>
Close (auto) <PdfPage uuid:db94fb58>
Close (auto) <PdfPage uuid:1097da0b>
Close (auto) <PdfPage uuid:0ba7b27f>
Close (auto) <PdfPage uuid:c83f918b>
Close (auto) <PdfPage uuid:f4b33617>
Close (auto) <PdfPage uuid:0933d63b>
Close (auto) <PdfPage uuid:e737b1f5>
Close (auto) <PdfPage uuid:8e84a54e>
Close (auto) <PdfPage uuid:cb49229e>
Close (auto) <PdfPage uuid:83ed8321>
Close (auto) <PdfPage uuid:0476ffdf>
Close (auto) <PdfPage uuid:5ee10383>
Close (auto) <PdfPage uuid:0edbe852>
Close (auto) <PdfPage uuid:3c450b88>
Close (auto) <PdfPage uuid:bc3f33f0>
Close (auto) <PdfPage uuid:954516f5>
Close (auto) <PdfPage uuid:081b092d>
Close (auto) <PdfPage uuid:ddd88b13>
Close (auto) <PdfPage uuid:86c9d1f2>
Close (auto) <PdfPage uuid:c2c96fe6>
Close (auto) <PdfPage uuid:f2bb8611>
Close (auto) <PdfPage uuid:9bf25db7>
Close (auto) <PdfPage uuid:7da65012>
Close (auto) <PdfPage uuid:bb4188a4>
Close (auto) <PdfPage uuid:9ea12db3>
Close (auto) <PdfPage uuid:6203ce20>
Close (auto) <PdfPage uuid:c47cb474>
Close (auto) <PdfPage uuid:daafbdb6>
Close (auto) <PdfDocument uuid:60b0a002 from '/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf'>
Close (auto) <PdfPage uuid:ba5a1396>
Close (auto) <PdfDocument uuid:f3566c65 from '/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf'>
Close (auto) <PdfDocument uuid:125811aa from '/home/felix/Desktop/1262.pdf'>
Close (auto) <PdfDocument uuid:d0aa77e1 from '/home/felix/Desktop/602.pdf'>
Close (auto) <PdfDocument uuid:4b5964a9 from '/home/felix/Desktop/2305.02577v1.pdf'>
Destroy PDFium (auto)
```
---
```
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
Processing: 0%| | 0/5 [00:00<?, ?it/s]19
0.7971949577331543
Processing: 20%|███████████████████████████████████▊ | 1/5 [00:00<00:03, 1.25it/s]3
0.021633386611938477
6
0.03635716438293457
20
0.14321637153625488
Processing: 80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 4/5 [00:00<00:00, 4.87it/s]16
0.09462189674377441
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00, 4.57it/s]
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$
```
@mara004 yeah :sweat_smile: `autoclose` sounds to me like a context manager under the hood which takes over the job of closing the file
@felixdittrich92 Erm, concerning the output:

- This is from two separate invocations, right?
- In the first one, you added the debug info, but the error did not appear.
- In the second one, where the error appeared, it seems like you may have forgotten to add the `DEBUG_AUTOCLOSE` snippet? Or did you, but there were no debug messages?

> autoclose sounds to me like a context manager under the hood which takes the job to close the file

That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly. I agree that the naming was unfortunate. I will probably remove `autoclose` in v5 and just ask the caller to close the buffer directly rather than via a PdfDocument param.
Ah, I see. That's unfortunate, as what I am interested in is, of course, the error with debug info, because either alone is sort of useless. I am guessing this is again simply due to the non-deterministic nature of garbage collection. Feel free to report back if anyone manages to get the error with debug info.
Alright :+1:
Now I'm struggling a bit with the scope of this PR, because I agree that the message is a bit scary, but it doesn't look like there is any memory leak on our / on pypdfium2's side .. so the only change would be that the message disappears.

Could you share some profiling logs with and without your changes please @justinjosephmkj ?
> i agree that the message is a bit scary but it doesn't looks like there is any memory leak on our / on pypdfium2 side
Yes, I also intend to clarify the message upstream that this is harmless if it happens only on process exit.
In any case, the explicit closing as in https://github.com/mindee/doctr/pull/1624#issuecomment-2143818802 should be fine, provided you require pypdfium2 >= 4.11.
@felixdittrich92 @mara004 Thank you both for looking into this issue. I have updated pypdfium2 to version 4.30.0. Now the message

```
-> Cannot close object, library is destroyed. This may cause a memory leak!
```

only shows up right before the process exits.

This will give the message for all PDF files:

```python
gc.set_threshold(10, 1, 1)
```

The issue is not causing a memory leak as of now, but the message is coming up.
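For context on why lowering the thresholds changes the behavior, here is a minimal sketch of my own (the `Node` class is hypothetical): objects trapped in reference cycles can only be reclaimed by the cyclic collector, and `gc.set_threshold(10, 1, 1)` makes those collection passes, and therefore the finalizers they trigger, run after only ~10 allocations instead of the default 700.

```python
import gc
import weakref

# Lower the generation thresholds as in the snippet above.
gc.set_threshold(10, 1, 1)
assert gc.get_threshold() == (10, 1, 1)

class Node:
    def __init__(self):
        self.ref = self  # self-reference: refcounting alone can't free this

finalized = []
n = Node()
weakref.finalize(n, finalized.append, True)
del n         # now unreachable, but trapped in a cycle
gc.collect()  # explicit pass stands in for the now-frequent automatic ones
assert finalized == [True]
```

With low thresholds, such collections happen constantly during the run, so per-document finalizers fire mid-execution rather than piling up until exit.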
> I have updated Name: pypdfium2 to Version: 4.30.0. Now the message [...] only shows up right before the process exits.
As you can see yourself, there is no related change between 4.28...4.30, so any difference in runtime closing behavior would be coincidence.
> `gc.set_threshold(10, 1, 1)`
FWIW, I still do not get the message despite adding this, with Fedora-provided Python 3.11.6.
That said, I looked into the weakref module again and found the finalizer `atexit` attribute. From theoretical considerations I think setting this to False should make the message disappear – can you test the following patch?
```diff
diff --git a/src/pypdfium2/internal/bases.py b/src/pypdfium2/internal/bases.py
index af34c30a..4f14ce73 100644
--- a/src/pypdfium2/internal/bases.py
+++ b/src/pypdfium2/internal/bases.py
@@ -72,6 +72,7 @@ class AutoCloseable (AutoCastable):
         # NOTE this function captures the value of the `parent` property at finalizer installation time - if it changes, detach the old finalizer and create a new one
         assert self._finalizer is None
         self._finalizer = weakref.finalize(self._obj, _close_template, self._close_func, self.raw, repr(self), self._autoclose_state, self.parent, *self._ex_args, **self._ex_kwargs)
+        self._finalizer.atexit = False
 
     def _detach_finalizer(self):
         self._finalizer.detach()
```
^ The only problem I see with this patch is: if you embed a Python interpreter in another process that keeps running after the interpreter, not invoking finalizers at interpreter shutdown would cause an actual memory leak. Same goes if you get the "cannot close object" warning in an embedded interpreter. But as long as Python is hosted in its own process, it should be fine...
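The effect of the `atexit` attribute can be checked in isolation with a small experiment of my own (labels and script structure are illustrative, not pypdfium2 code): a finalizer whose object is still alive at interpreter shutdown normally runs via the exit machinery, while setting `finalizer.atexit = False` suppresses exactly that.

```python
import subprocess
import sys
import textwrap

# Two child scripts, identical except for the `atexit` attribute on the
# finalizer. The object stays alive until interpreter shutdown, so the
# finalizer can only run through the exit machinery.
template = """
import weakref

class Obj:
    pass

o = Obj()
f = weakref.finalize(o, print, "finalized")
f.atexit = {atexit_flag}
"""

def run(atexit_flag):
    code = textwrap.dedent(template.format(atexit_flag=atexit_flag))
    return subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    ).stdout.strip()

print(run(True))   # the finalizer fires at exit (default behavior)
print(run(False))  # nothing: the finalizer is skipped at exit
```

This is why the patch trades the warning for a silent skip; whether that skip matters depends on whether the hosting process outlives the interpreter, as noted above.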
> `self._finalizer.atexit = False`

`pypdfium2 4.30.0+10.gff2ff4b.dirty pypi_0 pypi`
Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)
> Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)
That's very weird... You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?
> > Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)
>
> That's very weird... You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?
Nope, I cloned your repo and installed it in editable mode to apply your suggestion :)
@justinjosephmkj Could you update your PR with the suggestion from https://github.com/mindee/doctr/pull/1624#issuecomment-2143818802 please ? :)
And update https://github.com/mindee/doctr/blob/674a875db8339ff7ad58e3b25441343ae3d01bfe/pyproject.toml#L41 to `"pypdfium2>=4.11.0,<5.0.0",` and the same here: https://github.com/mindee/doctr/blob/674a875db8339ff7ad58e3b25441343ae3d01bfe/.conda/meta.yaml#L28
Thanks :)
@felixdittrich92 I have updated the code in the PR.
@justinjosephmkj Could you please update the version identifier as mentioned in https://github.com/mindee/doctr/pull/1624#issuecomment-2154150716 also? And remove the `autoclose=True`? :)
@felixdittrich92 Updated.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 96.42%. Comparing base (9c92f5c) to head (07583e4). Report is 6 commits behind head on main.
I'm still confused about what may be the cause of this issue, why `.atexit = False` reportedly does not prevent it, and why I can't reproduce.
After all, pypdfium2 is simply a caller of atexit.register() / weakref.finalize(), and if that's just how these APIs end up (mis-)behaving for some users, I don't see what we could do about that from our side...
Is there any chance this happens only with specific versions, or builds, of Python?
@mara004 which OS do you use ? Mh, tried it on python 3.10 and 3.11, same thing, so I don't think so :sweat_smile:
> which OS do you use ?
I'm on Fedora 37 with system-provided python 3.11.6. Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?
> > which OS do you use ?
>
> I'm on Fedora 37 with system-provided python 3.11.6. Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?
Ubuntu 22.04 LTS and python from conda. So in this case it would be easiest to test it in a container again ^^
Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. I'll test on my friends' Ubuntu 20 and 22 devices at some point.
Also, pypdfium2's own test suite does not show the issue on GH actions.
> Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. I'll test on my friends' Ubuntu 20 and 22 devices at some point.
>
> Also, pypdfium2's own test suite does not show the issue on GH actions.
Strange :sweat_smile:
Fix for issue #1623.