mindee / doctr

docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
https://mindee.github.io/doctr/
Apache License 2.0

[IO] Pdf File close after opening #1624

Closed: justinjosephmkj closed this 3 weeks ago

justinjosephmkj commented 1 month ago

Fix for the issue #1623

felixdittrich92 commented 1 month ago

Hi @justinjosephmkj :wave:,

Thanks for the PR :)

But I think that's already solved. Could you please check whether upgrading pypdfium2 solves your issue?

https://github.com/pypdfium2-team/pypdfium2/issues/281

CC @mara004

mara004 commented 1 month ago

@justinjosephmkj Which version of pypdfium2 are you using? Is there any caller-side multiprocessing around DocumentFile.from_pdf()?

That said, explicitly closing the PdfDocument might be advantageous, to immediately release the file handle, and should inherently prevent the Cannot close object, library is destroyed issue, for whatever reason it may be happening here.

However, I think it can be done more easily than this PR. Just wrap the current return ... in

try:
    return ...
finally:
    pdf.close()

Closing a parent object implicitly closes child objects (e.g. page handles) since v4.11, so this should be safe with respect to the order of closing. I will also re-introduce context manager functionality for this reason.[^1]

[^1]: At the time when closing order was unprotected, I decided not to provide the with-block because it tempted callers to close the PDF, but not its child objects. (When only using garbage collection, order is ensured automatically by the weakref.finalize() API contract.) Anyway, this is no longer an issue.
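For reference, a minimal sketch of how usage could look once that context-manager support is (re-)available. It assumes the with-block closes the document on exit, as described above; the file path and render arguments are placeholders, not part of this PR:

```python
import pypdfium2 as pdfium

# Hypothetical usage once PdfDocument supports the with-statement again:
# leaving the block closes the document, which (since v4.11) also closes
# any page handles that are still open.
with pdfium.PdfDocument("sample.pdf") as pdf:  # "sample.pdf" is a placeholder
    images = [page.render(scale=2).to_numpy() for page in pdf]
```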

mara004 commented 1 month ago

(@felixdittrich92 Actually, I doubt whether updating pypdfium2 will help, because the reporter says they are using doctr v0.8.1, which does linear rendering in the first place, so https://github.com/pypdfium2-team/pypdfium2/issues/281 isn't really applicable, unless the caller is doing multiprocessing.)

justinjosephmkj commented 1 month ago

> Hi @justinjosephmkj 👋,
>
> Thanks for the PR :)
>
> But I think that's already solved. Could you please check whether upgrading pypdfium2 solves your issue?
>
> pypdfium2-team/pypdfium2#281
>
> CC @mara004

Hi @mara004, I am using pypdfium2 version 4.28.0.

    try:
        pdf = pdfium.PdfDocument(file, password=password, autoclose=True)
        return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
    finally:
        pdf.close()

This also solves the issue.

I am not using multiprocessing; I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.

mara004 commented 1 month ago

> I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.

Please share this file, if possible – that would be a valuable testing item for us to investigate the issue. I'm afraid I can't debug this unless I'm able to reproduce the issue.

> Version: 4.28.0

You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.

mara004 commented 1 month ago

@justinjosephmkj Another question: can you confirm whether this is happening just before the process exits?

justinjosephmkj commented 1 month ago

@mara004

> I was using iteration for all the files but found a single file for which I was able to recreate the issue every time I ran the script.
>
> Please share this file, if possible – that would be a valuable testing item for us to investigate the issue. I'm afraid I can't debug this unless I'm able to reproduce the issue.

Unfortunately I won't be able to share the file; I will try to see if I can find a file on the internet that causes this issue.

> Version: 4.28.0
>
> You're encouraged to update to latest (v4.30), but I don't think it will resolve the issue.

I think the same.

> Another question: can you confirm whether this is happening just before the process exits?

Yes, this happens right before the process exits for a single file. I am getting the output that I am looking for, but it ends with this error message. For a batch of files, the code breaks in between, but it's giving the output for the last opened PDF file and then this error shows up and the code breaks, which stops further execution of the rest of the batch.

mara004 commented 1 month ago

> For a batch of files, the code breaks in between, but it's giving the output for the last opened PDF file and then this error shows up and the code breaks, which stops further execution of the rest of the batch.

Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see) ?

mara004 commented 1 month ago

That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).

What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.
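A standalone sketch of the shutdown interplay described above, using only the standard library (none of this is pypdfium2's actual code): an atexit handler stands in for the library-destroy step, and weakref.finalize acts as the safety net for objects that were never closed explicitly. Because atexit handlers run last-in-first-out, the relative registration order decides whether the finalizer still finds the "library" alive.

```python
import atexit
import weakref

LIBRARY_ALIVE = True

class Handle:
    """Stand-in for a wrapper object that needs the library to clean up."""
    pass

# Creating any finalize object registers weakref's internal at-exit hook.
# Doing it here, *before* registering destroy_library, means destroy_library
# runs first at shutdown (atexit is LIFO), so the warning branch is taken.
weakref.finalize(Handle(), lambda: None)

def destroy_library():
    global LIBRARY_ALIVE
    LIBRARY_ALIVE = False
    print("library destroyed")

atexit.register(destroy_library)

def close_object(name):
    if LIBRARY_ALIVE:
        print(f"closed {name}")
    else:
        print(f"cannot close {name}, library is destroyed")

obj = Handle()                          # stays referenced until shutdown
weakref.finalize(obj, close_object, "obj")
# Expected output at exit: "library destroyed", then "cannot close obj, ...".
# Register destroy_library before the first finalize object instead, and the
# order flips: "closed obj" is printed before the library is destroyed.
```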

justinjosephmkj commented 1 month ago

> That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).
>
> What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.
>
> > For a batch of files, the code breaks in between, but it's giving the output for the last opened PDF file and then this error shows up and the code breaks, which stops further execution of the rest of the batch.
>
> Sorry, I don't understand that yet. Can you re-phrase? What exactly is happening, and in which order? In particular, what precisely do you mean by the two mentions of "the code breaks" (e.g. what error do you see) ?

@mara004

I have a script that processes 100s of files one by one. However, the issue that I am encountering occurs randomly with one of these files, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.

The code comes to a halt with this message:

Cannot close object, library is destroyed

justinjosephmkj commented 1 month ago

> That said, I figured that, if you see this message only on process exit, it should be harmless. It would just mean that the exit handler to destroy pdfium hit while there were still objects that hadn't been garbage collected / finalized yet. In that case, we're not allowed to call the object closing APIs anymore, but we also don't have to worry about the leak if the process is about to terminate anyway. However, the strange thing is that I believed python would call exit handlers only after all garbage objects have been finalized, and I never had this error yet (nor most other pypdfium2 users, it seems).
>
> What would be bad is if you were getting this midway at runtime or something, because pdfium should stay alive until process exit.

@mara004 I am using a loop to iteratively process multiple files. However, during runtime, the message 'Cannot close object, library is destroyed' appears, causing the script to halt prematurely. As a result, the remaining files in the batch are not processed.

mara004 commented 1 month ago

Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?

Again, a reproducible example (code & pdf(s)) would be needed to investigate.

@felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?

justinjosephmkj commented 1 month ago

> Hoom, that's strange, but thanks for clarifying. And the closing patch actually fixes this halt?
>
> Again, a reproducible example (code & pdf(s)) would be needed to investigate.
>
> @felixdittrich92 Did you ever encounter any such behavior as described by @justinjosephmkj ?

@mara004

Yes, the closing patch fixed the issue.

mara004 commented 1 month ago

FWIW, whether this error is shown or not is controlled by a conditional variable LIBRARY_AVAILABLE, which is a mutable bool (ctypes.c_bool()). This is only ever set at the following 3 places: 0, 1, 2. I don't see how this could go wrong... – unless maybe something has messed with the memory storing the bool?
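For readers unfamiliar with the pattern being referred to: a tiny, self-contained sketch of a ctypes.c_bool used as a mutable module-level flag (the helper names here are illustrative, not pypdfium2's actual internals):

```python
import ctypes

# A ctypes.c_bool is a mutable box: code holding a reference to it can flip
# .value in place, and every reader sees the change.
LIBRARY_AVAILABLE = ctypes.c_bool(False)

def init_library():
    LIBRARY_AVAILABLE.value = True      # e.g. after library initialization succeeds

def destroy_library():
    LIBRARY_AVAILABLE.value = False     # e.g. from an atexit handler

init_library()
assert LIBRARY_AVAILABLE.value
destroy_library()
assert not LIBRARY_AVAILABLE.value
```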

felixdittrich92 commented 1 month ago

Hi @justinjosephmkj @mara004 :wave:,

Soooo

@justinjosephmkj For the scope of this PR I would be fine with @mara004's suggestion:

pdf = pdfium.PdfDocument(file, password=password)
# Workaround ref.: https://github.com/mindee/doctr/issues/1623
try:
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
finally:
    pdf.close()

@justinjosephmkj @mara004 I was not able to reproduce any memory leak, but the message appears if:

  1. the last file is finally processed
  2. through a keyboard interrupt

case 2:

(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
^CTraceback (most recent call last):
  File "/home/felix/Desktop/doctr/test_scripts/test_pdf.py", line 36, in <module>
    doc = DocumentFile.from_pdf(path)
  File "/home/felix/Desktop/doctr/doctr/io/reader.py", line 40, in from_pdf
    return read_pdf(file, **kwargs)
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in read_pdf
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/Desktop/doctr/doctr/io/pdf.py", line 42, in <listcomp>
    return [page.render(scale=scale, rev_byteorder=rgb_mode, **kwargs).to_numpy() for page in pdf]
  File "/home/felix/anaconda3/envs/doctr-dev/lib/python3.10/site-packages/pypdfium2/_helpers/page.py", line 428, in render
    pdfium_c.FPDF_RenderPageBitmap(*render_args)
KeyboardInterrupt
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!

case 1:

Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:12<00:00,  2.46s/it]
-> Cannot close object, library is destroyed. This may cause a memory leak!

script to reproduce (doctr main branch):

import time

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

from tqdm import tqdm

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
] * 1000

model = (
    ocr_predictor(
        det_arch="fast_base",
        reco_arch="parseq",
        pretrained=True,
        det_bs=8,
        reco_bs=1024,
    )
    .cuda()  # remove without GPU
    .half()  # remove without GPU
)

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    result = model(doc)
    json_res = result.export()
    res_len = len(json_res["pages"])
    print(time.time() - start)
    assert pdf_len == res_len, f"pdf_len: {pdf_len}, res_len: {res_len}"

Memory usage was ~11-13 GB with no peaks (releasing / garbage collection looks correct).

data is attached :) data_share.zip

felixdittrich92 commented 1 month ago

With the code change the message disappeared; everything else was the same on my end.

mara004 commented 1 month ago

Firstly, thanks for looking into this @felixdittrich92. Unfortunately, I didn't manage to get your sample running yet - fails with some attribute error (see attachment): log.txt (I tried to use tensorflow rather than pytorch because the latter is so big with all its dependencies.) A slightly more minimalistic example, using only pypdfium2 itself, would be great...

> I was not able to reproduce any memory leak, but the message appears if:
>
>   • the last file is finally processed
>   • through a keyboard interrupt

As long as it happens only right before exit, that should be harmless, as stated already in https://github.com/mindee/doctr/pull/1624#issuecomment-2143576037. However, that sounds like atexit.register() does not promise that exit handlers are called after garbage object finalizers. Anyway, I would probably just leave things as-is in this regard, because it doesn't seem worth installing a global object weakref cache to walk through before destroying the library...
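For context, a rough sketch of the "global object weakref cache" idea being ruled out above: wrappers would register themselves in a WeakSet, and the exit handler would close whatever is still alive before tearing down the library. Names and structure are illustrative only, not pypdfium2's code:

```python
import atexit
import weakref

_LIVE_OBJECTS = weakref.WeakSet()   # tracks wrappers without keeping them alive

class Wrapper:
    def __init__(self, name):
        self.name = name
        self.closed = False
        _LIVE_OBJECTS.add(self)

    def close(self):
        if not self.closed:
            print(f"closing {self.name}")
            self.closed = True

def _destroy_library():
    # Close every wrapper that is still alive, then tear down the library,
    # so no finalizer can fire after the library is gone.
    for obj in list(_LIVE_OBJECTS):
        obj.close()
    print("destroying library")

atexit.register(_destroy_library)
```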

mara004 commented 1 month ago

> Unfortunately, I didn't manage to get your sample running yet - fails with some attribute error

Ah, I just had a mismatched tensorflow version. Doing pip3 install -e .[tf] rather than pip3 install -e .; pip3 install tensorflow fixed it 😅 Retrying now...

mara004 commented 1 month ago

Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...

felixdittrich92 commented 1 month ago

> Still no luck with the script. I let it run for several minutes but it doesn't get past 0% ...

I assume you are running on CPU ? ^^ You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)

mara004 commented 1 month ago

> I assume you are running on CPU ? ^^

Yes.

> You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)

That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....

felixdittrich92 commented 4 weeks ago

> > I assume you are running on CPU ? ^^
>
> Yes.
>
> > You should be able to reproduce it also with only 1 file (for example 602.pdf has only 3 pages)
>
> That worked, though it still took rather long, even with just the 1 file. However, I didn't get the error message.....

Strange, I tested it also again with each single file .. no message .. tested with several batches .. no message .. tested again with all 5 files .. message appears xD

@mara004 script only to test the pypdfium2 part:

import time

from tqdm import tqdm

from doctr.io import DocumentFile

paths = [
    "/home/felix/Desktop/2305.02577v1.pdf",
    "/home/felix/Desktop/602.pdf",
    "/home/felix/Desktop/1262.pdf",
    "/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf",
    "/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf",
]

for path in tqdm(paths, desc="Processing", total=len(paths)):
    start = time.time()
    doc = DocumentFile.from_pdf(path)
    pdf_len = len(doc)
    print(len(doc))
    print(time.time() - start)

The message appears for each page in the documents now... It looks to me like with autoclose=True the closing isn't awaited, or it's too fast for the pdfium backend -- @mara004 could this be possible ? By using the ocr_predictor you have some delay, which seems to be the reason why it doesn't appear all the time.

For example, if I process only "/home/felix/Desktop/602.pdf" with this script, which has 4 pages, the message appears 4 times ^^

mara004 commented 4 weeks ago

I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...

@felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend

import pypdfium2.internal as pdfium_i
pdfium_i.DEBUG_AUTOCLOSE.value = True

and attach the output of a run that produced the error?

> The message appears for each page in the documents now... It looks to me like with autoclose=True the closing isn't awaited, or it's too fast for the pdfium backend -- @mara004 could this be possible ?

autoclose unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...
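To make the distinction concrete, a small sketch of what autoclose actually controls according to the explanation above (the file name is a placeholder; the exact buffer-closing behavior is an assumption based on that explanation):

```python
import pypdfium2 as pdfium

buf = open("sample.pdf", "rb")                 # placeholder path
pdf = pdfium.PdfDocument(buf, autoclose=True)  # autoclose refers to `buf`, not to the document
n_pages = len(pdf)

pdf.close()        # the document still has to be closed explicitly (or by its finalizer);
                   # with autoclose=True this should also close `buf`
assert buf.closed
```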

felixdittrich92 commented 4 weeks ago

> I ran the script a couple of times, but again can't reproduce. Though, as the issue appears to be time / garbage collection dependent, that doesn't have to mean much...
>
> @felixdittrich92 Given that you can, at least sometimes, reproduce the issue, could you please prepend
>
> import pypdfium2.internal as pdfium_i
> pdfium_i.DEBUG_AUTOCLOSE.value = True
>
> and attach the output of a run that produced the error?
>
> > The message appears for each page in the documents now... It looks to me like with autoclose=True the closing isn't awaited, or it's too fast for the pdfium backend -- @mara004 could this be possible ?
>
> autoclose unfortunately is misleading. It just means, if the input is a byte buffer, to close that buffer when the document is closed (whether automatically or explicitly). Rather obsolete when I come to think of it; perhaps I should just remove that param...

Here you go :)

Processing:   0%|                                                                                                                                                                               | 0/5 [00:00<?, ?it/s]19
0.7914893627166748
Processing:  20%|█████████████████████████████████▍                                                                                                                                     | 1/5 [00:00<00:03,  1.26it/s]3
0.02114415168762207
6
0.036025047302246094
Close (auto) <PdfPage uuid:333e76e7>
Close (auto) <PdfPage uuid:4b3cc566>
Close (auto) <PdfPage uuid:b706c0f4>
Close (auto) <PdfPage uuid:f9093a3d>
Close (auto) <PdfPage uuid:8e615b76>
Close (auto) <PdfPage uuid:4dd695af>
Close (auto) <PdfPage uuid:233a80ea>
Close (auto) <PdfPage uuid:0de4c8e0>
Close (auto) <PdfPage uuid:8394319f>
Close (auto) <PdfPage uuid:d2deeca4>
Close (auto) <PdfPage uuid:01602212>
Close (auto) <PdfPage uuid:4ea048f9>
Close (auto) <PdfPage uuid:78a2e272>
Close (auto) <PdfPage uuid:6d8fe051>
Close (auto) <PdfPage uuid:067c7284>
Close (auto) <PdfPage uuid:51e1682b>
Close (auto) <PdfPage uuid:267dd5af>
Close (auto) <PdfPage uuid:901b5056>
Close (auto) <PdfPage uuid:25733567>
Close (auto) <PdfPage uuid:9437fe80>
Close (auto) <PdfPage uuid:1bc110fd>
Close (auto) <PdfPage uuid:0bd464e8>
Close (auto) <PdfPage uuid:53a3c703>
Close (auto) <PdfPage uuid:e5dab713>
Close (auto) <PdfPage uuid:95aeb5aa>
Close (auto) <PdfPage uuid:4afdcda9>
Close (auto) <PdfPage uuid:fac8b29a>
Close (auto) <PdfPage uuid:771244dc>
Close (auto) <PdfPage uuid:597d6e69>
Close (auto) <PdfPage uuid:8ea275cb>
Close (auto) <PdfPage uuid:688eb2f6>
Close (auto) <PdfPage uuid:4e298037>
Close (auto) <PdfPage uuid:f898e06c>
20
0.13844823837280273
Processing:  80%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                                 | 4/5 [00:00<00:00,  4.93it/s]16
0.09407329559326172
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.62it/s]
Close (auto) <PdfPage uuid:e3c3896b>
Close (auto) <PdfPage uuid:29e9deff>
Close (auto) <PdfPage uuid:db94fb58>
Close (auto) <PdfPage uuid:1097da0b>
Close (auto) <PdfPage uuid:0ba7b27f>
Close (auto) <PdfPage uuid:c83f918b>
Close (auto) <PdfPage uuid:f4b33617>
Close (auto) <PdfPage uuid:0933d63b>
Close (auto) <PdfPage uuid:e737b1f5>
Close (auto) <PdfPage uuid:8e84a54e>
Close (auto) <PdfPage uuid:cb49229e>
Close (auto) <PdfPage uuid:83ed8321>
Close (auto) <PdfPage uuid:0476ffdf>
Close (auto) <PdfPage uuid:5ee10383>
Close (auto) <PdfPage uuid:0edbe852>
Close (auto) <PdfPage uuid:3c450b88>
Close (auto) <PdfPage uuid:bc3f33f0>
Close (auto) <PdfPage uuid:954516f5>
Close (auto) <PdfPage uuid:081b092d>
Close (auto) <PdfPage uuid:ddd88b13>
Close (auto) <PdfPage uuid:86c9d1f2>
Close (auto) <PdfPage uuid:c2c96fe6>
Close (auto) <PdfPage uuid:f2bb8611>
Close (auto) <PdfPage uuid:9bf25db7>
Close (auto) <PdfPage uuid:7da65012>
Close (auto) <PdfPage uuid:bb4188a4>
Close (auto) <PdfPage uuid:9ea12db3>
Close (auto) <PdfPage uuid:6203ce20>
Close (auto) <PdfPage uuid:c47cb474>
Close (auto) <PdfPage uuid:daafbdb6>
Close (auto) <PdfDocument uuid:60b0a002 from '/home/felix/Desktop/doctr_test_data/vitstr_paper.pdf'>
Close (auto) <PdfPage uuid:ba5a1396>
Close (auto) <PdfDocument uuid:f3566c65 from '/home/felix/Desktop/doctr_test_data/Versicherungsbedingungen-08-2021.pdf'>
Close (auto) <PdfDocument uuid:125811aa from '/home/felix/Desktop/1262.pdf'>
Close (auto) <PdfDocument uuid:d0aa77e1 from '/home/felix/Desktop/602.pdf'>
Close (auto) <PdfDocument uuid:4b5964a9 from '/home/felix/Desktop/2305.02577v1.pdf'>
Destroy PDFium (auto)

----------------------------------------------------------------------------------------------------------------------------------------------------------------------

(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ USE_TORCH=1 python3 /home/felix/Desktop/doctr/test_scripts/test_pdf.py
Processing:   0%|                                                                                                                                                                                           | 0/5 [00:00<?, ?it/s]19
0.7971949577331543
Processing:  20%|███████████████████████████████████▊                                                                                                                                               | 1/5 [00:00<00:03,  1.25it/s]3
0.021633386611938477
6
0.03635716438293457
20
0.14321637153625488
Processing:  80%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏                                   | 4/5 [00:00<00:00,  4.87it/s]16
0.09462189674377441
Processing: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.57it/s]
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
-> Cannot close object, library is destroyed. This may cause a memory leak!
(doctr-dev) felix@felix-Z790-AORUS-MASTER:~/Desktop/doctr$ 

felixdittrich92 commented 4 weeks ago

@mara004 yeah :sweat_smile: autoclose sounds to me like a context manager under the hood that takes care of closing the file

mara004 commented 4 weeks ago

@felixdittrich92 Erm, concerning the output:

  • This is from two separate invocations, right?
  • In the first one, you added the debug info, but the error did not appear.
  • In the second one, where the error appeared, it seems like you may have forgotten to add the DEBUG_AUTOCLOSE ? Or did you, but there were no debug messages?

> autoclose sounds to me like a context manager under the hood that takes care of closing the file

That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly. I agree that naming was unfortunate. I will probably remove autoclose in v5 and just ask the caller to close the buffer directly rather than via PdfDocument param.

felixdittrich92 commented 4 weeks ago

> @felixdittrich92 Erm, concerning the output:
>
>   • This is from two separate invocations, right?
>   • In the first one, you added the debug info, but the error did not appear.
>   • In the second one, where the error appeared, it seems like you may have forgotten to add the DEBUG_AUTOCLOSE ? Or did you, but there were no debug messages?
>
> > autoclose sounds to me like a context manager under the hood that takes care of closing the file
>
> That's sort of what the finalizers do, but they're always active as a safety net, unless you close explicitly. I agree that naming was unfortunate. I will probably remove autoclose in v5 and just ask the caller to close the buffer directly rather than via PdfDocument param.

mara004 commented 4 weeks ago

Ah, I see. That's unfortunate, as what I am interested in is, of course, the error together with the debug info, because either alone is sort of useless. I am guessing this is again simply due to the non-deterministic nature of garbage collection. Feel free to report back if anyone manages to get the error with debug info.

felixdittrich92 commented 4 weeks ago

Alright :+1:

Now I'm struggling a bit with the scope of this PR, because I agree that the message is a bit scary, but it doesn't look like there is any memory leak on our / on pypdfium2's side .. so the only change would be that the message disappears.

Could you share some profiling logs with and without your changes please @justinjosephmkj ?

mara004 commented 4 weeks ago

> I agree that the message is a bit scary, but it doesn't look like there is any memory leak on our / on pypdfium2's side

Yes, I also intend to clarify the message upstream that this is harmless if it happens only on process exit. In any case, the explicit closing as in https://github.com/mindee/doctr/pull/1624#issuecomment-2143818802 should be fine, provided you require pypdfium2 >= 4.11.

justinjosephmkj commented 4 weeks ago

@felixdittrich92 @mara004 Thank you both for looking into this issue. I have updated pypdfium2 to version 4.30.0. Now the message

-> Cannot close object, library is destroyed. This may cause a memory leak!

only shows up right before the process exits.

Setting the following gives the message for all PDF files:

gc.set_threshold(10, 1, 1)

The issue is not causing a memory leak as of now, but the message still comes up.
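For anyone trying to reproduce this, a hedged sketch of how the gc.set_threshold tweak mentioned above can be combined with the earlier test loop (paths are placeholders; whether this reliably triggers the message is exactly what is being discussed here):

```python
import gc

from doctr.io import DocumentFile

# Make the generational collector run much more often, so wrapper objects are
# collected (and their finalizers fire) at less predictable points.
gc.set_threshold(10, 1, 1)

for path in ["/path/to/first.pdf", "/path/to/second.pdf"]:  # placeholder paths
    doc = DocumentFile.from_pdf(path)
    print(path, len(doc))
```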

mara004 commented 4 weeks ago

> I have updated pypdfium2 to version 4.30.0. Now the message [...] only shows up right before the process exits.

As you can see yourself, there is no related change between 4.28...4.30, so any difference in runtime closing behavior would be coincidence.

> gc.set_threshold(10, 1, 1)

FWIW, I still do not get the message despite adding this, with Fedora-provided Python 3.11.6.


That said, I looked into the weakref module again and found the finalizer atexit attribute. From theoretical considerations I think setting this to False should make the message disappear – can you test the following patch?

diff --git a/src/pypdfium2/internal/bases.py b/src/pypdfium2/internal/bases.py
index af34c30a..4f14ce73 100644
--- a/src/pypdfium2/internal/bases.py
+++ b/src/pypdfium2/internal/bases.py
@@ -72,6 +72,7 @@ class AutoCloseable (AutoCastable):
         # NOTE this function captures the value of the `parent` property at finalizer installation time - if it changes, detach the old finalizer and create a new one
         assert self._finalizer is None
         self._finalizer = weakref.finalize(self._obj, _close_template, self._close_func, self.raw, repr(self), self._autoclose_state, self.parent, *self._ex_args, **self._ex_kwargs)
+        self._finalizer.atexit = False

     def _detach_finalizer(self):
         self._finalizer.detach()
Log trailer without patch:

```
Processing: 100%|██████████| 5/5 [00:02<00:00,  1.89it/s]
Close (auto)
[... repeated "Close (auto)" lines for each remaining page and document ...]
Destroy PDFium (auto)
```
Log trailer with patch:

```
Processing: 100%|██████████| 5/5 [00:02<00:00,  1.92it/s]
Destroy PDFium (auto)
```

^ The only problem I see with this patch is: if you embed a Python interpreter in another process that keeps running after the interpreter has shut down, not invoking finalizers at interpreter shutdown would cause an actual memory leak. The same goes if you get the "cannot close object" warning in an embedded interpreter. But as long as Python is hosted in its own process, it should be fine...
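As a standalone check of the attribute used in the patch above (plain stdlib, no pypdfium2 involved): a still-referenced object's finalizer is normally called at interpreter shutdown, and setting .atexit = False suppresses exactly that call.

```python
import weakref

class Resource:
    pass

def report(tag):
    print(f"finalizer ran for {tag}")

keep = Resource()                       # stays referenced until shutdown
fin = weakref.finalize(keep, report, "keep")
fin.atexit = False                      # as in the patch: skip the at-exit call

# With the default (fin.atexit == True), "finalizer ran for keep" would be
# printed when the process exits; with atexit = False, nothing is printed.
```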

felixdittrich92 commented 3 weeks ago

> self._finalizer.atexit = False

pypdfium2                 4.30.0+10.gff2ff4b.dirty          pypi_0    pypi

Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)

mara004 commented 3 weeks ago

> Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)

That's very weird... You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?

felixdittrich92 commented 3 weeks ago

> Applied your suggested change but this doesn't change anything - same messages (also with debug flag) as before :)
>
> That's very weird... You don't have a conflicting build of pypdfium2 installed, by any chance? like from conda?

Nope, I cloned your repo and installed it in editable mode to apply your suggestion :)

felixdittrich92 commented 3 weeks ago

@justinjosephmkj Could you update your PR with the suggestion from https://github.com/mindee/doctr/pull/1624#issuecomment-2143818802 please ? :) And update https://github.com/mindee/doctr/blob/674a875db8339ff7ad58e3b25441343ae3d01bfe/pyproject.toml#L41 to "pypdfium2>=4.11.0,<5.0.0", same here: https://github.com/mindee/doctr/blob/674a875db8339ff7ad58e3b25441343ae3d01bfe/.conda/meta.yaml#L28

Thanks :)

justinjosephmkj commented 3 weeks ago

@felixdittrich92 I have updated the code in the PR.

felixdittrich92 commented 3 weeks ago

@justinjosephmkj Could you please update the version identifier as mentioned in https://github.com/mindee/doctr/pull/1624#issuecomment-2154150716 also ? And remove the autoclose=True ? :)

justinjosephmkj commented 3 weeks ago

@felixdittrich92 Updated.

> @justinjosephmkj Could you please update the version identifier as mentioned in #1624 (comment) also ? And remove the autoclose=True ? :)

codecov[bot] commented 3 weeks ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 96.42%. Comparing base (9c92f5c) to head (07583e4). Report is 6 commits behind head on main.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main    #1624      +/-   ##
==========================================
+ Coverage   96.25%   96.42%   +0.17%
==========================================
  Files         163      164       +1
  Lines        7707     7745      +38
==========================================
+ Hits         7418     7468      +50
+ Misses        289      277      -12
```

| [Flag](https://app.codecov.io/gh/mindee/doctr/pull/1624/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mindee) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/mindee/doctr/pull/1624/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mindee) | `96.42% <100.00%> (+0.17%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=mindee#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

mara004 commented 2 weeks ago

I'm still confused about what may be the cause of this issue, why .atexit = False reportedly does not prevent it, and why I can't reproduce.

After all, pypdfium2 is simply a caller of atexit.register() / weakref.finalize(), and if that's just how these APIs end up (mis-)behaving for some users, I don't see what we could do about that from our side...

Is there any chance this happens only with specific versions, or builds, of Python?

felixdittrich92 commented 2 weeks ago

@mara004 which OS do you use ? Mh, tried it on Python 3.10 and 3.11, same thing, so I don't think so :sweat_smile:

mara004 commented 2 weeks ago

> which OS do you use ?

I'm on Fedora 37 with system-provided python 3.11.6. Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?

felixdittrich92 commented 2 weeks ago

> which OS do you use ?
>
> I'm on Fedora 37 with system-provided python 3.11.6. Same question to you (which OS?) ;) And are you using a linux distro build of python, an official binary from python.org, or a conda-provided python?

Ubuntu 22.04 LTS and Python from conda. So in this case the closest thing would be to test it in a container again ^^

mara004 commented 2 weeks ago

Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. I'll test on my friends' Ubuntu 20 and 22 devices at some point.

Also, pypdfium2's own test suite does not show the issue on GH actions.

felixdittrich92 commented 2 weeks ago

> Hmm, tried python 3.11.9 from conda on Fedora and again couldn't reproduce. I'll test on my friends' Ubuntu 20 and 22 devices at some point.
>
> Also, pypdfium2's own test suite does not show the issue on GH actions.

Strange :sweat_smile: