Closed macdeport closed 1 month ago
@macdeport
Can you make a test PDF with one page only, using page.remove_text()?
fp='/Users/alain/Documents/Perso/Alain/SDC35rM/sdc35-24-4!4-240905.pdf'
#--------------------------
def pdf_text_test(pdf_path):
    """
    (06/09/24 13:18:36)
    """
    #https://pypdf.readthedocs.io/en/stable/
    #https://pypdf.readthedocs.io/en/stable/user/metadata.html
    from pypdf import PdfReader
    reader = PdfReader(pdf_path)
    #txt=''
    #for page in reader.pages:
    #    txt += page.extract_text() # <= PB Crash
    print(reader.pages[0])
    (reader.pages[0]).remove_text()
    return() # pdf_text()
#--------------------------
pdf_text_test(fp)
{'/Type': '/Page', '/Parent': IndirectObject(3, 0, 4337925520), '/Contents': IndirectObject(5, 0, 4337925520), '/MediaBox': [0, 0, 595, 841], '/Resources': {'/Font': {'/F00': IndirectObject(6, 0, 4337925520), '/F01': IndirectObject(8, 0, 4337925520), '/F02': IndirectObject(10, 0, 4337925520), '/F03': IndirectObject(12, 0, 4337925520)}, '/ProcSet': IndirectObject(15, 0, 4337925520)}}
Traceback (most recent call last):
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 26, in <module>
    pdf_text_test(fp)
  File "/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/com.barebones.bbedit/BBEditRunTemp-untitled text 3.py", line 21, in pdf_text_test
    (reader.pages[0]).remove_text()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PageObject' object has no attribute 'remove_text'
Oops: remove_text() is a PdfWriter method and applies to the whole PDF, not to a single page. So the code should look something like this (off the top of my head):
import pypdf
w = pypdf.PdfWriter()
w.append("original.pdf",[0])
w.remove_text()
w.write("test_file.pdf")
Then check the file: it should no longer contain any sensitive data.
Two pieces of good news:
- remove_text() works perfectly: the private text has completely disappeared.
- dumb_extract_text_crash.pdf continues to produce a crash despite the removal of the text.
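
To narrow the crash down further, it may help to call extract_text() on each page separately and record which ones fail. The following is only a sketch: find_failing_pages and report are hypothetical helper names, not part of pypdf, and the approach assumes the "crash" is an ordinary Python exception (a hang or a hard interpreter crash would not be caught this way).

```python
def find_failing_pages(pages):
    """Call extract_text() on each page and collect the ones that raise.

    `pages` is any iterable of objects exposing extract_text(), such as
    reader.pages from pypdf. Returns a list of (page_index, exception).
    """
    failures = []
    for index, page in enumerate(pages):
        try:
            page.extract_text()
        except Exception as exc:  # broad on purpose: record any failure
            failures.append((index, exc))
    return failures


def report(pdf_path):
    """Print the failing page indexes and their tracebacks (hypothetical helper)."""
    # Imported here so find_failing_pages stays usable without pypdf installed.
    import traceback
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    for index, exc in find_failing_pages(reader.pages):
        print(f"page {index}: {type(exc).__name__}: {exc}")
        traceback.print_exception(type(exc), exc, exc.__traceback__)
```

Calling report("dumb_extract_text_crash.pdf") would then print one traceback per failing page, which can be pasted into the issue without sharing the private document.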
Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue:
Sorry I can't share this PDF with private information.
Traceback
This is the complete traceback I see: