Open mara004 opened 7 months ago
@semoal Thanks. There is no set time frame, and I'm currently immersed in some other things, but I'm hoping to merge sooner rather than later, to avoid another stalled and diverged branch, as unfortunately happened on the previous attempt at this.
However, this project has grown a bit over my head TBH, and I'm somewhat scared of breaking anything or making wrong API decisions, as this may affect many downstreams. Also, I'd like to address all API-breaking or otherwise significant changes I had in mind before going ahead with this.
Out of interest, is there any particular change you're looking forward to?
Having the flatten function exposed would be a painkiller: we're struggling to extract information from a PDF with filled form fields. Once it's stabilized, I would create a pre-release or release candidate and start gathering feedback from there. There's no future without breaking changes ;)
I see. FWIW, you can already use the semi-private page._flatten() if you make sure init_forms() was called on the parent pdf before page retrieval (ideally, directly after construction).
The bindings code is the same, just a check added and docs updated. You could also copy the flatten() implementation over into your own code.
Sorry for the inconvenience; this originated from a time where form initialization wasn't integrated properly.
@mara004 I have tried using page._flatten() following your instructions, like this:
import pypdfium2
import pypdfium2.raw as pdfium_c

pdf = pypdfium2.PdfDocument(pdf_path)
pdf.init_forms()  # initialize forms before any page is retrieved
for page_idx in page_range:  # page_range --> 2
    page = pdf.get_page(page_idx)
    page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY)  # returns 1
    text_page = page.get_textpage()
    # second pass: fetch the page again and flatten once more
    page = pdf.get_page(page_idx)
    page._flatten(flag=pdfium_c.FLAT_NORMALDISPLAY)  # returns 2
    text_page = page.get_textpage()
    ...
    total_chars = text_page.count_chars()
I have to repeat the get_page() / _flatten() sequence twice to get all the editable values from the form.
Number of characters without repeating the _flatten() code: total_chars = 4619
Number of characters with the repeated _flatten() code: total_chars = 5014
I can't really comment on that behavior as I'm only providing the bindings, and what the underlying APIs actually do is down to pdfium.
However, given that the second _flatten() call returns 2, which is equal to pdfium_c.FLATTEN_NOTHINGTODO, it should be a no-op (FWIW, you can take a look at the fpdf_flatten.cpp code and see when FLATTEN_NOTHINGTODO is returned).
So perhaps you just have to re-initialize the page handle? Or maybe call page.gen_content() after flattening?
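For anyone reading along, here is a minimal sketch of how the return codes mentioned above could be interpreted. It assumes the FLATTEN_* values from pdfium's public fpdf_flatten.h header (0 = fail, 1 = success, 2 = nothing to do). The names FlattenResult and page_was_changed are hypothetical helpers for illustration, not part of pypdfium2.

```python
from enum import IntEnum


class FlattenResult(IntEnum):
    # Assumed to mirror the FLATTEN_* macros in pdfium's fpdf_flatten.h
    FAIL = 0
    SUCCESS = 1
    NOTHINGTODO = 2


def page_was_changed(rc: int) -> bool:
    """Return True if a flatten call with return code `rc` modified the page.

    Raises RuntimeError if pdfium reported a failure, and ValueError
    if `rc` is not one of the known codes.
    """
    result = FlattenResult(rc)
    if result is FlattenResult.FAIL:
        raise RuntimeError("flattening failed")
    return result is FlattenResult.SUCCESS
```

With the values reported above, page_was_changed(1) is True for the first call and page_was_changed(2) is False for the second, which matches the expectation that the second call is a no-op.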
In the end, I just had to re-initialize the page handle. The page.gen_content() option didn't work. Thanks!
That makes sense. I can add a note to the future docs that flattening invalidates existing handles to the page.
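As a rough sketch of the workaround that came out of this thread: flatten once, then fetch a fresh page handle before extracting text. The helper name textpage_after_flatten is hypothetical. The pdf argument is assumed to behave like a pypdfium2.PdfDocument on which init_forms() has already been called, and flag=0 is assumed to correspond to pdfium's FLAT_NORMALDISPLAY. It only uses the calls shown earlier in this thread (get_page(), the semi-private _flatten(), get_textpage()).

```python
def textpage_after_flatten(pdf, page_idx, flag=0):
    """Flatten one page, then return (return_code, textpage) from a fresh handle.

    `pdf` is assumed to be a document object with init_forms() already called.
    """
    page = pdf.get_page(page_idx)
    rc = page._flatten(flag=flag)  # semi-private API, as discussed above

    # The handle obtained before flattening may not reflect the change,
    # so retrieve the page again instead of reusing the old handle.
    page = pdf.get_page(page_idx)
    return rc, page.get_textpage()
```

Usage would look like rc, text_page = textpage_after_flatten(pdf, page_idx), followed by text_page.count_chars(). Whether the stale handle should also be closed explicitly is left open here.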
The changes look quite good! Really impressed. Is there an ETA?