py-pdf / pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of PDF files
https://pypdf.readthedocs.io/en/latest/

Support for outline item external references #2648

Closed dawillcox closed 5 months ago

dawillcox commented 5 months ago

Explanation

I'm not sure if this is a request for a new feature or documentation to explain how this is already possible...

My knowledge of the PDF internal format is microscopic, but I know that PDF supports internal links (to images, pages, etc.) and external links (web pages, other files, email addresses, ...). I don't see how pypdf supports external links.

Here's my situation: I have a PDF file (from a CD I purchased) that has outline links to pages and external files. It's a scan of a book, almost 1200 pages, so the links to sections of the document are quite handy. Trouble is, the pages are all just images. It would be very useful to be able to search for text and copy text for use elsewhere. (Fair use, of course.)

Yes, I know there are resources that OCR scan PDF files, but everything I've tried balks at a file that large, at least without a charge.

So I:

  1. Split the big file into 100 page chunks.
  2. OCR scanned each chunk.
  3. Merged the scanned chunks back into a single file.

Which worked perfectly. Except, while the text in the result is all nicely scanned, the outline is gone. So, I'm using pypdf to merge the original document's outline into the scanned document. And that works fine for the outline options that are just headers, and links to pages within the document, but the external links are gone.

See code example below. This is just the inner logic to deal with a single outline entry, obviously there's outer logic to deal with lists and embedded lists.

Code Example

Here's what I'm doing now:

from pypdf import PdfReader, PdfWriter

# Setup is basically this:
from_file = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_file = PdfReader(open(SCANNED_FILE, 'rb'))
to_file = PdfWriter()
to_file.append_pages_from_reader(scanned_file)

# so at this point, from_file has the desired outlines, and
# to_file has all of the OCR scanned pages but no outlines. 
# (Or much of anything else.)

# Then follows loops to apply Destinations from from_file.outline to to_file. 
# Omitting the looping logic, each destination is handled as:

        pgno = from_file.get_destination_page_number(outline)
        if pgno is None:
            next_parent = to_file.add_outline_item_dict(outline, parent=parent_outline)
        else:
            next_parent = to_file.add_outline_item(outline.title, page_number=pgno, parent=parent_outline)

# next_parent becomes parent_outline for embedded lists.

# This works fine for references to pages, but external references are lost.
# They just become an item in the outline, but they don't behave like 
# in the original document.

So the question is: Is this something that can be done with the current release, but it's too obscure for me to figure out? Or would it be a useful addition in the future? Said feature probably would need a way to tell if an existing outline entry was an external reference, plus a way to specify such a reference in a new file.

Though now that I think of it, outlines can point to other internal things like images. Maybe those are IndirectObjects so already supported?

pubpub-zz commented 5 months ago

I may have an idea, but I would need an example of an original page and the output of the OCR processing to confirm it

dawillcox commented 5 months ago

The problem is the file is quite large, and just a single page wouldn't demonstrate the problem. I could try this on a smaller file, though.

pubpub-zz commented 5 months ago

I would like to see if I can merge the scanned data back into the original page. Let me do my test 😉... It should be worth it.

dawillcox commented 5 months ago

That would be awesome, but I couldn't see how I'd even start. You'd have to extract the text along with all of the data that matched text to location in the image. I see no way to do that.

pubpub-zz commented 5 months ago

Please do what I asked: extract one page from your original doc:

import pypdf

w = pypdf.PdfWriter()
w.append("doc_source.pdf", pages=[10])  # replace 10 with a page whose text is not sensitive or copyrighted
w.write("one_page_out.pdf")

Then apply the OCR process you've selected (out of pypdf's scope) and publish one_page_out.pdf and the processed page.

dawillcox commented 5 months ago

The trouble with that is I don't know how I'd create the outlines items to correspond to the one page.

So here's a variation. Two files that reproduce the issue without being huge and with just a cover page, so nobody should be unhappy about content. Files will be attached, I hope. Here's my tacky code to do what I want:

from pypdf.generic import Destination
from pypdf import PdfReader, PdfWriter

# This stands in for the original file. All of the images are removed,
# just the first couple of pages are there.
ORIGINAL_FILE = 'with_outline.pdf'

# This stands in for the OCR scanned file. The outline is gone, but
# a couple of pages have text added that isn't in ORIGINAL_FILE.
# This will verify that the final product has pages from SCANNED_FILE.
SCANNED_FILE = 'altered_pages.pdf'

# This is the output of the merge.  A couple of pages are marked to verify
# that the 'A' and 'B' outline items go to the right place.
OUTPUT_FILE = 'after_merge.pdf'

def copy_index(from_file: PdfReader, to_file: PdfWriter, outlines, parent_outline=None):
    next_parent = parent_outline
    for outline in outlines:
        if isinstance(outline, Destination):
            pgno = from_file.get_destination_page_number(outline)
            if pgno is None:
                next_parent = to_file.add_outline_item_dict(outline,
                                                            parent=parent_outline)
            else:
                next_parent = to_file.add_outline_item(outline.title,
                                                       page_number=pgno,
                                                       parent=parent_outline)
        elif isinstance(outline, list):
            copy_index(from_file, to_file, outline, parent_outline=next_parent)

index_pdf = PdfReader(open(ORIGINAL_FILE, 'rb'))
scanned_pdf = PdfReader(open(SCANNED_FILE, 'rb'))
writer = PdfWriter()
writer.append_pages_from_reader(scanned_pdf)
copy_index(index_pdf, writer, index_pdf.outline)
writer.write(OUTPUT_FILE)

altered_pages.pdf with_outline.pdf

pubpub-zz commented 5 months ago

No worries about the outlines; these will naturally be copied from your scanned document. What I'm interested in is altered_pages. Looking at page 2, I can see clear text: can you clarify whether this is the output of the OCR? Can you extract and send the original page? If you use the code I've provided, the outlines should be extracted too.

Once I have both, I should be able to propose some code to merge the text/content from altered_pages. I need at least one page with the images; the file with_outline is useless for me.

dawillcox commented 5 months ago

I'm clearly not communicating this well.

Yes, generally when you do an OCR scan of a document, the outlines are preserved. Task done, game over. No complaints here.

The problem is that my actual document is so large that scanners balk. So I split the big file into smaller chunks, scanned each chunk, then joined the chunks into one big file. That leaves me with another big file with all of the scanned text but no outline. I'm trying to copy the outline from the original file to the scanned one.

The files and code I just uploaded demonstrate the problem of copying indexes; the content of the pages shouldn't be an issue.

But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

But still, wouldn't it be nice if pypdf had some kind of support for external links?

Update: Replacing the pages in the original document with the scanned pages doesn't work, presumably because the outline refers to the actual page, and if the page is removed the outline can't point at it any more.

pubpub-zz commented 5 months ago

What I have in mind is the following approach: using

from pypdf import PdfWriter

w = PdfWriter()
w.append("input.pdf", (0, 50))
w.write("chunk1.pdf")
w = PdfWriter()
w.append("input.pdf", (50, 100))
w.write("chunk2.pdf")

you will have chunks of the input that keep their outlines. From your comments I understand that outlines are preserved by OCR, so if you use:

w = PdfWriter()
w.append("OCR_chunk1.pdf")
w.append("OCR_chunk2.pdf")
w.write("fullOCR_with_outlines.pdf")

it should work.
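For a file of around 1200 pages, the chunk boundaries could be generated rather than typed out. This is only a minimal sketch: chunk_ranges is a hypothetical helper, and it assumes the (start, stop) tuples passed to append above are half-open ranges, as with Python's range.

```python
from typing import Iterator, Tuple


def chunk_ranges(page_count: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, stop) page ranges that cover page_count pages."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, page_count, chunk_size):
        # The last chunk is clipped so it never runs past the final page.
        yield (start, min(start + chunk_size, page_count))


ranges = list(chunk_ranges(1200, 100))

# Possible use with pypdf, following the snippet above (not run here):
#   for i, page_range in enumerate(chunk_ranges(page_count, 100), start=1):
#       w = PdfWriter()
#       w.append("input.pdf", page_range)
#       w.write(f"chunk{i}.pdf")
```

Each tuple starts where the previous one stops, so no page is duplicated or skipped when the OCR'd chunks are appended back in order.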

Regarding your proposal:

> But hmm. I wonder if I could load the original document into a writer, delete the pages, then add the pages from the scanned document. That would be way simpler.

You should not need to remove the image: use w = PdfWriter("original.pdf") to create the writer, then use .merge_page(page_from_reader, over=False) to place the text behind the image.

> But still, wouldn't it be nice if pypdf had some kind of support for external links?

I agree. I thought it was already in; I need to check more.

dawillcox commented 5 months ago

So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.
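A toy model of that behaviour, in plain Python rather than pypdf objects: the bookmark holds the page object itself, and the page number is only looked up when needed. Page and page_number are made-up names for illustration; this is not how pypdf stores pages internally.

```python
from typing import List, Optional


class Page:
    """Stand-in for a PDF page object; identity matters, not position."""

    def __init__(self, label: str) -> None:
        self.label = label


def page_number(pages: List[Page], target: Page) -> Optional[int]:
    """Return the current index of target, or None if it is no longer present."""
    for index, page in enumerate(pages):
        if page is target:
            return index
    return None


pages = [Page("cover"), Page("chapter A"), Page("chapter B")]
bookmark = pages[2]  # the outline item refers to the object, not to index 2

pages.insert(1, Page("preface"))
moved_to = page_number(pages, bookmark)  # same page, new position

pages.remove(bookmark)
after_removal = page_number(pages, bookmark)  # the bookmark now dangles
```

The None result after removal is similar in spirit to the pgno is None branch in the code earlier in the thread.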

If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.

Bottom line, I finally got ocrmypdf working. (I had problems with the ghostscript library before.) I found that

I'm guessing that the best bet would be to somehow copy the scanned text and the hints that say where it's placed and apply that to the original pages. No clue where to start for that, though. Certainly no clues from pypdf. And I can't be sure that the original pages weren't adjusted somehow.

So, bottom line, the file I have, absent the external links, works well enough for my purposes. I'd love to know how external references and/or the OCR-applied text works, and could be moved from one file to another. But at this point it's more a matter of intellectual curiosity.
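On the curiosity about external references: going by the PDF specification rather than anything confirmed in this thread, an outline item either has a /Dest entry (a destination inside the file) or an /A action dictionary whose /S entry names the action type. /GoTo stays in the document, while /GoToR (another PDF), /Launch (open a file) and /URI (web address) point outside it. A minimal sketch of telling them apart follows. The names describe_outline_action and walk_outline, and the file name welcome.pdf, are made up for illustration; the functions work on plain dict-like nodes, on the assumption that pypdf's raw outline dictionaries (which are dict subclasses) can be passed the same way.

```python
from typing import Any, Iterator, Mapping, Optional, Tuple

# Action types from the PDF specification that leave the current document.
EXTERNAL_ACTIONS = {"/GoToR", "/Launch", "/URI"}


def describe_outline_action(node: Mapping[str, Any]) -> Tuple[str, Optional[Any]]:
    """Return (kind, target) for one raw outline item dictionary."""
    if "/Dest" in node:
        return ("internal", node["/Dest"])
    if "/A" not in node:
        return ("none", None)  # a plain heading with no target
    action = node["/A"]
    kind = str(action["/S"])
    if kind == "/GoTo":
        return ("internal", action["/D"] if "/D" in action else None)
    if kind == "/URI":
        return ("external", action["/URI"])
    if kind in EXTERNAL_ACTIONS:
        # /GoToR and /Launch name their target file under /F.
        return ("external", action["/F"] if "/F" in action else None)
    return ("other", None)


def walk_outline(
    first: Optional[Mapping[str, Any]], depth: int = 0
) -> Iterator[Tuple[int, str, str, Optional[Any]]]:
    """Yield (depth, title, kind, target), following /First and /Next links."""
    node = first
    while node is not None:
        kind, target = describe_outline_action(node)
        yield (depth, node["/Title"] if "/Title" in node else "", kind, target)
        if "/First" in node:
            yield from walk_outline(node["/First"], depth + 1)
        node = node["/Next"] if "/Next" in node else None


# Hand-built stand-ins for two sibling outline items (hypothetical data).
welcome = {
    "/Title": "Welcome Page",
    "/A": {"/S": "/GoToR", "/F": "welcome.pdf", "/D": [0, "/Fit"]},
}
chapter = {"/Title": "Chapter A", "/Dest": ["page-object", "/Fit"], "/Next": welcome}
items = list(walk_outline(chapter))
```

With pypdf the starting node would presumably be reader.trailer["/Root"]["/Outlines"]["/First"]; that path is an assumption and has not been checked against the files attached here. A malformed outline whose /Next entries form a cycle would make this loop forever.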

stefan6419846 commented 5 months ago

Your code seems to handle outlines only. Shouldn't external links (however this would behave with scanned files) rather be an annotation (https://pypdf.readthedocs.io/en/latest/user/adding-pdf-annotations.html#link)?

dawillcox commented 5 months ago

Hmm. You may be onto something. But can annotations be on an outline item? The external links (to other files) in the original file seem to be attached to the outline (TOC) entries, not pages. If you click on an entry on the TOC it opens an external file, or goes to a page in the file. The latter is what my code does, but it can't figure out the external links.

Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

But doing a preliminary investigation, what the documentation shows for finding annotations on a page doesn't find annotations on outline items.

stefan6419846 commented 5 months ago

> Conversely, I'm guessing that OCR'd text may be an annotation and I could use that to copy the OCR'd text to the original document.

Usually no. This probably just is a basic text layer, maybe with an "invisible" font which allows copying the text, but does not conflict with the possibly different font and text parameters of the scanned image.
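One way such a layer is commonly made invisible is text rendering mode 3 (neither fill nor stroke), set in the page's content stream with the Tr operator. Whether the OCR tool used in this thread does that is an assumption. A rough sketch of checking for it on already-decoded content stream bytes follows; text_render_modes and the sample stream are made up for illustration, and the regex is a naive scan that could be fooled by similar bytes inside a string literal.

```python
import re
from typing import Set

# Matches the text rendering mode operator, e.g. b"3 Tr".
_TR_OPERATOR = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")


def text_render_modes(content: bytes) -> Set[int]:
    """Return the text rendering modes set in a decoded content stream."""
    return {int(match.group(1)) for match in _TR_OPERATOR.finditer(content)}


# Hypothetical content stream: select a font, switch to mode 3, draw text.
sample = b"BT /F1 12 Tf 3 Tr 72 700 Td (Adobe Systems Incorporated) Tj ET"
modes = text_render_modes(sample)

# With pypdf the bytes might come from page.get_contents().get_data();
# that call chain is an assumption and is not exercised here.
```

A page whose modes include 3 has text that can be selected and copied but is never painted, which would match the "text is invisible but present" description.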

dawillcox commented 5 months ago

Just stepping through the .extract_text() code in page, I can see that pulling out the OCR results and applying them to another page would be a challenge (putting it mildly).

But can you suggest how annotations and outlines might be related? For example, if you click "Welcome Page" in the TOC of the with_outline.pdf file I attached earlier, it tries to open another document. Which isn't there so it fails, but at least the reader tried to open the file.

pubpub-zz commented 5 months ago

I've finally been able to generate the test I was expecting. File with image only but with outline: tt1_outline.pdf. Output of the OCR (the text is invisible but present, with an image on top): tt1-sortie.pdf

then to merge it we can use the following code:

import pypdf
w = pypdf.PdfWriter("tt1_outline.pdf")
w2 = pypdf.PdfWriter("tt1-sortie.pdf")
w2.remove_images()   # to remove the scanned image before merging
w.pages[0].merge_page(w2.pages[0], over=False)  # the OCR page goes behind so that it does not overlay the original image
w.write("tt1_merged.pdf")
w.pages[0].extract_text(extraction_mode="layout")
#returns : 'PDF        Reference\n  sixthedition\n\n\n  Adobe°  Portable  Document   Format\n     Version1.7\n     November2006\n\n\n\n     Adobe  SystemsIncorporated'

the output: tt1_merged.pdf

pubpub-zz commented 5 months ago

> So here's the problem. Outline items that point to a page refer to a specific page object, not just a page number. (Or an image, or other internal object.) That way, if you have an index set up, and then add or remove pages, the outline item still points to the same content.

If you look into the PDF spec, this is the way pages are pointed to. Page numbers are reserved for links to external pages.

> If you remove a page I don't know if the index item is deleted or just doesn't point anywhere at all. If you remove a page with clean=True, the deleted page is replaced by a blank one and any index still points to it (I think).

Correct.

> Unfortunately, there seems to be no way to replace the content of a page, keeping the page ID the same but new content.

Using replace_content will not transfer the resources. My solution works.

> And looking at outline content in a debugger, I haven't been able to suss out how external destinations are specified. That seems to be thoroughly obfuscated in the code.

I recommend using PDFBox with its debug option.

dawillcox commented 5 months ago

Well, except I had to change

w = pypdf.PdfWriter("tt1_outline.pdf")

to

w = PdfWriter()
w.clone_document_from_reader(PdfReader("tt1_outline.pdf"))

for each file (otherwise the pages list was empty), but that worked! Magic!

I ran this on my monster file, and it worked too. (Took a while, though. A lot of mucking with stuff happens in there.) Thanks loads for your help! I never would have figured this out by myself.

I still wonder how external links from outline entries work, but at this point it's just intellectual curiosity, and I have plenty other things to keep me busy.

pubpub-zz commented 5 months ago

Oops, I'm on the dev version. Use PdfWriter(clone_from='input pdf')

stefan6419846 commented 5 months ago

I am going to close this issue for now as it sounds solved.

dawillcox commented 5 months ago

Well, yes, we found a solution to my particular problem. It still would be nice if support for external links (in table of contents and maybe other places), both creating and finding, could be added to the list of possible enhancements.

stefan6419846 commented 5 months ago

External links should already be supported. For further discussions or issues about it, I recommend opening a new discussion with an explicit example file.