Preserve hypertext links

philagee commented 4 years ago

Problem: Output files from OCR (output-type=pdf or pdfa, with or without -f, with or without -c) lose the hypertext links associated with text (e.g. hyperlinks for table of contents entries, inline footnote numbers, inline endnote numbers that jump to a page, footnote text, or endnote text). Text in the output file is no longer hyperlinked.

Solution: Preserve hyperlinks so that text in output file is hyperlinked as defined in input file.

Thank you for an amazing tool!

jbarlow83 commented 4 years ago

Can you provide a file that demonstrates this and the command line you used?

There are multiple ways software can create a hyperlink in a PDF.

philagee commented 4 years ago

Attached is a file created using the following command line:

ocrmypdf -f

Clicking on the entry for chapter one in the table of contents does not navigate to the page where the chapter begins.

pride-and-prejudice-ocrmypdf.pdf

Thanks!

jbarlow83 commented 4 years ago

By design, --force-ocr is going to discard hyperlinks and other active content. A major use case of this feature is getting as much content as possible out of damaged PDF files, such as those that have missing Unicode tables. (This feature is not well named, I admit, but it's also hard to describe in a word or two.)

I would still like to look at the case without --force-ocr. Do you have the original input file, before processing with ocrmypdf? It looks like you sent me the one after processing. I'd like to compare the two to see what got dropped along the way.

philagee commented 4 years ago

Attached is the original.

I tried converting without any options and got the notice about already existing text (PriorOcrFoundError: page already has text!). So I then added --force-ocr.

pride-and-prejudice.pdf

jbarlow83 commented 4 years ago

--force-ocr is going to discard hyperlinks; that is intended behavior.

If you use --skip-text to skip OCR on pages that already have printable text (i.e. all pages on this file), and --output-type pdf to disable PDF/A conversion, the hyperlinks appear and still work.
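
For anyone scripting this, here is a minimal sketch of the invocation described above. It only assembles the argument list; the helper name `link_preserving_command` and the file names are made up for illustration, and actually running it assumes `ocrmypdf` is installed and on PATH.

```python
import subprocess


def link_preserving_command(input_pdf, output_pdf):
    """Build an ocrmypdf command line that keeps existing hyperlinks.

    Hypothetical helper, not part of ocrmypdf itself.
    """
    return [
        "ocrmypdf",
        "--skip-text",           # skip OCR on pages that already have text
        "--output-type", "pdf",  # disable the default PDF/A conversion
        input_pdf,
        output_pdf,
    ]


def run(input_pdf, output_pdf):
    # Raises CalledProcessError if ocrmypdf exits with a non-zero status.
    subprocess.run(link_preserving_command(input_pdf, output_pdf), check=True)
```

Calling `run("pride-and-prejudice.pdf", "out.pdf")` would be equivalent to typing `ocrmypdf --skip-text --output-type pdf pride-and-prejudice.pdf out.pdf` in a shell (the output name is a placeholder).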

jbarlow83 commented 4 years ago

It seems that Ghostscript's PDF/A conversion removes links, even with -dPrinted=false, which, as explained here, should prevent links from being deleted.

I suppose my answer needs to be that, if you want to keep hyperlinks, use --output-type pdf instead of the default. I don't think I will escalate this to Ghostscript.

philagee commented 4 years ago

I can confirm that --skip-text and --output-type pdf preserve the hyperlinks.

Thank you again for making and supporting this extremely useful tool!

damnms commented 3 years ago

I have a .pdf that contains text and a picture on nearly every page. Applying only --output-type pdf destroys the ToC's hyperlinks; it's required to add --skip-text, which leads to the situation that nothing is OCR'd ... Thanks for your tool, I really appreciate it, but that's unfortunately a killer for me :/

ocrmypdf 20HS_AFM_Zusammenfassung.pdf 20HS_AFM_Zusammenfassung.ocr.pdf --output-type pdf --skip-text
Scanning contents: 100%|█████████████████████| 95/95 [00:00<00:00, 399.12page/s]
Start processing 16 pages concurrently
    1 skipping all processing on this page                                      
    3 skipping all processing on this page                                      
    5 skipping all processing on this page                                      
    8 skipping all processing on this page                                      
    4 skipping all processing on this page                                      
    9 skipping all processing on this page                                      
   17 skipping all processing on this page                                      
   10 skipping all processing on this page                                      
    6 skipping all processing on this page                                      
   19 skipping all processing on this page                                      
    7 skipping all processing on this page                                      
   20 skipping all processing on this page                                      
   21 skipping all processing on this page                                      
    2 skipping all processing on this page                                      
   18 skipping all processing on this page                                      
   16 skipping all processing on this page                                      
   24 skipping all processing on this page                                      
   22 skipping all processing on this page                                      
   23 skipping all processing on this page                                      
   11 skipping all processing on this page                                      
   13 skipping all processing on this page                                      
   12 skipping all processing on this page                                      
   14 skipping all processing on this page                                      
   15 skipping all processing on this page                                      
   25 skipping all processing on this page                                      
   27 skipping all processing on this page                                      
   28 skipping all processing on this page                                      
   29 skipping all processing on this page                                      
   26 skipping all processing on this page                                      
   30 skipping all processing on this page                                      
   31 skipping all processing on this page                                      
   32 skipping all processing on this page                                      
   33 skipping all processing on this page                                      
   34 skipping all processing on this page                                      
   35 skipping all processing on this page                                      
   37 skipping all processing on this page                                      
   38 skipping all processing on this page                                      
   39 skipping all processing on this page                                      
   40 skipping all processing on this page                                      
   42 skipping all processing on this page                                      
   41 skipping all processing on this page
OCR: 100%|██████████████████████████████| 95.0/95.0 [00:00<00:00, 2896.16page/s]
   36 skipping all processing on this page
   43 skipping all processing on this page
   44 skipping all processing on this page
   51 skipping all processing on this page
   50 skipping all processing on this page
   59 skipping all processing on this page
   57 skipping all processing on this page
   46 skipping all processing on this page
   53 skipping all processing on this page
   60 skipping all processing on this page
   48 skipping all processing on this page
   47 skipping all processing on this page
   55 skipping all processing on this page
   54 skipping all processing on this page
   62 skipping all processing on this page
   61 skipping all processing on this page
   49 skipping all processing on this page
   45 skipping all processing on this page
   56 skipping all processing on this page
   52 skipping all processing on this page
   58 skipping all processing on this page
   63 skipping all processing on this page
   64 skipping all processing on this page
   66 skipping all processing on this page
   65 skipping all processing on this page
   67 skipping all processing on this page
   68 skipping all processing on this page
   69 skipping all processing on this page
   70 skipping all processing on this page
   71 skipping all processing on this page
   72 skipping all processing on this page
   73 skipping all processing on this page
   74 skipping all processing on this page
   75 skipping all processing on this page
   76 skipping all processing on this page
   77 skipping all processing on this page
   78 skipping all processing on this page
   79 skipping all processing on this page
   80 skipping all processing on this page
   81 skipping all processing on this page
   82 skipping all processing on this page
   83 skipping all processing on this page
   84 skipping all processing on this page
   86 skipping all processing on this page
   85 skipping all processing on this page
   88 skipping all processing on this page
   89 skipping all processing on this page
   93 skipping all processing on this page
   90 skipping all processing on this page
   91 skipping all processing on this page
   87 skipping all processing on this page
   92 skipping all processing on this page
   95 skipping all processing on this page
   94 skipping all processing on this page
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 0.99 savings: -0.7%
Image optimization did not improve the file - discarded

Also, the .docx is ~30 MB and the regular .pdf is about 11 MB; when I OCR'd it (with --force-ocr), the result was ~70 MB (roughly 6 times the size).

I installed ocrmypdf 11.7.0 and ran that command with --redo-ocr: the ToC is clickable, awesome!! Thanks! The size is also the same as the LibreOffice-generated .pdf, perfect!
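
The exact command is not shown in this comment, so the following is only a sketch of what it might have looked like, assuming --redo-ocr simply replaced --skip-text and --output-type pdf was kept. The helper `build_command` is hypothetical; the one thing it encodes for certain is that ocrmypdf accepts only one of --force-ocr, --skip-text, and --redo-ocr at a time.

```python
# The three mutually exclusive ways ocrmypdf handles pages that already have text.
OCR_MODES = ("skip-text", "redo-ocr", "force-ocr")


def build_command(input_pdf, output_pdf, mode="redo-ocr", output_type="pdf"):
    """Assemble an ocrmypdf command line with exactly one OCR mode.

    Hypothetical helper for illustration; it does not run anything.
    """
    if mode not in OCR_MODES:
        raise ValueError(f"mode must be one of {OCR_MODES}, got {mode!r}")
    return [
        "ocrmypdf",
        f"--{mode}",                   # exactly one mode flag is emitted
        "--output-type", output_type,  # "pdf" avoids the PDF/A conversion
        input_pdf,
        output_pdf,
    ]
```

Joining the result of `build_command("in.pdf", "out.pdf")` with spaces gives `ocrmypdf --redo-ocr --output-type pdf in.pdf out.pdf`; the file names are placeholders.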

ocrmypdf / OCRmyPDF

Preserve hypertext links #605