Open philagee opened 4 years ago
Can you provide a file that demonstrates this and the command line you used?
There are multiple ways software can create a hyperlink in a PDF.
On Thu., Jul. 30, 2020, 13:47 Phil Agee, notifications@github.com wrote:
Problem: Output files from OCR (output-type=pdf or pdfa, with or without -f, with or without -c) lose hypertext links associated with text (e.g. hyperlinks for table of contents entries, inline footnote numbers, inline endnote numbers that jump to page, footnote text, or endnote text). Text in output file is no longer hyperlinked.
Solution: Preserve hyperlinks so that text in output file is hyperlinked as defined in input file.
Thank you for an amazing tool!
Attached is a file created using the following command line:
ocrmypdf -f
Clicking on the entry for chapter one in the table of contents does not navigate to the page where the chapter begins.
pride-and-prejudice-ocrmypdf.pdf .
Thanks!
By design, --force-ocr is going to discard hyperlinks and other active content. A major use case of this feature is getting as much content as possible out of damaged PDF files, such as those that have missing Unicode tables. (This feature is not well named, I admit, but it's also hard to describe in a word or two.)
I would still like to look at the case without --force-ocr. Do you have the original input file, before processing with ocrmypdf? It looks like you sent me the one after processing. I'd like to compare the two to see what got dropped along the way.
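For a quick first comparison of the two files, one rough check is to count link-annotation markers in the raw bytes of each. This is a minimal sketch, not how OCRmyPDF inspects files: it assumes the annotation dictionaries are stored uncompressed, so annotations inside compressed object streams are missed and a count of zero is not conclusive. The names `count_link_markers` and `compare_files` are hypothetical.

```python
import re
from pathlib import Path

# Matches the dictionary entry that marks a link annotation,
# written either as "/Subtype /Link" or "/Subtype/Link".
LINK_MARKER = re.compile(rb"/Subtype\s*/Link\b")


def count_link_markers(data: bytes) -> int:
    """Count literal link-annotation markers in raw PDF bytes (heuristic only)."""
    return len(LINK_MARKER.findall(data))


def compare_files(before_path: str, after_path: str) -> tuple[int, int]:
    """Return the marker counts for the input file and the OCR output file."""
    before = count_link_markers(Path(before_path).read_bytes())
    after = count_link_markers(Path(after_path).read_bytes())
    return before, after
```

Calling `compare_files` with the original PDF and the ocrmypdf output gives two counts; a large drop from the first to the second suggests links were removed, though a proper PDF library would be needed to confirm it.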
Attached is the original.
I tried converting without any options and got the notice about already existing text (PriorOcrFoundError: page already has text!). So I then added --force-ocr.
--force-ocr is going to discard hyperlinks; that is intended behavior.
If you use --skip-text to skip OCR on pages that already have printable text (i.e. all pages in this file), and --output-type pdf to disable PDF/A conversion, the hyperlinks appear and still work.
It seems that Ghostscript's PDF/A conversion removes links, even with -dPrinted=false, which as explained here should prevent links from being deleted.
I suppose my answer needs to be that, if you want to keep hyperlinks, use --output-type pdf instead of the default. I don't think I will escalate this to Ghostscript.
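The workaround above can be scripted. A minimal sketch, assuming `ocrmypdf` is on the PATH; the helper names `build_ocrmypdf_command` and `run_ocrmypdf` and the file names are hypothetical, and only the two flags discussed in this thread are used.

```python
import shutil
import subprocess


def build_ocrmypdf_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Build the command line that keeps hyperlinks, per the workaround above."""
    return [
        "ocrmypdf",
        "--skip-text",           # skip OCR on pages that already have text
        "--output-type", "pdf",  # disable PDF/A conversion via Ghostscript
        input_pdf,
        output_pdf,
    ]


def run_ocrmypdf(input_pdf: str, output_pdf: str) -> None:
    """Run the command; raises if ocrmypdf is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise FileNotFoundError("ocrmypdf not found on PATH")
    subprocess.run(build_ocrmypdf_command(input_pdf, output_pdf), check=True)
```

For example, `run_ocrmypdf("input.pdf", "output.pdf")` runs the equivalent of `ocrmypdf --skip-text --output-type pdf input.pdf output.pdf`.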
I can confirm that --skip-text and --output-type pdf preserve the hyperlinks.
Thank you again for making and supporting this extremely useful tool!
I have a .pdf that contains text and a picture on nearly every page. Applying only --output-type pdf destroys the ToC's hyperlinks; it's required to also use --skip-text, which leads to the situation that nothing is OCR'd ... Thanks for your tool, I really appreciate it, but that's unfortunately a killer for me :/
ocrmypdf 20HS_AFM_Zusammenfassung.pdf 20HS_AFM_Zusammenfassung.ocr.pdf --output-type pdf --skip-text
Scanning contents: 100%|█████████████████████| 95/95 [00:00<00:00, 399.12page/s]
Start processing 16 pages concurrently
1 skipping all processing on this page
3 skipping all processing on this page
5 skipping all processing on this page
8 skipping all processing on this page
4 skipping all processing on this page
9 skipping all processing on this page
17 skipping all processing on this page
10 skipping all processing on this page
6 skipping all processing on this page
19 skipping all processing on this page
7 skipping all processing on this page
20 skipping all processing on this page
21 skipping all processing on this page
2 skipping all processing on this page
18 skipping all processing on this page
16 skipping all processing on this page
24 skipping all processing on this page
22 skipping all processing on this page
23 skipping all processing on this page
11 skipping all processing on this page
13 skipping all processing on this page
12 skipping all processing on this page
14 skipping all processing on this page
15 skipping all processing on this page
25 skipping all processing on this page
27 skipping all processing on this page
28 skipping all processing on this page
29 skipping all processing on this page
26 skipping all processing on this page
30 skipping all processing on this page
31 skipping all processing on this page
32 skipping all processing on this page
33 skipping all processing on this page
34 skipping all processing on this page
35 skipping all processing on this page
37 skipping all processing on this page
38 skipping all processing on this page
39 skipping all processing on this page
40 skipping all processing on this page
42 skipping all processing on this page
41 skipping all processing on this page
OCR: 100%|██████████████████████████████| 95.0/95.0 [00:00<00:00, 2896.16page/s]
36 skipping all processing on this page
43 skipping all processing on this page
44 skipping all processing on this page
51 skipping all processing on this page
50 skipping all processing on this page
59 skipping all processing on this page
57 skipping all processing on this page
46 skipping all processing on this page
53 skipping all processing on this page
60 skipping all processing on this page
48 skipping all processing on this page
47 skipping all processing on this page
55 skipping all processing on this page
54 skipping all processing on this page
62 skipping all processing on this page
61 skipping all processing on this page
49 skipping all processing on this page
45 skipping all processing on this page
56 skipping all processing on this page
52 skipping all processing on this page
58 skipping all processing on this page
63 skipping all processing on this page
64 skipping all processing on this page
66 skipping all processing on this page
65 skipping all processing on this page
67 skipping all processing on this page
68 skipping all processing on this page
69 skipping all processing on this page
70 skipping all processing on this page
71 skipping all processing on this page
72 skipping all processing on this page
73 skipping all processing on this page
74 skipping all processing on this page
75 skipping all processing on this page
76 skipping all processing on this page
77 skipping all processing on this page
78 skipping all processing on this page
79 skipping all processing on this page
80 skipping all processing on this page
81 skipping all processing on this page
82 skipping all processing on this page
83 skipping all processing on this page
84 skipping all processing on this page
86 skipping all processing on this page
85 skipping all processing on this page
88 skipping all processing on this page
89 skipping all processing on this page
93 skipping all processing on this page
90 skipping all processing on this page
91 skipping all processing on this page
87 skipping all processing on this page
92 skipping all processing on this page
95 skipping all processing on this page
94 skipping all processing on this page
JPEGs: 0image [00:00, ?image/s]
JBIG2: 0item [00:00, ?item/s]
Optimize ratio: 0.99 savings: -0.7%
Image optimization did not improve the file - discarded
Also, the .docx is ~30 MB and the regular .pdf is about 11 MB; when I OCR'd it (with --force-ocr), the result was ~70 MB (more than 6 times the size).
I installed ocrmypdf 11.7.0 and ran that command with --redo-ocr; the ToC is clickable, awesome!! Thanks! The size is also the same as the LibreOffice-generated .pdf, perfect!