vedang / pdf-tools

Emacs support library for PDF files.
https://pdftools.wiki
GNU General Public License v3.0

Transparent text selection #149

Closed workcomplete closed 2 years ago

workcomplete commented 2 years ago

I'll preface this by saying I am not sure if this is an issue with pdf-tools specifically, vs a more general issue with tesseract OCR/poppler. I originally posted this in discussions, but I don't think many people are using this feature atm.

When I select text (i.e. click and drag the mouse over a region of text) in pdf-tools, the text selection is not transparent. For PDFs that are created digitally this is fine, but for PDFs that are scanned and OCR'd with tesseract, the selected text becomes hidden behind the selection (see images below). Is there a way to change this behavior without re-OCRing the PDF with something like Adobe Acrobat (i.e. is it possible to make the text selection transparent in pdf-tools)?

What selected text looks like in a pdf generated from a website: Screenshot from 2022-09-08 20-27-51

What selected text looks like in a scanned book that has been OCRd with tesseract

Screenshot from 2022-09-08 19-07-11

OCR text can still be copied and pasted and looks fine when markup is applied.

Screenshot from 2022-09-08 19-07-55

Originally posted by @workcomplete in https://github.com/vedang/pdf-tools/discussions/147

If I understand this post on stack exchange, poppler has already implemented transparent text selection?

https://tex.stackexchange.com/questions/565909/invisible-text-even-when-selected-in-evince

This same issue with pdf-tools (under previous maintainer) is mentioned here

https://gitlab.freedesktop.org/poppler/poppler/-/issues/157

orgtre commented 2 years ago

Ok, I can reproduce this using poppler 22.08 with the linn.pdf attached to the poppler issue report linked above. Still, highlighted text displays correctly. In Evince the behavior is exactly the same as in pdf-tools (non-transparent when selected, but transparent when highlighted). In several other non-poppler-based PDF readers the selections are transparent, which to me indicates that this is a poppler issue (despite the fact that the issue linked above is closed). Not sure if there is something we can do on the pdf-tools side.

workcomplete commented 2 years ago

OK, well, I found a workaround (below), but I agree that this seems to be a poppler issue: it does not handle 3 Tr the way the chosen answer in the Stack Exchange question above says it does.

I followed the instructions here up to step 2 but, instead of replacing 3 Tr with 1 Tr, I replaced 3 Tr with 7 Tr. (In the PDF text rendering modes, 3 means invisible text and 7 means adding the text to the clipping path without painting it.) This makes the text visible when selected (see image below). The remaining steps are not necessary. The downside here is that I download many books from online resources, and 80% of the time they are already OCR'd with tesseract. Needing to do this for every text is a bit painful.

Screenshot from 2022-09-15 11-40-20

orgtre commented 2 years ago

This is a very useful discovery and should probably be communicated somewhere else too!

To make the first step slightly easier one could use my qpdf transient wrapper qpdf.el. In fact the whole procedure can be automated with the following command:

(defun my-fix-pdf-selection ()
  "Replace pdf with one where selection shows transparently."
  (interactive)
  (unless (equal (file-name-extension (buffer-file-name)) "pdf")
    (error "Buffer should visit a pdf file"))
  (unless (derived-mode-p 'pdf-view-mode)
    (pdf-view-mode))
  ;; Save the file in QDF mode so its content streams are plain text.
  (qpdf-run (list
             (concat "--infile="
                     (buffer-file-name))
             "--qdf --object-streams=disable"
             "--replace-input"))
  ;; Do the replacements on the raw file contents.
  (text-mode)
  (read-only-mode -1)
  ;; Start from the top so no occurrence before point is missed.
  (goto-char (point-min))
  ;; Match case-sensitively and insert the replacement literally.
  (let ((case-fold-search nil))
    (while (re-search-forward "3 Tr" nil t)
      (replace-match "7 Tr" t t)))
  (save-buffer)
  (pdf-view-mode))

This still depends on qpdf.el, but with a few extra lines of code one could avoid that. Note that this overwrites the PDF file visited in the buffer from which it is run! To avoid this, replace "--replace-input" with (concat "--outfile=" (file-truename (read-file-name "Outfile: "))). A caveat is that in the PDFs I tested, selecting is substantially slower after running my-fix-pdf-selection, and running fix-qdf doesn't help.
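For readers who want to try the same idea without qpdf.el and without overwriting the original, here is a minimal sketch that calls the qpdf executable directly and writes to a separate output file. It is not the code from qpdf.el or from the README. The names my-pdf--replace-tr-in-buffer and my-fix-pdf-selection-to-file are hypothetical, and it assumes that qpdf is on the PATH and that an exit status of 3 means "succeeded with warnings", as the qpdf manual describes. It does not handle reverting a buffer that already visits the file.

```elisp
;; Hypothetical helper: the same "3 Tr" -> "7 Tr" swap as above,
;; applied to whatever buffer is current.
(defun my-pdf--replace-tr-in-buffer ()
  "Replace each \"3 Tr\" in the current buffer with \"7 Tr\".
Return the number of replacements made."
  (let ((case-fold-search nil)
        (count 0))
    (goto-char (point-min))
    (while (search-forward "3 Tr" nil t)
      (replace-match "7 Tr" t t)
      (setq count (1+ count)))
    count))

;; Hypothetical command: assumes the `qpdf' executable is installed.
(defun my-fix-pdf-selection-to-file (infile outfile)
  "Write to OUTFILE a QDF copy of INFILE with \"3 Tr\" changed to \"7 Tr\".
INFILE itself is left untouched."
  (interactive "fPDF to fix: \nFOutput file: ")
  (let* ((in (expand-file-name infile))
         (out (expand-file-name outfile))
         ;; Arguments are passed as a list, so no shell quoting is
         ;; involved and spaces in file names need no special care.
         (status (call-process "qpdf" nil nil nil
                               "--qdf" "--object-streams=disable"
                               in out)))
    ;; Assumption: 0 is success, 3 is success with warnings.
    (unless (memq status '(0 3))
      (error "qpdf exited with status %s" status))
    ;; Edit the result as raw bytes so binary streams are not re-encoded.
    (let ((coding-system-for-write 'no-conversion))
      (with-temp-buffer
        (set-buffer-multibyte nil)
        (insert-file-contents-literally out)
        (let ((n (my-pdf--replace-tr-in-buffer)))
          (write-region (point-min) (point-max) out nil 'silent)
          (message "Replaced %d occurrence(s) in %s" n out)
          n)))))
```

Because "3 Tr" and "7 Tr" have the same length, the byte offsets recorded in the QDF file stay valid after the edit. Use a different name for the output file than for the input.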

workcomplete commented 2 years ago

Great, thanks for sharing your package and function.

Reopening as it could be helpful if your function was added to the wiki, or at the very least if this could be added to known issues.

One small issue with your function above: when tested it only worked on file names that do not contain spaces. Not a major issue for me as I use zotfile/zotero to rename my pdfs.

And FWIW I have not noticed the same issue with text selection lag between "fixed" and unfixed pdfs, both are quite slow for me (compared to other PDF readers) as noted in https://github.com/vedang/pdf-tools/issues/87#issue-1185198366

orgtre commented 2 years ago

Thanks for testing. I fixed the issue with spaces in file names, plus added the problem and workaround to known problems in the README. If I understand correctly, once the pull request is merged, pdftools.wiki (which just mirrors the README) will rebuild automatically to reflect the changes. I tried to work around the qpdf.el dependency, but to make it behave correctly in all cases, basically the whole qpdf--default-run-after-function is needed, so I decided to keep the dependency after all.

workcomplete commented 2 years ago

Thanks again for your work on this! FYI, the function recently stopped working on my setup; it appears to be inserting a trailing '' in the qpdf command.

Minibuffer contents after running my-fix-pdf-selection:

call: qpdf '/home/user/Zotero/storage/Z4JZ82SF/some_file.pdf' --qdf --object-streams=disable --replace-input ''

When I copy this into a terminal window and remove the trailing '' the command runs as expected.

vedang commented 2 years ago

@workcomplete : ^ This might be a bug introduced in the latest commit on qpdf.el: https://github.com/orgtre/qpdf.el/commit/6debdceef657691d678dcb24f91837f195d1b6b6 cc: @orgtre . I'm saying this after a cursory look at the linked commit, which adds quotes around outfile (and in the command above there is no outfile argument). I might be wrong, I haven't actually tested the code.
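To illustrate how a stray '' of this kind can arise in general, here is a small sketch. It is not qpdf.el's actual code and not a confirmed diagnosis. The functions my-quote, my-qpdf-command-naive and my-qpdf-command-guarded are hypothetical, and the only assumption is that the command string is built by wrapping each file name in single quotes.

```elisp
;; Hypothetical quoting helper; it does not handle embedded quotes.
(defun my-quote (s)
  "Wrap S in single quotes."
  (concat "'" s "'"))

;; Quotes OUTFILE unconditionally, so an empty OUTFILE becomes ''.
(defun my-qpdf-command-naive (infile options outfile)
  "Build a qpdf command string, always quoting OUTFILE."
  (mapconcat #'identity
             (list "qpdf" (my-quote infile) options (my-quote outfile))
             " "))

;; Adds the quoted OUTFILE only when there is one.
(defun my-qpdf-command-guarded (infile options outfile)
  "Build a qpdf command string, skipping a nil or empty OUTFILE."
  (mapconcat #'identity
             (delq nil
                   (list "qpdf" (my-quote infile) options
                         (unless (member outfile '(nil ""))
                           (my-quote outfile))))
             " "))

(my-qpdf-command-naive "/tmp/a.pdf" "--replace-input" "")
;; => "qpdf '/tmp/a.pdf' --replace-input ''"
(my-qpdf-command-guarded "/tmp/a.pdf" "--replace-input" "")
;; => "qpdf '/tmp/a.pdf' --replace-input"
```

The first result has the same shape as the command reported above; the guard in the second is one possible way to avoid it.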

I am going ahead and merging the PR which explains this as a known problem (which will also update the wiki).

Thank you!

vedang commented 2 years ago

closed via vedang/pdf-tools@aec8ecd0066ba2cbfefed1330a71deada89f5b5a

orgtre commented 2 years ago

I merged vedang's fix. Thanks both of you!