sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Use pdf "/ActualText" feature #494

Open mnjames opened 6 years ago

mnjames commented 6 years ago

The PDF standard defines an /ActualText entry which lets you embed the Unicode text alongside the glyphs that are actually drawn in the PDF. This is wonderful for Arabic and other non-Latin scripts, for which copy-paste out of PDFs has never worked reliably.

XeTeX added the "\XeTeXgenerateactualtext=1" setting a year or so ago, so that PDFs it generates include the ActualText data.
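
To make the mechanism concrete, here is a minimal Python sketch (not SILE or XeTeX code) of how replacement text can be attached to a run of glyphs in a content stream. It assumes the hex-string form of a PDF text string, UTF-16BE with a leading FEFF byte order mark, wrapped in a /Span marked-content sequence (BDC ... EMC). The function names and the glyph operand `<0123>` are placeholders invented for illustration.

```python
def actual_text_hex(text: str) -> str:
    """Hex-encode text as UTF-16BE with a leading byte order mark (FEFF)."""
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex().upper()


def wrap_with_actual_text(drawing_ops: str, text: str) -> str:
    """Wrap glyph-drawing operators in a /Span carrying /ActualText."""
    return (
        f"/Span << /ActualText <{actual_text_hex(text)}> >> BDC\n"
        f"{drawing_ops}\n"
        "EMC"
    )


# "<0123>" stands in for whatever glyph ID the font uses for an fi ligature
# (a made-up value, for illustration only).
print(wrap_with_actual_text("[<0123>] TJ", "fi"))
```

A viewer that honours /ActualText should then copy "fi" for that span, regardless of which glyph IDs were drawn.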

Is it possible to add a similar feature to SILE?

alerque commented 6 years ago

Never mind Arabic, I can't reliably copy/paste out of a PDF in Latin alphabet based languages!

I've heard of this feature in PDFs before but never played around with it. How widespread is reader support? Do you happen to know of a chart somewhere that shows what readers do or don't support PDF features like this?

mnjames commented 6 years ago

I haven’t been able to find much documentation on it. From myself and one other user I can currently report:

Adobe Reader DC – works

Foxit Reader – doesn’t work

Foxit (linux version) – doesn’t work

qpdfview (linux) – doesn’t work

evince (linux) – doesn’t work

So it looks like it isn’t supported by many readers. On the other hand, I’m assuming that Adobe represents the major share of the PDF reader population.

--Malachi

simoncozens commented 6 years ago

There is some support for this through the pdfstructure package. (Linking to #110) Unfortunately I didn't document it and can't remember what it does. But I think if you include pdfstructure, it should automatically generate ActualText.

neoh4x0r commented 1 year ago

I haven’t been able to find much documentation on it. From myself and one other user I can currently report: evince (linux) – doesn’t work

Edit: evince supports this now. I'm not sure about qpdfview (I couldn't figure out how to copy text)

Edit 2: The only problem with enabling \XeTeXgenerateactualtext is that when you select text (to copy), it turns invisible and only shows some squares, possibly indicating missing characters.


I know I'm posting this 5 years later....but....

I was writing a game list (PDF) with LaTeX (using a script to find the games and generate a table).

Without specifying \XeTeXgenerateactualtext=1 in the tex file, any plain dash in the text would show in the PDF but would be missing when the text was copied and pasted elsewhere.

After generating the PDF with the setting active, evince (as of now) has actual dashes in the text that can be copied and pasted as one would expect.

PS: I see no reason why a feature like this shouldn't be turned on by default -- if a reader doesn't support the feature then it should, IMHO, simply ignore it and display whatever it would have shown previously.

Long story short: 1) \XeTeXgenerateactualtext=1 could solve an issue with Unicode text copy/paste, 2) it might make the text invisible when selected (this happened in evince).

For my use-case -- plain dashes were not being copied and I didn't like the text turning invisible when selected.

So I ultimately used the ascii package and replaced all dashes with \textascii{\char"2D}

leorosa commented 1 year ago

I'm not sure about qpdfview (I couldn't figure out how to copy text)

In qpdfview, you can press control+C , select with the mouse the area containing text, and then choose "copy text".

neoh4x0r commented 1 year ago

I'm not sure about qpdfview (I couldn't figure out how to copy text)

In qpdfview, you can press control+C , select with the mouse the area containing text, and then choose "copy text".

The text was copied correctly in qpdfview both with and without \XeTeXgenerateactualtext=1

So it does look like this is purely a PDF-viewer issue (very similar to the old question of which CSS features a browser supports), and not related to LaTeX, SILE, XeLaTeX, etc.

Omikhleia commented 9 months ago

See somewhat related discussion https://github.com/sile-typesetter/sile/discussions/1927

Omikhleia commented 3 weeks ago

For the record, I experimented with emitting /ActualText directly in the libtexpdf outputter, around text boxes, as I suggested in a discussion some time ago: https://github.com/sile-typesetter/sile/discussions/1927#discussioncomment-7862825

Then, search (and copy) work well in Evince (before, it would fail on the fi ligature...):

image

But when selecting the text, it shows ugly things...

image

It might be an Evince-only problem (using v46.0): Okular (using v24.05.2) doesn't show it. (Without the change, Okular also failed to find/copy the fi ligature, but with the suggested code change everything seems fine.)

image

So I'm unsure whether it's a PDF-viewer problem or whether there's some deeper issue with this naive /ActualText approach.

Omikhleia commented 3 weeks ago

N.B. The "naive" patch:

diff --git a/outputters/libtexpdf.lua b/outputters/libtexpdf.lua
index c7f7d42b..cf7c8c60 100644
--- a/outputters/libtexpdf.lua
+++ b/outputters/libtexpdf.lua
@@ -132,6 +132,8 @@ function outputter:drawHbox (value, width)
    if not value.glyphString then
       return
    end
+   local txt = SU.utf8_to_utf16be_hexencoded(value.text)
+   pdf.add_content("/Span << /ActualText <" .. txt .. "> >>\nBDC\n")
    -- Nodes which require kerning or have offsets to the glyph
    -- position should be output a glyph at a time. We pass the
    -- glyph advance from the htmx table, so that libtexpdf knows
@@ -155,6 +157,7 @@ function outputter:drawHbox (value, width)
       buf = table.concat(buf, "")
       self:_drawString(buf, width, 0, 0)
    end
+   pdf.add_content("\nEMC")
 end

 function outputter:_withDebugFont (callback)
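
A rough way to check what a patch like this emits is to pull the /ActualText hex strings back out of a content stream and decode them. The Python sketch below assumes an uncompressed content stream and the hex-string form used in the patch. It is a regex-based illustration, not a PDF parser, and `decode_actual_texts` and the sample stream are hypothetical.

```python
import re
from typing import List

# Matches the hex-string form: /ActualText <FEFF...>
ACTUAL_TEXT_RE = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]*)>")


def decode_actual_texts(content: bytes) -> List[str]:
    """Return the decoded /ActualText strings found in a content stream."""
    texts = []
    for match in ACTUAL_TEXT_RE.finditer(content):
        hex_digits = re.sub(rb"\s+", b"", match.group(1)).decode("ascii")
        if len(hex_digits) % 2:
            # PDF hex strings treat a missing final digit as 0.
            hex_digits += "0"
        raw = bytes.fromhex(hex_digits)
        if raw.startswith(b"\xfe\xff"):
            # Leading byte order mark: the rest is UTF-16BE.
            texts.append(raw[2:].decode("utf-16-be"))
        else:
            # Without a BOM the string is PDFDocEncoding; Latin-1 is only
            # an approximation used here for illustration.
            texts.append(raw.decode("latin-1"))
    return texts


# Made-up stream fragment in the shape the patch would produce.
sample = b"/Span << /ActualText <FEFF00660069> >>\nBDC\n[<0123>] TJ\nEMC"
print(decode_actual_texts(sample))
```

If the decoded strings match the source text, the remaining selection glitches are more likely to come from how a given viewer renders the span than from the encoded text itself.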