Open mnjames opened 6 years ago
Never mind Arabic, I can't reliably copy/paste out of a PDF in Latin alphabet based languages!
I've heard of this feature in PDFs before but never played around with it. How widespread is reader support? Do you happen to know of a chart somewhere that shows what readers do or don't support PDF features like this?
I haven’t been able to find much documentation on it. From myself and one other user I can currently report:
Adobe Reader DC – works
Foxit Reader – doesn’t work
Foxit (linux version) – doesn’t work
qpdfview (linux) – doesn’t work
evince (linux) – doesn’t work
So it looks like it isn’t supported by many readers. On the other hand, I’m assuming that Adobe represents the major share of the PDF reader population.
--Malachi
From: Caleb Maclennan [mailto:notifications@github.com] Sent: Tuesday, November 21, 2017 11:16 To: simoncozens/sile sile@noreply.github.com Cc: mnjames mjames@wordmail.net; Author author@noreply.github.com Subject: Re: [simoncozens/sile] Use pdf "/ActualText" feature (#494)
There is some support for this through the pdfstructure package. (Linking to #110) Unfortunately I didn't document it and can't remember what it does. But I think if you include pdfstructure, it should automatically generate ActualText.
I haven’t been able to find much documentation on it. From myself and one other user I can currently report: evince (linux) – doesn’t work
Edit: evince supports this now. I'm not sure about qpdfview (I couldn't figure out how to copy text)
Edit 2: The only problem with enabling \XeTeXgenerateactualtext is that when you select text (to copy) it turns invisible and only shows some squares, possibly indicating missing characters.
I know I'm posting this 5 years later....but....
I was writing a game-list (pdf) through latex (using a script to find the games and generating a table).
Without specifying \XeTeXgenerateactualtext=1 in the tex file, any plain dash in the text would show in the pdf but would be missing when the text was copied and pasted elsewhere.
After generating the pdf with the setting active, evince (as of now) has actual dashes in the text that can be copied and pasted as one would expect.
PS: I see no reason why a feature like this shouldn't be turned on by default -- if a reader doesn't support the feature then it should, IMHO, simply ignore it and display whatever it would have shown previously.
Long story short: 1) \XeTeXgenerateactualtext=1 could solve an issue with Unicode text copy/paste, 2) it might make the text invisible when selected (happened in evince).
For my use-case -- plain dashes were not being copied and I didn't like the text turning invisible when selected.
So I ultimately used the ascii package and replaced all dashes with \textascii{\char"2D}
I'm not sure about qpdfview (I couldn't figure out how to copy text)
In qpdfview, you can press control+C, select the area containing text with the mouse, and then choose "copy text".
The text was copied correctly in qpdfview both with and without \XeTeXgenerateactualtext=1
So it does look like this is purely a PDF-viewer issue (very similar to the old question of which CSS features a browser supports) and not related to LaTeX, SILE, XeLaTeX, etc.
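One way to tell a producer-side problem from a viewer-side one is to check whether the PDF actually contains /ActualText entries. The following is only a rough Python sketch, not a real PDF parser: it assumes the content stream has already been decompressed, that /ActualText is written as a hex string (as XeTeX-style output usually does), and the function name `find_actual_text` is made up for illustration.

```python
import re
from typing import List

# Matches "/ActualText <hex digits>" but not "/ActualText <<" (a dictionary).
ACTUAL_TEXT_HEX = re.compile(rb"/ActualText\s*<([0-9A-Fa-f\s]*)>")


def find_actual_text(stream: bytes) -> List[str]:
    """Return the decoded /ActualText hex strings found in an uncompressed stream."""
    found = []
    for match in ACTUAL_TEXT_HEX.finditer(stream):
        digits = "".join(match.group(1).decode("ascii").split())
        if len(digits) % 2:
            digits += "0"  # PDF treats a missing final hex digit as 0
        raw = bytes.fromhex(digits)
        if raw.startswith(b"\xfe\xff"):
            # A leading FE FF byte-order mark signals UTF-16BE text.
            found.append(raw[2:].decode("utf-16-be"))
        else:
            # Assumption: treat anything else as Latin-1, a rough
            # stand-in for PDFDocEncoding.
            found.append(raw.decode("latin-1"))
    return found
```

If this returns the expected text for a file but a viewer still copies garbage, that points at the viewer rather than the typesetter.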
See somewhat related discussion https://github.com/sile-typesetter/sile/discussions/1927
Just for the record, I experimented with adding /ActualText directly in the libtexpdf outputter around text boxes, as I suggested in a discussion some time ago: https://github.com/sile-typesetter/sile/discussions/1927#discussioncomment-7862825
Then, search (and copy) work well in Evince (before, it would fail on the fi ligature...):
But when selecting the text, it shows ugly things...
It might be an Evince-only problem (using v46.0) -- Okular (using v24.05.2) doesn't have this problem (it also failed to find/copy the fi ligature before, but with the suggested code change everything seems fine).
So I'm unsure whether it's a PDF-viewer problem or whether there's some deeper issue with this naive /ActualText approach.
N.B. The "naive" patch:
diff --git a/outputters/libtexpdf.lua b/outputters/libtexpdf.lua
index c7f7d42b..cf7c8c60 100644
--- a/outputters/libtexpdf.lua
+++ b/outputters/libtexpdf.lua
@@ -132,6 +132,8 @@ function outputter:drawHbox (value, width)
    if not value.glyphString then
       return
    end
+   local txt = SU.utf8_to_utf16be_hexencoded(value.text)
+   pdf.add_content("/Span << /ActualText <" .. txt .. "> >>\nBDC\n")
    -- Nodes which require kerning or have offsets to the glyph
    -- position should be output a glyph at a time. We pass the
    -- glyph advance from the htmx table, so that libtexpdf knows
@@ -155,6 +157,7 @@ function outputter:drawHbox (value, width)
       buf = table.concat(buf, "")
       self:_drawString(buf, width, 0, 0)
    end
+   pdf.add_content("\nEMC")
 end
 
 function outputter:_withDebugFont (callback)
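To make the patch easier to follow, here is a minimal Python sketch of the bytes it writes around each text box. It assumes that SU.utf8_to_utf16be_hexencoded returns the text as hex-encoded UTF-16BE with a leading FEFF byte-order mark (not checked against the SILE source here). The function names and the sample drawing operator are hypothetical and only serve the illustration.

```python
def actual_text_hex(text: str) -> str:
    """Hex-encode text as UTF-16BE, prefixed with the FE FF byte-order mark."""
    return (b"\xfe\xff" + text.encode("utf-16-be")).hex()


def wrap_in_span(text: str, drawing_ops: str) -> str:
    """Wrap drawing operators in a marked-content sequence carrying /ActualText.

    Mirrors the two pdf.add_content calls in the patch: the BDC operator opens
    the span before the glyphs are drawn and EMC closes it afterwards.
    """
    return (
        f"/Span << /ActualText <{actual_text_hex(text)}> >>\nBDC\n"
        f"{drawing_ops}"
        "\nEMC"
    )


if __name__ == "__main__":
    # "[<0123>] TJ" is a made-up glyph-showing operator for a single
    # ligature glyph; the viewer should report "fi" when it is copied.
    print(wrap_in_span("fi", "[<0123>] TJ"))
```

Because the span wraps a whole text box, a viewer that honours /ActualText replaces everything drawn between BDC and EMC with the given string, which may be related to the odd selection rendering seen in Evince.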
The PDF standard includes a property called /ActualText, which lets you embed the Unicode text alongside the glyphs in the PDF. This is wonderful for Arabic and other non-Latin scripts, for which copy-pasting out of PDFs has never worked.
XeTeX added the command "\XeTeXgenerateactualtext=1" a year or so ago so that PDFs produced with it include the ActualText data.
Is it possible to add a similar feature to SILE?