latex3 / pdfresources

LaTeX PDF resource management
LaTeX Project Public License v1.3c

Wrong copy-paste #19

Closed dbitouze closed 3 years ago

dbitouze commented 3 years ago

I don't know whether pdfmanagement-testphase is the culprit or not ;) but, anyway: copying the code between "Or through a dictionary:" and "Or if you want to exclude the possibility [...]" page 3 of the l3pdfannot module's documentation:

\pdfdict_new:n {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\pdfannot_dict_put:nnn {link/URI} { C } {[1~0~0]} %red border
\pdfannot_link:nxn { URI }
{
/A <<\pdfdict_use:n{l_my_action_dict}>>
}
{ link text }

is pasted as:

\pdfdict_new:n
\pdfdict_put:nnn
\pdfdict_put:nnn
\pdfdict_put:nnn
{l_my_action_dict}
{l_my_action_dict}{Type}{/Action}
{l_my_action_dict}{S}{/URI}
{l_my_action_dict}{URI}{(https://www.latex-project.org)}
\pdfannot_dict_put:nnn
{link/URI} { C } {[1~0~0]} %red border
\pdfannot_link:nxn { URI }
{
/A <<\pdfdict_use:n{l_my_action_dict}>>
}
{ link text }

(Tested on Linux with several PDF readers: Zathura, Okular and Evince.)

u-fischer commented 3 years ago

If you copy & paste from an untagged pdf, the reading order is decided by the pdf viewer with some heuristics. In this case your viewers probably take the alignment as an indication of two-column mode and so re-sort the text.

As I don't want to remove the nice alignment, you will have to wait for the progress of the tagged PDF project to get a better result here ;-). But even with a fully tagged pdf I wouldn't fully trust copy & paste. Imho pdf viewers don't care much about code and so sometimes drop spaces or remove new lines.

dbitouze commented 3 years ago

But even with a fully tagged pdf I wouldn't fully trust copy & paste. Imho pdf viewers don't care much about code and so sometimes drop spaces or remove new lines.

Do you mean you expect readers of (LaTeX) documentation to type by hand all the source code they want to test (with, among other things, the risk of misspellings), in the above example 350 characters?! This is not very engaging ;)

u-fischer commented 3 years ago

I expect readers of LaTeX documentation to know that there is a source ... ;-)

But besides this: imho it is more reliable to embed/attach the code as a file, so in a tagged pdf I would try to add it as an associated file.
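
A minimal sketch of the "attach the code as a file" idea, for readers who want to try it today. It uses the embedfile package rather than a tagged-pdf associated file, which is a swapped-in, simpler technique and not necessarily what the pdfresources documentation will do. The file name my-action-dict.tex is hypothetical and chosen only for this illustration.

```latex
% Write the snippet to an external file so that the attachment and the
% printed listing share the same text (file name is hypothetical).
\begin{filecontents*}[overwrite]{my-action-dict.tex}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{filecontents*}

\documentclass{article}
\usepackage{embedfile} % attaches files to the PDF (pdfLaTeX or LuaLaTeX)
\usepackage{verbatim}  % provides \verbatiminput
\begin{document}

% Attach the source; viewers that support attachments can save it
% unchanged, independent of any copy & paste heuristics.
\embedfile[desc={Source of the dictionary example}]{my-action-dict.tex}

% Print the same file, keeping the aligned layout.
\verbatiminput{my-action-dict.tex}

\end{document}
```

The printed listing keeps its alignment, and the attached file is what a reader would save and test. Whether a given viewer exposes attachments is viewer dependent.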

dbitouze commented 3 years ago

BTW, the trouble can be avoided with the package listings (and the columns=flexible option):

\documentclass{article}
\usepackage{listings}
\lstset{basicstyle=\ttfamily,columns=flexible}
\begin{document}

\begin{verbatim}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{verbatim}

\begin{lstlisting}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{lstlisting}
\end{document}

u-fischer commented 3 years ago

Yes, but this destroys the alignment. That was what I meant above.

FrankMittelbach commented 3 years ago

(Tested on Linux with several PDF readers: Zathura, Okular and Evince.)

shows that all the readers you use apply strange heuristics. Here is what I get (pdfexpert is not perfect, but reasonable):

\pdfdict_new:n {l_my_action_dict} \pdfdict_put:nnn {l_my_action_dict}{Type}{/Action} \pdfdict_put:nnn {l_my_action_dict}{S}{/URI} \pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}

\pdfannot_dict_put:nnn {link/URI} { C } {[1~0~0]} %red border

\pdfannot_link:nxn { URI }

{ /A <<\pdfdict_use:n{l_my_action_dict}>> } { link text }
\pdfdict_new:n   {l_my_action_dict}
   \pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
   \pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
   \pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
   \pdfannot_dict_put:nnn
     {link/URI} { C } {[1~0~0]} %red border
   \pdfannot_link:nxn { URI }
    {
      /A <<\pdfdict_use:n{l_my_action_dict}>>
    }
{ link text }
\pdfdict_new:n {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\pdfannot_dict_put:nnn
{link/URI} { C } {[1~0~0]} %red border
\pdfannot_link:nxn { URI }
{
/A <<\pdfdict_use:n{l_my_action_dict}>>
}{
link text }

dbitouze commented 3 years ago

all the readers you use apply strange heuristics

If I'm right, all of them are poppler based. I'll open a bug report there.

dbitouze commented 3 years ago

If I'm right, all of them are poppler based. I'll open a bug report there.

Considered as a priori non-fixable by a poppler developer ☹️

car222222 commented 3 years ago

Well, as Albert Astals Cid wrote:

"getting text from PDF files is a guessing game".

This applies both locally (what characters to paste) and globally (what to include and in what order).

I have a large menagerie of strange examples of "interesting results from copy-and-paste".

FrankMittelbach commented 3 years ago

I have a large menagerie of strange examples of "interesting results from copy-and-paste".

In what format do you have them? They could be valuable as test cases.

FrankMittelbach commented 3 years ago

If I'm right, all of them are poppler based. I'll open a bug report there.

Considered as a priori non-fixable by a poppler developer ☹️

Interesting, given that all other viewers get their heuristics right. But he is right, of course, in the sense that without appropriate internal tagging, guesses will always have edge cases where they fail.

blefloch commented 3 years ago

On 6/23/21 10:21 AM, Frank Mittelbach wrote:

    If I'm right, all of them are poppler based. I'll open a bug report there.

Considered as a priori non-fixable by a poppler developer
<https://gitlab.freedesktop.org/poppler/poppler/-/issues/1093#note_968375> ☹️

Interesting, given that all other viewers get their heuristics right. But he is right, of course, in the sense that without appropriate internal tagging, guesses will always have edge cases where they fail.

What's an appropriate tagging for this? I tried simply adding some \pdffakespace to the definition of @.***, with the example code from the pdftex documentation, and this still failed to copy-paste properly.

Bruno

u-fischer commented 3 years ago

@blefloch something like this should work, at least with the reading order (but I can't test whether the affected pdf viewers actually understand tagged pdf):

\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata{uncompress}

\documentclass{article}
\usepackage{listings}
\lstset{basicstyle=\ttfamily,columns=flexible}
\usepackage{tagpdf}
\tagpdfsetup{activate-all,interwordspace=true,paratagging}
\begin{document}
\tagstructbegin{tag=Document}
\tagstructbegin{tag=Code}
\begin{verbatim}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{verbatim}
\tagstructend

\tagstructbegin{tag=Code}
\begin{lstlisting}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{lstlisting}
\tagstructend
\tagstructend
\end{document}

dbitouze commented 3 years ago

something like this should work, at least with the reading order (but I can't test whether the affected pdf viewers actually understand tagged pdf)

On Linux with Zathura, Okular and Evince, still:

u-fischer commented 3 years ago

@dbitouze then you could make a new bug report and ask if they support tagged pdf (at best compile with lualatex and at least twice just to be sure the structure is right and there ...) And you need a current latex, paratagging works only with it.

dbitouze commented 3 years ago

@dbitouze then you could make a new bug report and ask if they support tagged pdf (at best compile with lualatex and at least twice just to be sure the structure is right and there ...)

I'll do it and let you know here.

And you need a current latex, paratagging works only with it.

Is it OK with:

$ lualatex test   
This is LuaHBTeX, Version 1.13.2 (TeX Live 2021) 
 restricted system commands enabled.
(./test.tex
LaTeX2e <2021-06-01> patch level 1
 L3 programming layer <2021-06-18>

and:

 *File List*
pdfmanagement-testphase.sty    2021-06-14 v0.95e LaTeX PDF management testphase bundle
pdfmanagement-testphase.ltx    2021-06-14 v0.95e PDF management code (testphase)
l3bitset.sty    2021-05-27 L3 Experimental bitset support
   expl3.sty    2021-06-18 L3 programming layer (loader) 
l3backend-luatex.def    2021-05-07 L3 backend support: PDF output (LuaTeX)
l3backend-testphase-luatex.def    2021-06-14 LaTeX PDF management testphase bun
dle backend support:PDFoutput(LuaTeX)
l3ref-tmp.sty    2020-10-09 L3 Experimental cross-referencing
pdfmanagement-firstaid.sty    2021-06-14 v0.95e LaTeX PDF management testphase 
bundle / firstaid-patches
 article.cls    2021/02/12 v1.4n Standard LaTeX document class
  size10.clo    2021/02/12 v1.4n Standard LaTeX file (size option)
listings.sty    2020/03/24 1.8d (Carsten Heinz)
  keyval.sty    2014/10/28 v1.15 key=value parser (DPC)
 lstmisc.sty    2020/03/24 1.8d (Carsten Heinz)
listings.cfg    2020/03/24 1.8d listings configuration
  tagpdf.sty    2021-06-14 v0.82  A package to experiment with pdf tagging 
etoolbox.sty    2020/10/05 v2.5k e-TeX tools for LaTeX (JAW)
tagpdf-luatex.def    2021-06-14 v0.82 tagpdf driver for luatex
tagpdf-checks-code.sty    2021-06-14 v0.82 part of tagpdf - code related to che
cks and messages
tagpdf-user.sty    2021-06-14 v0.82 tagpdf - user commands
tagpdf-tree-code.sty    2021-06-14 v0.82 part of tagpdf - code related to writi
ng trees and dictionaries to the pdf
tagpdf-roles-code.sty    2021-06-14 v0.82 part of tagpdf - code related to role
s and structure names
tagpdf-attr-code.sty    2021-06-14 v0.82 part of tagpdf - code related to attri
butes and attribute classes
tagpdf-mc-code-shared.sty    2021-06-14 v0.82 part of tagpdf - code related to 
marking chunks - code shared by generic and luamode 
tagpdf-mc-code-lua.sty    2021-06-14 v0.82 tagpdf - mc code only for the luamod
e 
tagpdf-struct-code.sty    2021-06-14 v0.82 part of tagpdf - code related to sto
ring structure
tagpdf-space-code.sty    2021-06-14 v0.82 part of tagpdf - code related to real
 space chars
  ts1cmr.fd    2019/12/16 v2.5j Standard LaTeX font definitions
 ***********

dbitouze commented 3 years ago

Interestingly, the test file provided by Ulrike fails as I said, since it is pasted as:

\pdfdict_new:n
\pdfdict_put:nnn
\pdfdict_put:nnn
\pdfdict_put:nnn
{l_my_action_dict}
{l_my_action_dict}{Type}{/Action}
{l_my_action_dict}{S}{/URI}
{l_my_action_dict}{URI}{(https://www.latex-project.org)}
\pdfdict_new:n {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}

but the following one:

\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata{uncompress}

\documentclass{article}
\usepackage{listings}
\lstset{basicstyle=\ttfamily,columns=flexible}
\usepackage{tagpdf}
\tagpdfsetup{activate-all,interwordspace=true,paratagging}
\begin{document}
\section{First code} % <-- Here is the 1st difference
\tagstructbegin{tag=Document}
\tagstructbegin{tag=Code}
\begin{verbatim}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{verbatim}
\tagstructend
\section{Second code} % <-- Here is the 2nd difference
\tagstructbegin{tag=Code}
\begin{lstlisting}
\pdfdict_new:n   {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}
\end{lstlisting}
\tagstructend
\tagstructend
\end{document}

is less wrong:

1 First code
\pdfdict_new:n
 {l_my_action_dict}
\pdfdict_put:nnn
 {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn
 {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn
 {l_my_action_dict}{URI}{(https://www.latex-project.org)}
2 Second code
\pdfdict_new:n {l_my_action_dict}
\pdfdict_put:nnn {l_my_action_dict}{Type}{/Action}
\pdfdict_put:nnn {l_my_action_dict}{S}{/URI}
\pdfdict_put:nnn {l_my_action_dict}{URI}{(https://www.latex-project.org)}

dbitouze commented 3 years ago

is less wrong:

Well... only with Okular (still fails with Zathura and Evince).

u-fischer commented 3 years ago

Is it OK with:

Yes, should be fine. You could also add the option "paratagging-show": if you get lots of small red numbers, paratagging works. Or you could check whether there are tags at https://www.ngpdf.com/.
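
A minimal sketch of that check, assuming the tagpdf version from the file list above (v0.82) and the same setup as the test file. The only change is the added paratagging-show key; the body text is a placeholder.

```latex
\RequirePackage{pdfmanagement-testphase}
\DeclareDocumentMetadata{uncompress}

\documentclass{article}
\usepackage{tagpdf}
% Same keys as in the test file, plus paratagging-show, which marks each
% automatically tagged paragraph with small red numbers.
\tagpdfsetup{activate-all,interwordspace=true,paratagging,paratagging-show}
\begin{document}
\tagstructbegin{tag=Document}
A first paragraph of ordinary text.

A second paragraph of ordinary text.
\tagstructend
\end{document}
```

If no red numbers appear after compiling (with lualatex, at least twice), paratagging is probably not active in the installed LaTeX.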

is less wrong:

Well, heuristics are heuristics. The readers are trying to guess whether it is a two-column document or not, and naturally everything on the page is taken into account.