latex3 / tagpdf

Tagging support code for LaTeX

Adobe Reader Bounding Box Changes with `interwordspace=on` #34

Closed TCWORLD closed 3 years ago

TCWORLD commented 3 years ago

I've been playing with tagpdf to see if we can make our LaTeX-produced lab notes accessible, and I've generally been getting along OK with tagging most of the parts correctly (I have written a load of wrappers for tagging various bits).

One problem I have noticed is that when I select a different font (e.g. the helvet package with all fonts set to sans), Adobe Reader sounds a bit like a Dalek when it reads out the text. Presumably due to font kerning, the PDF output splits words up into chunks, e.g. (some) becomes (som)(e) in the PDF stream. I will probably try a different font to see if I can avoid that, but investigating it flagged up another problem.

One way I've found to fix the weird splitting of words is to set interwordspace=on in \tagpdfsetup, which does result in all the words being read properly. However, it introduces a new problem: the bounding box of the paragraph highlighted by the reader changes. An example is shown in the image below (the LaTeX code to produce it is at the end of the issue):

Example output

Notice how in the left example, where no interwordspace argument is passed to \tagpdfsetup, the bounding box correctly positions itself around the paragraph to be spoken.

Now when I pass in interwordspace=on (in fact interwordspace=anything), the bounding box of the paragraph suddenly changes to start at the bottom left corner of the page.
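To make the comparison concrete, the two setups differ only in this one key. This is a sketch using the key names from the example code at the end of the issue; use one line or the other, not both:

```latex
% Left example: under pdfLaTeX the bounding box sits around the paragraph.
\tagpdfsetup{uncompress, activate-all}

% Other example: words are read properly, but the bounding box
% starts at the bottom left corner of the page.
\tagpdfsetup{uncompress, activate-all, interwordspace=on}
```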

While this does not cause an issue for reading itself, it does mean that whenever you click on a paragraph to read it, Adobe Reader scrolls down to the bottom of the page, which is not ideal.

Is this an issue with tagpdf itself? Or something to do with the \pdfinterwordspaceon primitive?


Minimum Example Code:

```latex
\documentclass{article}

\usepackage{ifpdf}
\usepackage{etoolbox}

\usepackage{tagpdf}
\tagpdfsetup{uncompress, activate-all}%, interwordspace=on}

\begin{document}

\tagstructbegin{tag=Sect}
\tagstructbegin{tag=H1}%
    \tagmcbegin{tag=H1}%
        \section{Example Section}%
    \tagmcend%
\tagstructend

\tagstructbegin{tag=H2}%
    \tagmcbegin{tag=H2}
        \subsection{Example Subsection}
    \tagmcend
\tagstructend

\tagstructbegin{tag=P}
\tagmcbegin{tag=P}
I am some text in a paragraph.
\tagmcend
\tagstructend

\tagstructend%Sect

\end{document}
```

u-fischer commented 3 years ago

First, a warning: as the documentation says, tagpdf is an experimental package, and I mean this. Don't expect interfaces or behaviour to be stable. Current development is happening in the splitting branch (which I will merge back into develop at some point). This branch requires the splitting branch of the pdfresources project in the latex GitHub, along with the newest LaTeX (and some code at the beginning of the document to activate the pdfresource management).

That said, I'm always interested in feedback about what works and what doesn't.

I would say that you found a problem with \pdfinterwordspaceon. If you try an example without tagpdf, e.g.

```latex
\RequirePackage{l3pdf}
\ExplSyntaxOn
\pdf_uncompress:
\ExplSyntaxOff

\documentclass{article}
\pdfglyphtounicode{space}{0020}
\begin{document}

\pdfinterwordspaceon

\noindent abc cde

\end{document}
```

and look in the PDF, you can see that a space ( ) is inserted before the main displacement 133.768 707.125 Td:

```
BT
/F21 9.9626 Tf/F30 1 Tf( )Tj/F21 9.9626 Tf 133.768 707.125 Td [(ab)-28(c)]TJ
....
```

I found no sensible way to avoid this. It happens even after a \noindent or \leavevmode\pdfinterwordspaceon. It only worked if some other char was printed first.

With lualatex (which uses a quite different method) there is no problem.

TCWORLD commented 3 years ago

Thanks for the quick response. I fully understand it's experimental and have no expectation of stability.

I've switched the document over to LuaLaTeX (it only required changing a couple of package includes, so not as bad as expected) and indeed it works perfectly with interwordspace on.
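For anyone following along, here is a minimal sketch of that setup, cut down from the example in the original post. It assumes the same tagpdf key names used there (tagpdf is experimental, so later releases may differ) and has not been checked against other versions:

```latex
% Compile with lualatex instead of pdflatex.
\documentclass{article}

\usepackage{tagpdf}
% Key names as in the original post; interwordspace is now switched on.
\tagpdfsetup{uncompress, activate-all, interwordspace=on}

\begin{document}

\tagstructbegin{tag=Sect}

\tagstructbegin{tag=P}
\tagmcbegin{tag=P}
I am some text in a paragraph.
\tagmcend
\tagstructend

\tagstructend%Sect

\end{document}
```

The ifpdf and etoolbox packages from the original example are left out here because nothing in this reduced document uses them.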

Once I've got what I need working, I'll share the wrapper I've written. It's a bit clunky, but it might give some useful feedback as to how the package is being used.

car222222 commented 3 years ago

@u-fischer Is it clear to you why/how such a misplaced space character leads to this exact wrong behaviour of the 'bounding box' as it appears in the reader?

u-fischer commented 3 years ago

@car222222 Sure. The space char is in the lower left corner of the page, so the reader is quite right to be confused by it (Acrobat Pro seems to ignore it).

car222222 commented 3 years ago

Aha yes! And it is also 'within the paragraph'. I can picture it in my mind now. Thanks.

Interesting that Pro appears to interpret it differently. Should we tell someone about that difference?

u-fischer commented 3 years ago

A fix has been added to the pdftex sources. I tested with the updated binaries from w32tex.org, and it seems to work fine now.