latex3 / tagging-project

Issues related to the tagging project
https://latex3.github.io/tagging-project/
LaTeX Project Public License v1.3c

spaces lost around `\text<xx>` commands with pdflatex #255

Open mbertucci47 opened 4 months ago

mbertucci47 commented 4 months ago

With pdflatex, spaces are lost around \text<xx> commands at the pdf level, by which I mean in the tags and when copying/pasting. Here's an example.

\DocumentMetadata
  {
    lang=en-US,
    pdfversion=2.0,
    pdfstandard=ua-2,
    testphase={phase-III}
  }

\documentclass{article}

\begin{document}

normal text

normal text \textit{italic text} normal text

normal text \textbf{bold text} normal text

normal text \textsl{slanted text} normal text

\end{document}

When you copy and paste the text from the pdf you get

normal text
normal textitalic textnormal text
normal textbold textnormal text
normal textslanted textnormal text

With lualatex this does not occur, nor does it occur if tagging code is not loaded.

u-fischer commented 4 months ago

Not sure what we can do here. pdftex doesn't insert space chars at font switches:

\pdfcompresslevel0
\pdfobjcompresslevel0
\font\test=cmss10 
\pdfinterwordspaceon

text text {\test cmss cmss} text text

\bye

Not sure if this is a bug or a restriction, so I sent a message to the pdftex mailing list. One can insert the space chars manually with \pdffakespace, but it wouldn't be easy to detect in font commands whether a space is wanted or not (and it is even more difficult if font switches are used).
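For illustration, here is a minimal sketch of the manual workaround, based on the plain TeX test file above. It assumes a pdftex version that provides \pdffakespace. The placement of the two fake spaces (one before the group, one after it) is an assumption, and the resulting page stream has not been checked here.

```tex
\pdfcompresslevel0
\pdfobjcompresslevel0
\font\test=cmss10
\pdfinterwordspaceon

% \pdffakespace asks pdftex to write an explicit space char at this point;
% it is placed by hand on both sides of the font change.
text text \pdffakespace{\test cmss cmss} \pdffakespace text text

\bye
```

With compression switched off, the uncompressed PDF can be inspected in a text editor to see whether a space char now appears on each side of the cmss text.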

FrankMittelbach commented 4 months ago

Can't test right now, but what happens to 'text \mbox{}' ? Is that space also lost?

car222222 commented 4 months ago

A useful question, but what should happen to an empty mbox in the PDF: should it be treated exactly as a word?

car222222 commented 4 months ago

Extra information request:

What happens if there is a font change without a group?

text \test cmss cmss

u-fischer commented 4 months ago

text \mbox{}. % space char between text and .

text \mbox{text} % space char between text and text

text \mbox{\itshape text} % no space char between text and text

text \mbox{} \par % no space char after text

text text \test cmss cmss % no space char between text and cmss

car222222 commented 4 months ago

So there is definitely a problem with losing spaces when there is a font change.

Maybe try putting in other "non-text stuff" to see whether the problem is more general than font changes alone.

Example:

{ cmss \count 234=32 text }

Also, is it only pdftex, or do the spaces also get lost using luatex?

car222222 commented 4 months ago

The example with \par is really strange compared with the other \mbox examples!

u-fischer commented 4 months ago

The example with \par is really strange compared with the other \mbox examples!

Not really. Obviously the \mbox is irrelevant; only the text and the font changes matter, and so text \mbox{}\par is no different from text \par

Also, is it only pdftex, or do the spaces also get lost using luatex?

luatex is not affected; there the space chars are inserted not by the engine but by our Lua code.

car222222 commented 4 months ago

Aha! I had not known that "we" do not use luatex to make these magic space chars.

mbertucci47 commented 4 months ago

@u-fischer What does the tagging code do that makes the spaces disappear with it but appear without it? In copy/paste, I mean

u-fischer commented 4 months ago

@mbertucci47 Well, this is a question for the maintainer of the reader. But imho without tagging the reader will use a heuristic to insert spaces between words: if the distance is large enough it will guess that this is a word space. This normally works quite ok, but can fail if the word spaces are small. With tagging only the real space chars count, and they decide whether there is a word space or not.

davidcarlisle commented 4 months ago

@mbertucci47 For tagged PDF, interword spaces must be present in the stream as actual space characters (U+0020). Classically TeX does not do this: it just places the words by coordinate and leaves the spaces implicit.

pdftex has a built-in mechanism to "overprint" the word spaces with a space character while preserving the implicit spacing. But as this is primitive pdftex behaviour, when (as you show) it misses some word spaces there is not a lot LaTeX can do about it (other than report the problem upstream).

mbertucci47 commented 4 months ago

I see, thanks for the info

FrankMittelbach commented 4 months ago

well, while it can be argued to be a bug in pdftex (and perhaps that needs followup) we can probably get it fixed with something like

\def \DeclareTextFontCommand #1#2{%
  \DeclareRobustCommand#1[1]{%
    \ifmmode
      \nfss@text{#2##1}%
    \else
      \hmode@bgroup
        % add a fake space only if the preceding glue is
        % at least a normal interword space
        \unless\ifdim \lastskip < \fontdimen2\font
          \pdffakespace
        \fi
        \text@command{##1}%
        #2\check@icl ##1\check@icr
      \expandafter
      \egroup
    \fi
  }%
}

and some check in \maybe@ic to see if the following char is a space and in that case also add a fake space.

Doing something similar for straight font changes using switches, e.g., ...\itshape ... \upshape..., could be possible too but is probably more fragile.
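The check for a following space could look roughly like the sketch below. It does not patch \maybe@ic itself: \FakeSpaceIfNext, \fakespace@check and \myemph are hypothetical names invented for this illustration. It assumes pdflatex, since \pdffakespace is a pdftex primitive, and it only covers the space after the command, not the one before it.

```latex
\documentclass{article}

\makeatletter
% Hypothetical helper, not part of the LaTeX kernel:
% peek at the next token without consuming it ...
\newcommand*\FakeSpaceIfNext{\futurelet\@let@token\fakespace@check}
% ... and emit a fake space only if that token is a space.
\newcommand*\fakespace@check{%
  \ifx\@let@token\@sptoken
    \pdffakespace
  \fi}
\makeatother

% Hypothetical font command that runs the check after its argument.
\newcommand\myemph[1]{{\itshape #1}\FakeSpaceIfNext}

\begin{document}
normal text \myemph{italic text} normal text

normal text \myemph{italic text}. No fake space before the full stop.
\end{document}
```

The peeked space token is left in the input, so the normal interword glue is still typeset after the fake space.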

davidcarlisle commented 4 months ago

@FrankMittelbach \xspace strikes again:-) we could catch some cases that way but @u-fischer's tests such as

text \mbox{\itshape text} % no space char between text and text

shows the difficulty of picking this up at the macro layer, I don't see how \itshape can look back and fix the space outside the current box.

FrankMittelbach commented 3 months ago

@FrankMittelbach \xspace strikes again:-) we could catch some cases that way but @u-fischer's tests such as

text \mbox{\itshape text} % no space char between text and text

shows the difficulty of picking this up at the macro layer, I don't see how \itshape can look back and fix the space outside the current box.

well, \mbox can perhaps do that in that case. As far as I can see, having unnecessary \pdffakespaces around (in a row) doesn't matter (or does it?), and if not, \mbox could make the same test and insert such a faked space in front of itself if it is preceded by a space.
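As a rough sketch of that idea, and not a proposed kernel change: \FakeSpaceMbox is a hypothetical wrapper name, and the simple \lastskip test is an assumption about what "preceded by a space" should mean. It needs pdflatex.

```latex
\documentclass{article}

% Hypothetical wrapper around \mbox, for illustration only.
\newcommand\FakeSpaceMbox[1]{%
  \leavevmode
  % \lastskip is nonzero if the last item on the current list is glue,
  % e.g. the interword space just typeset before the box.
  \ifdim\lastskip>0pt
    \pdffakespace
  \fi
  \mbox{#1}}

\begin{document}
text \FakeSpaceMbox{\itshape text}

text\FakeSpaceMbox{\itshape text}% no preceding space, so no fake space
\end{document}
```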

However, I think it is really something to ask Thanh if it can't be fixed in pdfTeX proper.

car222222 commented 3 months ago

Did anyone else see a whole slew of irrelevant ideas from me?? Not sure where they came from, or how they got posted here!

Removed now, I hope permanently.

FrankMittelbach commented 3 months ago

you mean that "Hello!recall some of the many deficiencies..."? Yes that showed up in my inbox. Or anything else?

car222222 commented 3 months ago

Yes, Frank, that one!