latex3 / tagpdf

Tagging support code for LaTeX

Add spaces between words #1

Closed zauguin closed 5 years ago

zauguin commented 6 years ago

A first draft that adds most spaces. With this, only the spaces at the end of a line are missing, so the text representation becomes almost readable. The missing spaces could be added by running the code before line breaking, but that could break some macros. Also, every space currently comes from the font that is current when the package is loaded. This simplifies the code, but it increases the PDF size by adding unnecessary font switches, and it leads to problems if that font is not well-formed.

u-fischer commented 6 years ago

Thank you for looking into this! My plan was to adapt and update Hans Hagen's injectspace function from an older TUGboat article, https://www.tug.org/TUGboat/tb31-3/tb99hagen.pdf. I think your code is doing something similar.

I suspect that one doesn't have to worry about line breaks. My guess is that a line end counts as a space anyway, unless the line ends with a soft hyphen.

But one should really consider where the space char is taken from.

zauguin commented 6 years ago

@u-fischer If you use the code from Hans Hagen, remember to test with math fonts that are not unicode-math fonts. They are often used with LuaLaTeX and do not have a space in slot 32.

Regarding line breaks: I thought so too, but Acrobat disagreed...

u-fischer commented 6 years ago

Hm yes math ;-). Probably one shouldn't/needn't insert spaces there (and perhaps also not in artifacts). That probably means that one will have to mark the math with attributes; looking only for the mathon/mathoff glyph could fail at page breaks. I made a branch from the code.

How did you test in Acrobat how it handles "fake" spaces?

u-fischer commented 6 years ago

I have been pondering this. Line breaks are a problem, and I don't see a really good way to solve this in this function. So I think that one can insert the fake spaces during this traversal, but that one should identify the relevant char nodes earlier, e.g. in the pre-linebreak callback. One could add an attribute to the "last char" and test for it later. This would probably also avoid lots of fake spaces in math and other places where they are not needed.

zauguin commented 6 years ago

That sounds like a great idea. Instead of adding attributes to the glyph, I would add them to the glue representing the space itself though; that seems clearer. These would not be lost in case of a line break, because a line break at a glue transforms the glue (including its attributes) into the right_skip glue node.
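The idea of marking the glue before line breaking can be illustrated with a small toy model. This is only a sketch in Python, not LuaTeX code: the `Node` class, the attribute name `interwordspace`, and the simplistic `break_lines` function are all hypothetical stand-ins, and the model simply assumes (as described in the comment above) that the glue at a break survives with its attributes.

```python
from dataclasses import dataclass, field

SPACE_ATTR = "interwordspace"  # hypothetical attribute name


@dataclass
class Node:
    kind: str                      # "glyph" or "glue"
    char: str = ""                 # only used for glyph nodes
    attrs: dict = field(default_factory=dict)


def to_nodes(text):
    """Turn a string into a toy node list: spaces become glue nodes."""
    return [Node("glue") if c == " " else Node("glyph", c) for c in text]


def mark_interword_glue(nodes):
    """Pre-linebreak pass: tag every glue node that represents a space."""
    for n in nodes:
        if n.kind == "glue":
            n.attrs[SPACE_ATTR] = True
    return nodes


def break_lines(nodes, min_glyphs):
    """Toy line breaker: break at a glue once the line holds min_glyphs glyphs.

    The glue at the break stays at the end of its line, standing in for
    the glue that becomes the right_skip node.
    """
    lines, current, count = [], [], 0
    for n in nodes:
        current.append(n)
        if n.kind == "glyph":
            count += 1
        elif count >= min_glyphs:
            lines.append(current)
            current, count = [], 0
    if current:
        lines.append(current)
    return lines


def extract_text(lines):
    """Later pass: emit a space wherever a marked glue is found."""
    out = []
    for line in lines:
        for n in line:
            if n.kind == "glyph":
                out.append(n.char)
            elif n.attrs.get(SPACE_ATTR):
                out.append(" ")
    return "".join(out)


marked = break_lines(mark_interword_glue(to_nodes("add spaces between words")), 5)
unmarked = break_lines(to_nodes("add spaces between words"), 5)
```

In this model the marked list yields the original text across three lines, while the unmarked list loses every space, including those at line ends.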

Regarding the font of the space character: What do you think about the pdfTeX approach? When \pdfinterwordspaceon is active, pdfTeX includes fake spaces by using a dedicated font (dummy-space.pfb) which only includes a zero-width space character. This would make the code easier, because no additional negative glue is needed, and there is no risk of some TeX font not having a space in the space slot.

u-fischer commented 6 years ago

If the glue doesn't disappear at line breaks, it is certainly more logical to mark this up.

Regarding the dummy-space.pfb: Its glyph isn't a zero-width character:

 {\font\test = dummy-space \test \showthe\fontcharwd\font32}

reports a width of 0.01pt. So I guess negative space is needed anyway. I think I would prefer to use the (hopefully existing) space char of the current font to avoid lots of font switches in the PDF. And I think one can expect people who put quite some work into getting a document tagged to use sensible fonts ;-).

u-fischer commented 5 years ago

I have now implemented the code: I mark up the locations in pre_linebreak_filter and hpack_filter, and add the space glyphs in the main loop. It seems to work fine. I also added some code to test the "current" font for a space glyph and to fall back to Latin Modern if there is none.
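The font test and the width compensation discussed in this thread can be sketched as follows. This is a toy model in Python and not the tagpdf implementation: the font table, the font names, the widths (in scaled points), and the functions `choose_space_font` and `fake_space` are hypothetical and only illustrate the logic of "use the current font if it has a glyph in slot 32, otherwise fall back, and cancel the glyph's width with negative glue".

```python
SPACE_SLOT = 32  # slot of the space character

# Hypothetical font table: font name -> {slot: width in scaled points}.
FONTS = {
    "textfont": {32: 218453, 65: 491520},
    "mathfont": {65: 491520},            # no glyph in slot 32
    "lmroman": {32: 218235, 65: 491520},  # stands in for Latin Modern
}


def choose_space_font(current, fonts, fallback):
    """Return the current font if it has a space glyph, else the fallback."""
    if SPACE_SLOT in fonts[current]:
        return current
    return fallback


def fake_space(current, fonts, fallback):
    """Build a space glyph plus negative glue of the same width.

    The pair has zero net advance, so the visible layout is unchanged.
    """
    font = choose_space_font(current, fonts, fallback)
    width = fonts[font][SPACE_SLOT]
    return [("glyph", font, SPACE_SLOT, width), ("glue", -width)]


def net_advance(nodes):
    """Sum the horizontal advance of the toy nodes (last tuple field)."""
    return sum(node[-1] for node in nodes)


text_space = fake_space("textfont", FONTS, "lmroman")
math_space = fake_space("mathfont", FONTS, "lmroman")
```

Here `text_space` keeps the current font, which avoids an extra font switch in the PDF, while `math_space` switches to the fallback because the hypothetical math font has nothing in slot 32.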