jgm / pandoc

Universal markup converter
https://pandoc.org

Suppressing intersentence spacing in LaTeX output after abbreviations #3641

Open frabjous opened 7 years ago

frabjous commented 7 years ago

As you know, LaTeX puts extra space between sentences, or really after punctuation like . and ? by default.

Sometimes you don't want this, such as after abbreviations.

I was pleasantly surprised to learn that when converting from markdown to LaTeX, after abbreviations such as "e.g." and "i.e.", a nonbreaking space ~ is used in LaTeX output (though a regular space \ would seem more appropriate for "e.g."; still, this is better than inter-sentence spacing).

However, this behavior does not seem to be implemented when converting from certain other formats (at least not when converting from .docx files).

The following phrases should be considered: "e.g.", "i.e.", "etc." [when next letter is lowercase], "chap.", "vol.", "p.", "pp." -- maybe some others. If there's a reliable way of detecting it, then also in people's initials, e.g., "G.\ E.\ Moore".

mb21 commented 7 years ago

https://github.com/jgm/pandoc/issues/256 might solve your problem (I think it's in the nightlies, and will be released with pandoc 2.0), also see #3466

jgm commented 7 years ago

When converting from Markdown, pandoc parses spaces after known abbreviations as nonbreaking spaces. This is a special pandoc Markdown feature. When converting from other formats, pandoc just parses what it finds; if the authors have included a nonbreaking space in the docx file, this should come across, otherwise not. With pandoc's Markdown we can innovate, but with other formats we just try to reproduce what is there. So I think nothing should be done here.

frabjous commented 7 years ago

But that's just the point: it manifestly isn't reproducing what is there.

The issue is not the difference between a non-breaking space and a regular space. Indeed, a non-breaking space is not what is appropriate after "e.g." or "i.e." anyway. As I wrote above, that's better than a large space, but it still isn't right. There should be a regular space there.

You start with a regular space in the other format. By converting it to LaTeX, which treats a space after a period as a large space, you are not reproducing what is in the source document, you are mangling it.

Now I suppose you could make a case that "reproducing what is there" means that all spaces after periods, in a source document that treats them all alike, should become regular spaces in the output. But I simply don't understand the argument that turning regular spaces in the source document into large spaces in certain contexts (but not others), where they are inconsistent and ugly, is proper behavior.

jgm commented 7 years ago

I can reopen this. Note that \frenchspacing in the preamble of your LaTeX document will make the intersentence spaces regular spaces; this might solve your problem.

There are a couple of alternative solutions to think about. One would be to move the abbreviation detecting logic into the LaTeX writer -- but then other writers don't get to take advantage of the nice nonbreaking spaces, and anyway it's much easier to detect these things at the parsing stage.

Another would be to put sentences into the AST generally (#3466): a huge thing to implement if we do it in all readers.

A third alternative would be to modify all readers to include abbreviation detection. Again, a big job.

A fourth alternative would be to move abbreviation detection to an AST filter, run between readers and writers. This would allow it to work for all readers without duplication of code. But there would be some performance implications since we'd be walking the AST again.

frabjous commented 7 years ago

Understood, thanks a lot. It does sound like a tricky thing to know how best to implement.

I know about \frenchspacing, but one doesn’t always want that. Right now I'm using my own post-processing script, which works fine for me personally (especially as I'm doing some other custom modifications as part of my workflow), but I thought it was something I'd mention in case it was relevant to others.
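For readers who want a stopgap of this kind, here is a minimal sketch of a post-processing pass over the generated LaTeX. It is not the script mentioned above; the abbreviation list is taken from the phrases suggested earlier in the thread, and the function name `fix_abbreviation_spacing` is hypothetical. It leaves out "etc." (which needs a lowercase check) and initials.

```python
import re

# Hypothetical list, based on the phrases suggested earlier in this thread.
ABBREVIATIONS = ["e.g.", "i.e.", "chap.", "vol.", "pp.", "p."]

# An abbreviation that does not continue a longer word, followed by exactly
# one ordinary space and then some non-space character.
_PATTERN = re.compile(
    r"(?<![A-Za-z])("
    + "|".join(re.escape(a) for a in ABBREVIATIONS)
    + r") (?=\S)"
)


def fix_abbreviation_spacing(latex: str) -> str:
    """Replace the space after a known abbreviation with LaTeX's control space."""
    # A function is used as the replacement so the backslash is taken literally.
    return _PATTERN.sub(lambda m: m.group(1) + "\\ ", latex)


if __name__ == "__main__":
    print(fix_abbreviation_spacing("See e.g. the text, p. 4. Next sentence."))
```

A plain-text pass like this cannot tell verbatim or math material from running text, so it would need more care before being applied to arbitrary documents.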

jgm commented 7 years ago

I'm warming to the fourth alternative, above. Move abbreviation detection out of the Markdown reader and into a general purpose filter, which we could run automatically if a certain extension is enabled on the reader (abbreviations).

The filter would just have to scan the AST for patterns like

..., Str "e.g.", Space, ...

and convert the space to an appropriate unicode space character.

This would be configurable with a custom abbreviations list, as it currently is.
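The scan described above can be sketched roughly as follows. This is an illustration only, not pandoc's implementation: it models inlines as the dictionaries used in pandoc's JSON AST (`{"t": "Str", "c": "e.g."}`, `{"t": "Space"}`), handles a single flat inline list rather than walking a whole document, and the names `protect_abbreviations` and `DEFAULT_ABBREVIATIONS` are hypothetical.

```python
from typing import Any, Dict, Iterable, List

Inline = Dict[str, Any]

# Hypothetical default list; the thread proposes keeping it user-configurable.
DEFAULT_ABBREVIATIONS = frozenset({"e.g.", "i.e.", "chap.", "vol.", "p.", "pp."})
NBSP = "\u00a0"  # Unicode no-break space


def protect_abbreviations(
    inlines: List[Inline],
    abbreviations: Iterable[str] = DEFAULT_ABBREVIATIONS,
) -> List[Inline]:
    """Return a copy of `inlines` where a Space that directly follows a
    known abbreviation is replaced by a Str holding a no-break space."""
    abbrevs = set(abbreviations)
    result: List[Inline] = []
    for inline in inlines:
        prev = result[-1] if result else None
        follows_abbrev = (
            prev is not None
            and prev.get("t") == "Str"
            and prev.get("c") in abbrevs
        )
        if inline.get("t") == "Space" and follows_abbrev:
            result.append({"t": "Str", "c": NBSP})
        else:
            result.append(inline)
    return result
```

Matching on the whole `Str` content is the simplest choice but misses cases such as "(e.g."; a real filter would probably need to strip leading punctuation before the lookup.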

The real question is what the performance cost would be. My guess is that it will be small enough that most people won't notice; and those who care could always switch off this feature. (Also, note that we'd save on performance in the parser by not matching abbreviations there.) This needs to be tried and benchmarked.

jgm commented 7 years ago

Or we could simply run this transformation prior to writing as LaTeX (and perhaps ConTeXt), since it's not clear that the special treatment is really useful for other formats.

jgm commented 7 years ago

If we made it a LaTeX/ConTeXt specific transformation, another advantage is that we could use an escaped regular space instead of a nonbreaking space. I'm coming around to thinking that maybe this should be our approach.
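As a rough illustration of the writer-side variant, the sketch below renders a flat list of inlines to LaTeX text and emits a control space (`\ `) after an abbreviation instead of an ordinary space. It is a toy under stated assumptions: inlines are modelled as pandoc JSON AST dictionaries, only `Str` and `Space` are handled, no LaTeX escaping is done, and `render_inlines_latex` is a hypothetical name, not part of pandoc.

```python
from typing import Any, Dict, List

Inline = Dict[str, Any]

# Hypothetical abbreviation list for the illustration.
ABBREVIATIONS = frozenset({"e.g.", "i.e.", "chap.", "vol.", "p.", "pp."})


def render_inlines_latex(inlines: List[Inline]) -> str:
    """Render Str/Space inlines, writing '\\ ' after known abbreviations."""
    parts: List[str] = []
    after_abbrev = False
    for inline in inlines:
        kind = inline.get("t")
        if kind == "Str":
            text = inline["c"]
            parts.append(text)  # note: special characters are not escaped here
            after_abbrev = text in ABBREVIATIONS
        elif kind == "Space":
            # Control space keeps normal interword spacing and allows a line break.
            parts.append("\\ " if after_abbrev else " ")
            after_abbrev = False
        else:
            raise ValueError(f"unsupported inline type: {kind!r}")
    return "".join(parts)
```

Because the decision is made at output time, the AST itself stays unchanged and other writers are unaffected, which is the trade-off raised in the comment above.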

There's also groff ms/man, where we distinguish sentence-ending periods by printing one sentence per line.