Open frabjous opened 7 years ago
https://github.com/jgm/pandoc/issues/256 might solve your problem (I think it's in the nightlies, and will be released with pandoc 2.0), also see #3466
When converting from Markdown, pandoc parses spaces after known abbreviations as nonbreaking spaces. This is a special pandoc Markdown feature. When converting from other formats, pandoc just parses what it finds; if the authors have included a nonbreaking space in the docx file, this should come across, otherwise not. With pandoc's Markdown we can innovate, but with other formats we just try to reproduce what is there. So I think nothing should be done here.
But that's just the point: it manifestly isn't reproducing what is there.
The issue is not the difference between a non-breaking space and a regular space. Indeed, a non-breaking space is not what is appropriate after "e.g." or "i.e." anyway. As I wrote above, that's better than a large space, but it still isn't right. There should be a regular space there.
You start with a regular space in the other format. By converting it to LaTeX, which treats a space after a period as a large space, you are not reproducing what is in the source document, you are mangling it.
Now I suppose you could make a case that "reproducing what is there" means that all spaces after periods, in a source document that treats them all alike, should become regular spaces in the output. But I simply don't understand the argument that turning regular spaces in the source document, in certain contexts (but not others), into large spaces that look inconsistent and ugly is proper behavior.
I can reopen this. Note that `\frenchspacing` in the preamble of your LaTeX document will make the intersentence spaces regular spaces; this might solve your problem.
There are a couple of alternative solutions to think about. One would be to move the abbreviation detecting logic into the LaTeX writer -- but then other writers don't get to take advantage of the nice nonbreaking spaces, and anyway it's much easier to detect these things at the parsing stage.
Another would be to put sentences into the AST generally (#3466): a huge thing to implement if we do it in all readers.
A third alternative would be to modify all readers to include abbreviation detection. Again, a big job.
A fourth alternative would be to move abbreviation detection to an AST filter, run between readers and writers. This would allow it to work for all readers without duplication of code. But there would be some performance implications since we'd be walking the AST again.
Understood, thanks a lot. It does sound like a tricky thing to know how best to implement.
I know about `\frenchspacing`, but one doesn’t always want that. Right now I'm using my own post-processing script, which works fine for me personally (especially as I'm doing some other custom modifications as part of my workflow), but I thought it was something I'd mention in case it was relevant to others.
I'm warming to the fourth alternative, above. Move abbreviation detection out of the Markdown reader and into a general purpose filter, which we could run automatically if a certain extension is enabled on the reader (`abbreviations`).
The filter would just have to scan the AST for patterns like
..., Str "e.g.", Space, ...
and convert the space to an appropriate Unicode space character.
This would be configurable with a custom abbreviations list, as it currently is.
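To make the idea concrete, here is a minimal sketch in Python of such a pass over a flat list of inlines in pandoc's JSON form (`{"t": "Str", "c": ...}`, `{"t": "Space"}`). The names `protect_abbreviations` and `DEFAULT_ABBREVIATIONS` are hypothetical, the choice of U+00A0 is an assumption, and the sketch does not recurse into nested inlines such as emphasis. It only illustrates the pattern match, not how pandoc would actually implement it.

```python
# Sketch only: hypothetical names, not part of pandoc.
NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE (assumed target character)

# Stand-in for the user-configurable abbreviations list.
DEFAULT_ABBREVIATIONS = frozenset({"e.g.", "i.e.", "vol.", "pp."})


def protect_abbreviations(inlines, abbreviations=DEFAULT_ABBREVIATIONS):
    """Return a new inline list in which a Space that directly follows
    a known abbreviation is replaced by a Str holding a nonbreaking space."""
    result = []
    for node in inlines:
        prev = result[-1] if result else None
        if (
            node.get("t") == "Space"
            and prev is not None
            and prev.get("t") == "Str"
            and prev.get("c") in abbreviations
        ):
            result.append({"t": "Str", "c": NBSP})
        else:
            result.append(node)
    return result


inlines = [
    {"t": "Str", "c": "e.g."},
    {"t": "Space"},
    {"t": "Str", "c": "apples."},
    {"t": "Space"},
    {"t": "Str", "c": "Next"},
]
converted = protect_abbreviations(inlines)
```

Only the first `Space` is rewritten here; the one after "apples." is left alone because "apples." is not in the list. A token such as "(e.g." would not match in this sketch, so a real version would need to strip leading punctuation first.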
The real question is what the performance cost would be. My guess is that it will be small enough that most people won't notice; and those who care could always switch off this feature. (Also, note that we'd save on performance in the parser by not matching abbreviations there.) This needs to be tried and benchmarked.
Or we could simply run this transformation prior to writing as LaTeX (and perhaps ConTeXt), since it's not clear that the special treatment is really useful for other formats.
If we made it a LaTeX/ConTeXt specific transformation, another advantage is that we could use an escaped regular space instead of a nonbreaking space. I'm coming around to thinking that maybe this should be our approach.
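A rough sketch of what the writer-specific variant might look like, again in Python over pandoc-style JSON inlines. The function `render_latex_inlines` is hypothetical and is not pandoc's LaTeX writer; it handles only `Str` and `Space`, silently skips every other inline type, and does no LaTeX escaping. The point is just that at this stage the output can be a control space (`\ `) rather than `~`.

```python
# Sketch only: a toy renderer, not pandoc's actual LaTeX writer.
ABBREVIATIONS = frozenset({"e.g.", "i.e."})


def render_latex_inlines(inlines, abbreviations=ABBREVIATIONS):
    # Emit a control space after known abbreviations so that TeX does
    # not apply end-of-sentence spacing there; line breaks stay allowed.
    out = []
    prev_text = None
    for node in inlines:
        if node["t"] == "Str":
            out.append(node["c"])  # no LaTeX escaping in this sketch
            prev_text = node["c"]
        elif node["t"] == "Space":
            out.append("\\ " if prev_text in abbreviations else " ")
            prev_text = None
    return "".join(out)


sample = [
    {"t": "Str", "c": "Fruit,"},
    {"t": "Space"},
    {"t": "Str", "c": "e.g."},
    {"t": "Space"},
    {"t": "Str", "c": "apples."},
    {"t": "Space"},
    {"t": "Str", "c": "Next"},
]
latex = render_latex_inlines(sample)
```

With this input, `latex` is `Fruit, e.g.\ apples. Next`: the space after "e.g." becomes a control space, while the genuine sentence end after "apples." keeps its ordinary space and therefore the normal inter-sentence spacing.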
There's also groff ms/man, where we distinguish sentence-ending periods by printing one sentence per line.
As you know, LaTeX puts extra space between sentences, or really after punctuation like `.` and `?`, by default. Sometimes you don't want this, such as after abbreviations.
I was pleasantly surprised to learn that when converting from Markdown to LaTeX, after abbreviations such as "e.g." and "i.e.", a nonbreaking space `~` is used in the LaTeX output (though a regular space `\ ` would seem more appropriate for "e.g."; still, this is better than inter-sentence spacing).
However, this behavior does not seem to be implemented when converting from certain other formats (at least not when converting from `.docx` files).
The following phrases should be considered: "e.g.", "i.e.", "etc." [when the next letter is lowercase], "chap.", "vol.", "p.", "pp." -- maybe some others. If there's a reliable way of detecting it, then also in people's initials, e.g., "G.\ E.\ Moore".
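For illustration, the rules in that list could be sketched as a small text-level pass in Python. Everything here is an assumption for the sake of the example: the function name `escape_abbreviation_spaces`, the exact regular expressions, and the decision to work on plain text rather than on the AST. The initials rule in particular is a heuristic and will misfire on something like "Plan A. Then".

```python
import re

# Sketch only: hypothetical rule set taken from the list above.
ALWAYS = ["e.g.", "i.e.", "chap.", "vol.", "pp.", "p."]

# A listed abbreviation, not preceded by a letter, followed by one space.
_ABBREV = re.compile(
    r"(?<![A-Za-z])(" + "|".join(re.escape(a) for a in ALWAYS) + r") "
)
# "etc." counts only when the next letter is lowercase.
_ETC = re.compile(r"(?<![A-Za-z])etc\. (?=[a-z])")
# Initials: a single capital plus period, followed by another capital.
_INITIAL = re.compile(r"(?<![A-Za-z])([A-Z]\.) (?=[A-Z])")


def escape_abbreviation_spaces(text):
    """Replace the space after abbreviations with a LaTeX control space."""
    text = _ABBREV.sub(r"\1\\ ", text)
    text = _ETC.sub(r"etc.\\ ", text)
    text = _INITIAL.sub(r"\1\\ ", text)
    return text


source = "See e.g. vol. 2, pp. 3-4, etc. and G. E. Moore, etc. The end."
escaped = escape_abbreviation_spaces(source)
```

On this input the first "etc." (followed by lowercase "and") gets a control space, while the second one (followed by "The") is treated as a sentence end and left untouched.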