jgm / pandoc

Universal markup converter
https://pandoc.org

Configurable smart quotes #2620

Open renard opened 8 years ago

renard commented 8 years ago

I am currently using pandoc and its smart punctuation to generate EPUB output, and this works pretty well.

In my source I have:

Du "texte" en français!

The output is:

Du “texte” en français!

But since the text is in French, I would like to use the French typography rules to get something like:

Du « texte » en français !

(please note the nonbreaking spaces).

So is there an (easy) way to define typography rules for an output format, or should this be an enhancement?

Thanks a lot.

jgm commented 8 years ago

You could try using the --html-q-tags option. Then use CSS to style the q tags appropriately.

If that doesn't work, then your options are:

Phyks commented 8 years ago

I am experiencing a similar issue. At first, I expected --smart to handle typography for punctuation as well, but it does not seem to do so.

The first problem with --smart when writing text in French (and maybe some other languages) is that French does not use curly quotes but French quotes « ». On some keyboard layouts (I am thinking of fr-oss) they are easy to reach, but that is not the case on every layout (especially on Windows), so being able to automatically replace " " with « » could be very helpful. This could obviously be done with a post-processing script (or a Pandoc filter), but what about including a --french-quotes option in Pandoc to do it?

The second problem is that typography, and especially the position (and nature) of whitespace, differs a lot from one language to another. In particular, in French (unlike in English), there should be a non-breaking space before any double punctuation sign (!, ?, :, ;). Similar rules exist for the spaces enclosing quotes (it should be SPACE « NON_BREAKING_SPACE TEXT NON_BREAKING_SPACE » SPACE, if I remember correctly) and so on.
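
For illustration, the two rules described above can be sketched as a plain-text post-processing step in Python. This is only a rough sketch, not something pandoc provides: the function name `french_spacing` is made up for this example, and the regular expressions know nothing about markup, so they would also touch colons in URLs or quotes inside code.

```python
import re

NBSP = "\u00a0"  # NO-BREAK SPACE


def french_spacing(text: str) -> str:
    """Naively apply French quote and punctuation spacing to plain text."""
    # Turn each pair of straight double quotes into guillemets,
    # with a non-breaking space inside each guillemet.
    text = re.sub(r'"([^"]*)"', "«" + NBSP + r"\1" + NBSP + "»", text)
    # Put exactly one non-breaking space before a run of ! ? : ;
    # replacing any ordinary or non-breaking spaces already there.
    text = re.sub(r"(?<=\S)[ \u00a0]*([!?:;]+)", NBSP + r"\1", text)
    return text


print(french_spacing('Du "texte" en français!'))
```

Because the rules are applied to raw text, a step like this is safest on prose-only input; anything containing code or links would need a filter that works on the parsed document instead.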

Non-breaking spaces in particular are almost impossible to type easily (without special tweaks to the keyboard layout). I think it would be awesome if Pandoc could handle this.

What do you think?

jgm commented 8 years ago

See #84. I'd actually never thought that a French writer would want to type " for quotes, and have them render with French quotes. But if that is the case, it wouldn't be all that hard to provide some kind of configurable option.

Another option would be localization, so that the quote style is affected by the lang metadata field. Though I gather many languages don't have one standard quoting style.

Third option would be localization + an override.
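
A minimal sketch of what "localization + an override" could look like, written in Python purely for illustration. The table `QUOTE_MARKS` and the function `quote_marks` are hypothetical names, not part of pandoc, and the table only lists the styles mentioned in this thread; as noted above, many languages have more than one accepted style, which is why the override exists.

```python
# Opening and closing double-quote marks per primary language subtag.
# Only the styles mentioned in this thread are listed (assumption: one
# style per language, which is a simplification).
QUOTE_MARKS = {
    "en": ("“", "”"),
    "fr": ("«\u00a0", "\u00a0»"),  # guillemets with non-breaking spaces
    "de": ("„", "“"),
    "sv": ("”", "”"),
    "da": ("»", "«"),
}


def quote_marks(lang, override=None):
    """Return (opening, closing) marks for a lang value such as 'fr-FR'."""
    if override is not None:
        # An explicit user choice always wins over the language default.
        return override
    primary = lang.split("-")[0].lower()
    # Fall back to English-style curly quotes for unknown languages.
    return QUOTE_MARKS.get(primary, QUOTE_MARKS["en"])


print(quote_marks("sv"))
print(quote_marks("fr-FR", override=("«", "»")))
```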

Phyks commented 8 years ago

Concerning the quotes, I may have an unusual approach, but " does seem to me to be the most widely available quote character, and the easiest to type. So having it automatically replaced with « / » would be awesome, in my opinion. Still, there should be a way to prevent the automatic conversion (such as escaping), so that a literal " can be typed in a French text as well (but the same problem exists for English typography).

:+1: for the localization-based approach, using the lang metadata field. Or an override option. The advantage of the localization-based method is that it also allows tweaking non-breaking spaces depending on the language.

renard commented 8 years ago

Having a --french-quotes option is not a good idea, since it covers only one very specific case. Having a --lang option is a better idea if the language map can be extended. LaTeX uses babel for that task.

jgm commented 7 years ago

Related issue #661

snan commented 7 years ago

I'm stoked AF at the idea of being able to set it up so I can get »Danish style quotes« from --smart with a babel-like solution♥

tetov commented 5 years ago

Hi!

I've been reading this issue and #84 as well as the documentation but I haven't really understood how this should work, and if it's implemented for my use case.

I write text in markdown that I convert to ICML to use in InDesign documents. When I write Swedish text I want the opening and closing quotes to be identical: ”…”.

Here is my input and outputs:

sh-4.4$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

sh-4.4$ pandoc -v
pandoc 2.5
Compiled with pandoc-types 1.17.5.4, texmath 0.11.1.2, skylighting 0.7.5
Default user data directory: /home/tetov/.pandoc
Copyright (C) 2006-2018 John MacFarlane
Web:  http://pandoc.org
This is free software; see the source for copying conditions.
There is no warranty, not even for merchantability or fitness
for a particular purpose.

sh-4.4$ cat test.md
---
lang: sv
---

"Test" ... --

sh-4.4$ pandoc -s -w icml -o test.icml test.md

sh-4.4$ cat test.icml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<?aid style="50" type="snippet" readerVersion="6.0" featureSet="513" product="8.0(370)" ?>
<?aid SnippetType="InCopyInterchange"?>
<Document DOMVersion="8.0" Self="pandoc_doc">
    <RootCharacterStyleGroup Self="pandoc_character_styles">
      <CharacterStyle Self="$ID/NormalCharacterStyle" Name="Default" />

    </RootCharacterStyleGroup>
    <RootParagraphStyleGroup Self="pandoc_paragraph_styles">
      <ParagraphStyle Self="$ID/NormalParagraphStyle" Name="$ID/NormalParagraphStyle"
          SpaceBefore="6" SpaceAfter="6"> <!-- paragraph spacing -->
        <Properties>
          <TabList type="list">
            <ListItem type="record">
              <Alignment type="enumeration">LeftAlign</Alignment>
              <AlignmentCharacter type="string">.</AlignmentCharacter>
              <Leader type="string"></Leader>
              <Position type="unit">10</Position> <!-- first tab stop -->
            </ListItem>
          </TabList>
        </Properties>
      </ParagraphStyle>
      <ParagraphStyle Self="ParagraphStyle/Paragraph" Name="Paragraph" LeftIndent="0">
        <Properties>
          <BasedOn type="object">$ID/NormalParagraphStyle</BasedOn>
        </Properties>
      </ParagraphStyle>
    </RootParagraphStyleGroup>
    <RootTableStyleGroup Self="pandoc_table_styles">
      <TableStyle Self="TableStyle/Table" Name="Table" />
    </RootTableStyleGroup>
    <RootCellStyleGroup Self="pandoc_cell_styles">
      <CellStyle Self="CellStyle/Cell" AppliedParagraphStyle="ParagraphStyle/$ID/[No paragraph style]" Name="Cell" />
    </RootCellStyleGroup>
  <Story Self="pandoc_story"
      TrackChanges="false"
      StoryTitle=""
      AppliedTOCStyle="n"
      AppliedNamedGrid="n" >
    <StoryPreference OpticalMarginAlignment="true" OpticalMarginSize="12" />

<!-- body needs to be non-indented, otherwise code blocks are indented too far -->
<ParagraphStyleRange AppliedParagraphStyle="ParagraphStyle/Paragraph">
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content>“Test”</Content>
  </CharacterStyleRange>
  <CharacterStyleRange AppliedCharacterStyle="$ID/NormalCharacterStyle">
    <Content> … –</Content>
  </CharacterStyleRange>
</ParagraphStyleRange>

  </Story>

</Document>

What I want/expect is:

[...]
    <Content>”Test”</Content>
[...]

I can achieve this by adding -f markdown-smart as an argument, but I'd rather keep the other fixes smart does.

Is this a planned feature (to have specific quotes for different languages in ICML output) or is the solution to use -smart?

jgm commented 5 years ago

@tetov - at this point, the solution is to use -smart. Maybe some day we'll implement configurable smart quotes, but it's not a priority now.

snan commented 5 years ago

In that case, I have a workaround for you, Anton: pipe the text through sed 's/"/”/g' before putting it into pandoc. You're lucky that your desired quotes aren't symmetrical so you don't have to use anything "smart" in order to get them.

Be aware that Swedish has some other typesetting quirks like using spaced endashes – like this – rather than English-style non-spaced emdashes—like this—and there are some other weird things.

So perhaps it's best to either make sure your source document already has the typography you want (I sometimes use emacs smart-quotes-mode for this) or you run it through a quick little sed, perl, or tr filter before pandoc. Does that work?

tetov commented 5 years ago

@jgm Thanks, I understand!

@snan I thought about processing the text but didn't really know where to put that processing, and the examples I found (which were for symmetrical quotes) looked daunting. Thanks! I'll add it before pandoc in my makefile.

I wasn't aware that those differences existed! Thanks a lot for pointing them out. I have some reading to do :).

jgm commented 5 years ago

snan writes:

In that case, I have a workaround for you, Anton: pipe the text through sed 's/"/”/g' before putting it into pandoc. You're lucky that your desired quotes aren't symmetrical so you don't have to use anything "smart" in order to get them.

This will work fine unless you have straight quotes in non-textual contexts: code, HTML attributes, titles in markdown links.

In that case, you could achieve the same thing by using a simple lua filter, in conjunction with -smart.

tetov commented 5 years ago

I'll need to spend some more time learning Lua and Lua filters in order to get that to work. I've forked the lua-filters repo and started to cobble something together from the existing samples.

In the meantime I made a hacky solution in my Makefile.

Thanks for your help, @jgm and @snan!

Edit: While working on adding single quotation marks as well as dashes I realized that I could run the sed commands on the output-file, like this:

sed -i -e 's/‘/’/g' -e 's/“/”/g' output.icml

This gives me all of the benefits of smart while still keeping symmetrical quotation marks. Pandoc respects spaces around en-dashes, so that is not a problem either.

tetov commented 5 years ago

@jgm:

This will work fine unless you have straight quotes in non-textual contexts: code, HTML attributes, titles in markdown links. In that case, you could achieve the same thing by using a simple lua filter, in conjunction with -smart.

Which runs first: the smart function or the Lua filter? I was thinking about putting the regexp from my edit above into a Lua filter to make it work with any output format.

jgm commented 5 years ago

Smartification takes place at the parsing stage, so in the filter you'll have Quoted objects you can replace.
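
To make that concrete, here is a rough sketch of the replacement step, written in Python against pandoc's JSON representation of the document rather than as a Lua filter. It assumes the usual JSON shape of a Quoted inline, `{"t": "Quoted", "c": [quote_type, inlines]}`, and the function name `replace_quoted` is made up for this example. For simplicity it uses the same pair of marks for single and double quotes.

```python
def replace_quoted(node, marks=("”", "”")):
    """Recursively replace Quoted inlines with literal marks around their contents."""
    if isinstance(node, list):
        out = []
        for item in node:
            if isinstance(item, dict) and item.get("t") == "Quoted":
                # item["c"] is [quote_type, inlines]; the type is ignored here.
                _quote_type, inlines = item["c"]
                out.append({"t": "Str", "c": marks[0]})
                out.extend(replace_quoted(inlines, marks))
                out.append({"t": "Str", "c": marks[1]})
            else:
                out.append(replace_quoted(item, marks))
        return out
    if isinstance(node, dict):
        return {key: replace_quoted(value, marks) for key, value in node.items()}
    # Plain strings and numbers (including the text of code spans) are left alone.
    return node


para = {"t": "Para", "c": [
    {"t": "Quoted", "c": [{"t": "DoubleQuote"}, [{"t": "Str", "c": "Test"}]]},
    {"t": "Space"},
    {"t": "Code", "c": [["", [], []], 'x = "y"']},
]}
print(replace_quoted(para))
```

Since only Quoted elements are rewritten, straight quotes in code, attributes, and link titles stay untouched, which is the advantage over running sed on the source. To use something like this as a real filter, the function would be applied to the JSON document that pandoc passes to a `--filter` program on standard input, and the result written back to standard output.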


snan commented 5 years ago

I do similar preprocessing (with sed and similar tools) to change • bullets into hyphen bullets. Man, I wish ✪ would add that to markdown, that's the one thing I really miss from how I write plain text files.

mrzool commented 5 years ago

Just wanted to chime in to say that localized smart quotes would be a fantastic feature to have.

As already said elsewhere, @Phyks's suggestion to have a --french-quotes flag doesn't make much sense. Why pick just French when so many languages have their own quoting rules?

In my case, German uses „ as opening and “ as closing quotes. Being able to automate the conversion from straight to curly would be a tremendous boon and would help me enormously in the editorial work I do (mainly converting Markdown to HTML).

mb21 commented 5 years ago

converting Markdown to HTML

then see https://github.com/jgm/pandoc/issues/2620#issuecomment-169099590

mrzool commented 5 years ago

@mb21 Using the --html-q-tags flag would result in a <q> tag being used for everything between quotation marks. That would be wrong in most cases, since that tag is meant to mark up inline quotations, which are only a small subset of my actual use cases. Besides that being semantically incorrect, I just need clean HTML without any CSS.

Using proper German quotes in the input is what I already do — before converting the markdown with the --ascii flag to replace them with the corresponding HTML entities. I substitute manually every single straight quote in the drafts I receive from all over the place. It takes time, and that’s the process I’d like to automate.

As for using sed or perl to post-process the output, I haven't explored that possibility, but it would probably be the way to go until this functionality hopefully gets baked into Pandoc.

tarleb commented 5 years ago

@odkr wrote a great Lua filter to handle this problem: https://github.com/odkr/pandoc-quotes.lua. It is now also available as part of the pandoc lua-filters collection: https://github.com/pandoc/lua-filters/tree/master/pandoc-quotes.lua

mrzool commented 5 years ago

@odkr @tarleb That looks great. Thanks for bringing it to my attention.

jhutar commented 10 months ago

Hello! This might help. Assume you have this markdown doc:

---
lang: fr
csquotes: true
---

"Quotation test"

Using this command:

pandoc --pdf-engine=xelatex -o example.pdf example.md

You will get a PDF with this quotation:

« Quotation test »

unera commented 1 week ago

You could try using the --html-q-tags option. Then use CSS to style the q tags appropriately.

This is a good approach for HTML, but the wrong one for EPUB2, for example. In EPUB2(?)/FB2 the <q> tag doesn't work properly, so FBReader on Android (for example) can't display the quotation marks properly.

So it would be nice to have a way to convert "text" quoting to «text» without any tags.

PS: French is not the only language that uses this quoting style.

alerque commented 1 week ago

How about a Lua filter that replaces quote and double quote entities with plain elements with the quotes stuffed on the beginning and end? If you're trying to output to a different format and are just worried about the output, that should be pretty straightforward. If you want them in the source and to round-trip, that might be a little more involved.

odkr commented 6 days ago

@unera and @alerque, I wrote that Lua filter a long time ago. (So long ago that I should have a look at it again, but it should work.)