quarto-dev / quarto-cli

Open-source scientific and technical publishing system built on Pandoc.
https://quarto.org

Don't convert ascii to unicode in URLs #10042

Open cscheid opened 4 weeks ago

cscheid commented 4 weeks ago

This happens in metadata, but it's likely that it also happens in other parts.

Repro:

````
---
format: html
value: https://xn--pr-350quartodoc-ju9h.netlify.app/api/Auto.html#quartodoc.Auto
---

{{< meta value >}}
````

(This is, ultimately, the same bug that caused us to have to reimplement shortcodes: {{< foo https://pr-350--quartodoc.netlify.app >}} would destroy the URL in the past.)

cscheid commented 4 weeks ago

Quarto really ought to do this automatically ahead of Pandoc.

But I really do feel that this is a Pandoc "bug" in that its smart ASCII-to-unicode processing is way too eager in the presence of URLs.

In Quarto 1.5, we have an (admittedly fairly gross) syntax for "escaping" arbitrary Pandoc content through its Markdown representation. Consider this:

````
---
format: html
value: xn--oh.no
value2: '`Str "xn--oh.no"`{=pandoc-native}'
---

{{< meta value >}}

{{< meta value2 >}}
````

cscheid commented 4 weeks ago

Notably, my workaround is format-agnostic (because it produces an actual pandoc.Str entry in the metadata object). In contrast, @mcanouil's suggestion in #10021 only works for specific formats.

@machow If you need 1.5 to ship pristine URLs across metadata, and you know that they're URLs, you can use that syntax.

We should have a transparent mechanism for this, but the {=pandoc-native} trick should get you going.
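
For anyone generating metadata from a script, here is a minimal Python sketch of wrapping a known URL in the `{=pandoc-native}` form shown above. The helper names `pandoc_native_str` and `yaml_single_quoted` are hypothetical, not part of Quarto or Pandoc, and the sketch assumes a plain ASCII URL with no whitespace or backticks.

```python
def pandoc_native_str(text: str) -> str:
    """Wrap text as an inline `Str "..."` using the {=pandoc-native} syntax.

    Hypothetical helper for illustration only; not a Quarto or Pandoc API.
    Assumes ASCII input without whitespace or backticks (typical for URLs).
    """
    if "`" in text:
        # A backtick would close the inline code span early.
        raise ValueError("backticks are not handled by this sketch")
    if not text.isascii() or any(c.isspace() or not c.isprintable() for c in text):
        raise ValueError("this sketch only handles printable ASCII without spaces")
    # Escape the characters that are special inside a Haskell string literal.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'`Str "{escaped}"`{{=pandoc-native}}'


def yaml_single_quoted(value: str) -> str:
    """Quote a value as a YAML single-quoted scalar (single quotes are doubled)."""
    return "'" + value.replace("'", "''") + "'"


if __name__ == "__main__":
    url = "https://pr-350--quartodoc.netlify.app"
    print("value2: " + yaml_single_quoted(pandoc_native_str(url)))
```

The printed line can be pasted into the YAML front matter and read back with `{{< meta value2 >}}`. Single-quoting the YAML scalar avoids having to escape the double quotes that the `Str "..."` form requires.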

cderv commented 3 weeks ago

Re-posting below, for context, the explanation of why Pandoc converts -- to an en-dash.

This is all due to the Pandoc Markdown reader when the +smart extension is set, which is the default for from: markdown.

So we could also opt out of this extension in our qmd reader (from: markdown-smart), and this would never happen.

However, that would have other impacts on output content (especially for TeX ligatures in LaTeX PDF output).

From https://github.com/quarto-dev/quarto-cli/issues/10021#issuecomment-2175680126

How the smart extension causes -- to be read as Unicode by the Markdown reader

> Nothing should be turning -- into en-dashes. (Maybe?) Pandoc is doing that,

Just want to add some additional information on this. This is Pandoc. It has a `+smart` extension that does this. See https://pandoc.org/MANUAL.html#extension-smart

> Interpret straight quotes as curly quotes, --- as em-dashes, -- as en-dashes, and ... as ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”

This extension is activated by default and affects how things are written in the output. HTML is among the formats where en-dashes are used.

With the smart extension:

````powershell
❯ quarto pandoc --from markdown --to html
pr--450 pr-450
^Z

pr–450 pr-450
````

Without the smart extension:

````powershell
❯ quarto pandoc --from markdown-smart --to html
pr--450 pr-450
^Z

pr--450 pr-450
````

Note that the two dashes are preserved when smart is not enabled. This all happens in the Markdown reader!

With the smart extension:

````powershell
❯ quarto pandoc --from markdown --to native
pr--450
^Z
[ Para [ Str "pr\8211\&450" ] ]
````

Without the smart extension:

````powershell
❯ quarto pandoc --from markdown-smart --to native
pr--450
^Z
[ Para [ Str "pr--450" ] ]
````

Why does it happen with metadata fields? Because they are parsed as Markdown values by Pandoc. From https://pandoc.org/MANUAL.html#extension-yaml_metadata_block

> Metadata can contain lists and objects (nested arbitrarily), but **all string scalars will be interpreted as Markdown**.
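
To help read the native output above: `\8211` is the Haskell decimal escape for U+2013 (en dash), and `\&` is an empty separator that stops the following digits `450` from being read as part of the escape. The Python sketch below is only a rough approximation of the dash rule, not Pandoc's actual implementation, and `smart_dashes` is a hypothetical name. It shows why a rule that is blind to URLs also rewrites host names.

```python
EN_DASH = "\u2013"  # code point 8211, shown as \8211 in Pandoc's native output
EM_DASH = "\u2014"


def smart_dashes(text: str) -> str:
    """Roughly mimic the dash part of +smart: '---' to em dash, '--' to en dash.

    Illustration only; Pandoc's reader is a real parser, not a string replace.
    """
    # Replace the longer sequence first so '---' is not read as '--' plus '-'.
    return text.replace("---", EM_DASH).replace("--", EN_DASH)


if __name__ == "__main__":
    print(ord(EN_DASH))
    print(smart_dashes("pr--450") == "pr" + EN_DASH + "450")
    # The rule has no notion of URLs, so the host name is rewritten as well.
    print(smart_dashes("https://pr-350--quartodoc.netlify.app"))
```

A single hyphen, as in `pr-450`, is left untouched, which matches the second word in the HTML example.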

Related: a past issue where internally using the new pandoc-native raw block feature was the way to go.

mcanouil commented 3 weeks ago

If pandoc-native is the way, then I think the following part (and subsequent parts) of the codebase for href might need refactoring: