gohugoio / hugo

The world’s fastest framework for building websites.
https://gohugo.io
Apache License 2.0

Option to transliterate paths #9134

Open jmooring opened 2 years ago

jmooring commented 2 years ago

Background

You can configure Hugo to remove non-spacing marks from composite characters in content paths by enabling removePathAccents in the site configuration.

content/áéíñóú.md --> https://example.org/aeinou/
content/ÄÖÜäöüß.md --> https://example.org/aouaou%C3%9F/
content/çđħłƚŧ.md --> https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/

Removing the non-spacing marks has the desired effect in the first example, but it:

  1. Has no effect on non-composite characters (e.g., ß, ł, Ł)
  2. Is not language aware (e.g., for German, ä should become ae)

This issue has been raised a few times on the forum, and stale bot has closed three related issues that continue to receive comments:

Also:

Proposal

Provide an option to convert path characters from Unicode to ASCII, commonly called "transliteration".

For a site with English (en) as the default content language:

content/áéíñóú.md --> https://example.org/aeinou/
content/ÄÖÜäöüß.md --> https://example.org/AOUaouss/
content/çđħłƚŧ.md --> https://example.org/cdhllt/

For a site with German (de) as the default content language:

content/ÄÖÜäöüß.md --> https://example.org/AeOeUeaeoeuess/

Include a related template function so that you can access term pages:

{{ with site.GetPage (path.Join "tags" ("çđħłƚŧ äöü" | transliterate | anchorize)) }}
  <a href="{{ .RelPermalink }}">{{ .LinkTitle }}</a>
{{ end }}

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help. If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open. If this is a feature request, and you feel that it is still relevant and valuable, please tell us why. This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.

istr commented 1 year ago

Just a ping to keep this open. I still find this very useful and the implementation in #9135 looks sensible. There @bep stated:

I agree that we probably need something like this, but it needs to wait.

bep commented 7 months ago

I have read this issue.

Correct me if I'm wrong, but given how Hugo treats content paths, this is a cosmetic issue about the final URLs. The example above:

{{ with site.GetPage (path.Join "tags" ("çđħłƚŧ äöü" | transliterate | anchorize)) }}

Should now work fine as

{{ with site.GetPage "/tags/çđħłƚŧ äöü" }}

I don't think the first example would work at all.

I understand that people would want pretty URLs (that's what I use slug for), so that leaves:

  1. Has no effect on non-composite characters (e.g., ß, ł, Ł)
  2. Is not language aware (e.g., for German, ä should become ae)

I have not seen a (relatively) complete language-aware transliteration library (the one used in the referenced PR failed to transliterate my name in my language, which is also the language of this year's Nobel Prize winner in literature).

I think this would be easier to fix if we drop the second point above. Then we could also possibly get away using the existing setting.

jmooring commented 7 months ago

That's great! With v0.123.0-DEV we can now make the round trip when removePathAccents is true, so a template function to go the other direction is not required.

If we take language-specific behavior out of the equation, all we need is something that does the equivalent of:

iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß"   # aeinou cdhllt AOUaouss

And that could certainly live under the existing setting.

istr commented 7 months ago

@bep

I have not seen a (relatively) complete language-aware transliteration library (the one used in the referenced PR failed to transliterate my name in my language, which is also the language of this year's Nobel Prize winner in literature).

I think this would be easier to fix if we drop the second point above. Then we could also possibly get away using the existing setting.

Yes, this is why I suggested a different approach some time ago (let the user choose the mapping and make it a config entry). https://github.com/gohugoio/hugo/issues/3476#issuecomment-946094440

The proposal was based on this blog post with an idea of how it could be implemented. https://www.von-laufenberg.de/blog/it/hugo-umlaute/

The requirements for transliteration vary widely from use case to use case, so I think it would still be best not to rely on a (hardcoded) library, but to provide a versatile, configurable mapping, perhaps with some sensible internal defaults, or even with example mappings provided only in the documentation of the feature.

istr commented 7 months ago

I understand that people would want pretty URLs (that's what I use slug for), so that leaves:

This is a bit more than what slug can currently do. People want to generate search engine friendly URLs with a common transliteration for all generated URLs, including taxonomy and tags. I still don't see how this can be done with slug or removePathAccents alone.

I think this issue, and all the other related issues in the description, are specifically about transliterating/mangling the final URL with a mapping function, either using a hard-coded common transliteration or using a generic mapping. So the focus of all these questions is on the second point above (having a working mapping). Once you have that, you can easily work around the first problem (a very specific mapping function does not cover all expected cases).

A generic filtering/mapping function that could be hooked into the final stage of URL generation would be sufficient to handle all these use cases.

Although it might be technically easier to fix only half of them, it would not solve the problem or address the intended use cases.

So, in my opinion the most important part of the equation would be

If we take language-specific behavior out of the equation, all we need is something that does the equivalent of:

iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß"   # aeinou cdhllt AOUaouss

And that could certainly live under the existing setting.

which renders correctly with iconv, given you use the locale that contains the target language:

env | grep LC ; iconv -f utf-8 -t ascii//TRANSLIT <<< "áéíñóú çđħłƚŧ ÄÖÜäöüß"
LC_CTYPE=de_DE.UTF-8
aeinou cdhllt AEOEUEaeoeuess

(note that iconv creates Ä -> AE, Ö -> OE ... in that case).

LC_CTYPE=nn_NO.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT <<< "Bjørn Erik Pedersen"
Bjoern Erik Pedersen
LC_CTYPE=nb_NO.UTF-8 iconv -f utf-8 -t ascii//TRANSLIT <<< "Bjørn Erik Pedersen"
Bjoern Erik Pedersen

(note that it has the expected output as per https://github.com/gohugoio/hugo/pull/11246#issuecomment-1927230891, both for Nynorsk and Bokmål, but I did not expect any differences between the two, to be honest)

bep commented 7 months ago

People want to generate search engine friendly URLs with a common transliteration for all generated URLs, including taxonomy and tags.

Are you sure the search engines care about perfect transliteration? I suspect Google happily reads this:

https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/

Which I guess is what we have today.

istr commented 7 months ago

Perhaps (hopefully) the use case for transliterating path segments or URLs is obsolete, or will be soon.

However, the number of comments on all these issues and the activity on the forum around this topic suggests otherwise. Random examples: https://discourse.gohugo.io/t/cyrillic-aware-slugify-function/27578/6, https://discourse.gohugo.io/t/replace-characters/43327/4 and linked threads.

At least as a human, I can (sort of) read https://example.org/cdhllt, but not (yet) https://example.org/c%C4%91%C4%A7%C5%82%C6%9A%C5%A7/. If it is printed somewhere, I have no problem typing cdhllt, a very hard time typing the second form, and highly doubt I would get çđħłƚŧ right. So there are still valid use cases for it.

However, it might also be an option to actively discourage users from using transliteration and point them to full UTF-8 support.

bep commented 7 months ago

OK, I have 2 concerns here:

  1. Maintenance; I don't want to maintain another language package or have to answer questions like "why doesn't this transliterate correctly in language x"
  2. Speed.

The package we currently use to remove accents has an API like below:

package main

import (
	"fmt"

	"golang.org/x/text/runes"
	"golang.org/x/text/transform"
	"golang.org/x/text/unicode/norm"
)

func main() {
    chain := transform.Chain(
        norm.NFD,
        runes.Map(func(r rune) rune {
            switch r {
            case 'ą':
                return 'a'
            case 'ć':
                return 'c'
            case 'ę':
                return 'e'
            case 'ł':
                return 'l'
            case 'ń':
                return 'n'
            case 'ó':
                return 'o'
            case 'ś':
                return 's'
            case 'ż':
                return 'z'
            case 'ź':
                return 'z'
            case 'ø':
                return 'o'
            }
            return r
        }),
        norm.NFC,
    )
    s, _, _ := transform.String(chain, "Bjørn Erik Pedersen")
    fmt.Println(s) // Works for me.
}

If we accept that the transliteration is a simple rune -> rune we could probably

  1. Create a sensible default set.
  2. Add an option to add (or) replace this per language. But I think we need to somehow avoid doing map lookups.

This is me thinking out loud.

jmooring commented 7 months ago

I was curious if there's a CLDR table...

[screenshot: CLDR transliteration chart]

Disabled temporarily. And there's probably a good reason for that.

istr commented 7 months ago

OK, I have 2 concerns here:

  1. Maintenance; I don't want to maintain another language package or have to answer questions like "why doesn't this transliterate correctly in language x"

I agree, which is why I would go with a more generic option, see my other comments.

  2. Speed.

I agree as well. This is one of hugo's biggest USPs, so it is better not to sacrifice it for features.

The package we currently use to remove accents has an API like below: [full code example quoted above]

If we accept that the transliteration is a simple rune -> rune we could probably

  1. Create a sensible default set.
  2. Add an option to add (or) replace this per language. But I think we need to somehow avoid doing map lookups.

This is me thinking out loud.

From my point of view, it would be sufficient to simply expose the mapping in this function to config. So just replace the hardcoded switch statement with a configurable mapping that is configurable per target language.

Everything else could be left to the user, so they clearly know that it is up to them to provide the mapping they need.

Personally, I would bet a lot on the claim that a simple per-language configurable rune -> rune mapping will do the trick and make a lot of users happy.

EDIT: note, however, that the target would need to be multi-character to support the ä -> ae (German) and ж -> zh (Cyrillic) use cases, and the source would need to be multi-character to support both Unicode encodings of accented characters (precomposed and decomposed).

istr commented 7 months ago

But I think we need to somehow avoid doing map lookups.

Are you sure this would make a performance difference compared to the hard-coded switch statement? Hopefully the Go implementation is close to O(1) for maps of this (presumably small) size.

bep commented 7 months ago

@istr no, you are right, for our use case, they seem to perform exactly the same: https://github.com/gohugoio/hugo/pull/11998

bep commented 7 months ago

OK, I have searched around a little more, and my current take on this is:

jmooring commented 7 months ago

TL;DR: I recommend deferring this indefinitely, pending demand.

This started with the addition of the removePathAccents option, motivated (as far as I can tell) by a desire for URL compatibility with other systems (e.g., when migrating from Jekyll, Drupal, etc.).

But then you couldn't get to the term page with any of these:

{{ with "áéíñóú" }}
  {{ (site.Taxonomies.tags.Get .).Page.RelPermalink }}
  {{ (index site.Taxonomies.tags .).Page.RelPermalink }}
  {{ (site.GetPage (printf "/tags/%s" .)).RelPermalink }}
{{ end }}

And that generated some noise, in the Academic/Wowchemy/HugoBlox world in particular, despite the introduction of .Page.GetTerms a few years later, which covers the majority of the use cases. I still see themes doing it the hard way instead of using .Page.GetTerms.

The inability to get back to the term page was the primary driver for creating this issue; that problem became irrelevant with v0.123.0.

And then came the desire to have "accents" removed from non-composite characters, which is impossible, because they are not composite characters. I'm not sure if this desire was driven by compatibility requirements, aesthetic preference, or just a lack of understanding (e.g., "It's broken. It's not removing my accents.").

So that means transliteration. But as soon as you open that box, it needs to be language specific.

In my view there is insufficient "compatibility" or "aesthetic" demand to pursue this at the moment. The changes in v0.123.0 solved the initial problem, and actually solved another one in this area as well... all three work great:

{{ with "tag c" }}
  {{ (site.Taxonomies.tags.Get .).Page.RelPermalink }}
  {{ (index site.Taxonomies.tags .).Page.RelPermalink }}
  {{ (site.GetPage (printf "/tags/%s" .)).RelPermalink }}
{{ end }}

idarek commented 7 months ago

As @jmooring mentioned: "deferring this indefinitely pending demand."

My case was #7542, but I am not crying about it; I have learned to live with it. There are other, more demanding things that I think are worth spending time on. Unless a simple solution is found, I agree that deferring this will be the best approach. There are some good ideas in this issue, but it comes down to how much time can be spent on this compared to the other needs of users (me included).

jmooring commented 6 months ago

As a data point related to aesthetically pleasing URLs: Wikipedia doesn't seem to consider this important.

In the browser's address bar you see this:

https://de.wikipedia.org/wiki/Straußwirtschaft

When you cut/paste the URL and copy it into an email (for example):

https://de.wikipedia.org/wiki/Strau%C3%9Fwirtschaft

Hugo's current behavior is identical. I'm inclined to remove "aesthetically pleasing URLs" as a reason to pursue this, leaving only compatibility with other systems that transliterate (e.g., Drupal, where transliteration is disabled by default).

istr commented 6 months ago

@jmooring Nice Wikipedia entry, I love to go to a Straußwirtschaft (aka Besenwirtschaft) in late summer.

I don't follow your argument here though. "We don't have a use case because {fill in any big Internet player here} doesn't care" is not a plausible argument. In fact, it is a fallacy (ad populum). Using the same fallacy, I could argue the opposite: transliteration is standardised, so we have a use case. See https://en.wikipedia.org/wiki/List_of_ISO_romanizations. Or that even the Serbian government provides a transliteration of its website (https://www.srbija.gov.rs/, select "Latinica").

I would consider both arguments invalid because they ignore the context. We have a use case for transliteration only because many Hugo users (including myself), for various reasons and repeatedly over a long period of time, seem to have a use case that is expressed in several GitHub issues and forum posts.

One could argue that "aesthetically pleasing URLs" is not a valid use case to begin with. But there are many other valid use cases, such as the (common) Cyrillic romanisation use case mentioned above, which was raised by a real Hugo user in a forum post.

jmooring commented 6 months ago

Not an argument, just a...

data point