gohugoio / hugo

The world’s fastest framework for building websites.
https://gohugo.io
Apache License 2.0

Taxonomy page with duplicate entries #12191

Closed earthboundkid closed 7 months ago

earthboundkid commented 7 months ago

On my list of pages by topic, topics with commas or single quotes in the name are listed twice. One listing uses the appropriate _index.md file as metadata, and the other does not.

See #12174.

What version of Hugo are you using (hugo version)?

$ hugo version
hugo v0.123.7-312735366b20d64bd61bff8627f593749f86c964+extended darwin/arm64 BuildDate=2024-03-01T16:16:06Z VendorInfo=brew

Does this issue reproduce with the latest release?

Yes.

earthboundkid commented 7 months ago

If you take this

git clone --single-branch -b hugo-github-issue-12193 https://github.com/jmooring/hugo-testing hugo-github-issue-12193
cd hugo-github-issue-12193
rm -rf public && hugo && cat public/authors/index.html

And change "John Smith" in books-1.md and books-2.md to "Smith, John", this will reproduce.

jmooring commented 7 months ago

Author terms were also applied to the home page. So, 3 places. You changed only 2.

git clone --single-branch -b hugo-github-issue-12191 https://github.com/jmooring/hugo-testing hugo-github-issue-12191
cd hugo-github-issue-12191
hugo server

Works fine.

earthboundkid commented 7 months ago

@jmooring My site is still broken in 123. A failed attempt at making a minimal repro is not a reason to close the issue.

jmooring commented 7 months ago

I'll re-open, but we need to be able to reproduce the problem. If we can't reproduce the problem, the issue will be closed. The ball is in your court.

earthboundkid commented 7 months ago

The reproducer is to clone https://github.com/spotlightpa/poor-richard and build it with v122 and v123 and look at /investigations/. I can work on minimizing the reproduction next week.

jmooring commented 7 months ago

I am not able to reproduce the problem with duplicate entries (terms) listed on the investigations (series) taxonomy page, but there are three terms that are not behaving as desired, causing the same image to appear for each of the three entries. As shown in the content front matter of the news articles, the three terms are:

series = ["One Vote, Two Pennsylvanias"]
series = ["Shapiro's Promises"]
series = ["Unproven, Unsafe"]

These are the only terms (not titles) in the series taxonomy that contain punctuation. In the short term, the simplest solution is to remove the punctuation from the term assignments, as you have done for other terms that ultimately obtain their punctuated title from a term page (e.g., series = ["Short Days Big Benefits"] has punctuated title "Short Days, Big Benefits").

Longer term this isn't great if you punctuate terms. Not only because you need to punctuate the file path, but also because Windows disallows certain characters (* " / \ < > : | ?). Which means I can use the term "a > b" on Linux/Mac, but that's not going to work on Windows...


And I can't use the term "either/or" regardless of operating system if I want to be able to override the title or add metadata in a term page (e.g., content/tags/foo/_index.md). But I could not use "either/or" with v0.122.0 either, or things like "c#".
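As a rough illustration of the portability concern, here is a minimal Go sketch that flags terms containing any of the Windows-disallowed characters listed above. The helper name `portableAsDirName` is hypothetical and not part of Hugo; it checks only that character list, so it says nothing about other restrictions (for example, URL-related ones).

```go
package main

import (
	"fmt"
	"strings"
)

// windowsReserved holds the characters quoted above as disallowed on Windows.
const windowsReserved = `*"/\<>:|?`

// portableAsDirName reports whether term contains none of those characters.
// Hypothetical helper for illustration only; not a Hugo API.
func portableAsDirName(term string) bool {
	return !strings.ContainsAny(term, windowsReserved)
}

func main() {
	for _, term := range []string{"a > b", "either/or", "Unproven, Unsafe"} {
		fmt.Printf("%q portable=%v\n", term, portableAsDirName(term))
	}
}
```

A check like this could run in CI before a term is used as a directory name under content/.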

I've updated the test branch for this issue:

git clone --single-branch -b hugo-github-issue-12191 https://github.com/jmooring/hugo-testing hugo-github-issue-12191
cd hugo-github-issue-12191
hugo server

See related: https://discourse.gohugo.io/t/hugo-v0-123-seems-to-anchorize-terms-in-their-urls-rather-than-urlize-them/48583/5

jmooring commented 7 months ago

Setting punctuation aside for a moment (we need to figure out what to do there), note that with v0.123.7 you can do this:

{{ $series = (site.Taxonomies.series.Get .).Page }}

instead of this:

{{ $series = (site.Taxonomies.series.Get (lower .)).Page }}

and never do this:

{{ $series = (site.Taxonomies.series.Get (urlize .)).Page }}

See https://github.com/gohugoio/hugo/pull/12180.

jmooring commented 7 months ago

Possible approaches:

Documentation:

Taxonomy terms may contain Unicode letters, Unicode numbers, spaces, and any of the following characters:

  • _ (underscore)
  • - (hyphen)
  • . (period)
  • @ (at sign)
  • ~ (tilde)

Note that spaces and hyphens are equivalent, so these terms are equivalent:

Although these two terms have the same URL (collide), we cannot disallow hyphens or spaces due to prevalence in the wild.

To add other characters to the term title, create a term page at content/taxonomy/term/_index.md.

The inclusion list above prevents things like tags = ['ab','a,b','a:b'] where the term page URL will be the same in all three cases (i.e., /tags/ab)... this is the other dimension to the punctuation challenge which has been present since forever.
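The collision described above can be sketched in Go. Note the assumptions: `naiveSlug` is a hypothetical, deliberately simplified stand-in for URL generation (drop characters outside the inclusion list, lowercase, turn spaces into hyphens). It is not Hugo's actual path logic, which is not reproduced here.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// dropped matches every character outside the inclusion list above.
var dropped = regexp.MustCompile(`[^\pL\pN\s_\-\.@~]`)

// naiveSlug is a hypothetical simplification, not Hugo's implementation:
// it removes disallowed characters, lowercases, and joins words with hyphens.
func naiveSlug(term string) string {
	s := strings.ToLower(dropped.ReplaceAllString(term, ""))
	return strings.Join(strings.Fields(s), "-")
}

// collisions groups terms by slug and keeps only slugs shared by two or more terms.
func collisions(terms []string) map[string][]string {
	bySlug := map[string][]string{}
	for _, t := range terms {
		slug := naiveSlug(t)
		bySlug[slug] = append(bySlug[slug], t)
	}
	for slug, ts := range bySlug {
		if len(ts) < 2 {
			delete(bySlug, slug)
		}
	}
	return bySlug
}

func main() {
	fmt.Println(collisions([]string{"ab", "a,b", "a:b", "cd"}))
}
```

Under this simplification, "ab", "a,b", and "a:b" all land on the same slug, and "a b" and "a-b" do too, which mirrors the space/hyphen equivalence noted above.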

Documentation + error checking:

Same as above, but throw error too.

Either option above would make https://github.com/gohugoio/hugo/issues/8232 irrelevant.


I spent a lot of time looking at this before making the recommendation above. If you're interested in the details...

git clone --single-branch -b hugo-forum-topic-48638 https://github.com/jmooring/hugo-testing hugo-forum-topic-48638
cd hugo-forum-topic-48638
hugo server

You can efficiently validate a site's taxonomy terms with something like:

layouts/partials/validate-taxonomy-terms.html

```text
{{ if .IsHome }}
  {{ range $taxonomy, $_ := site.Taxonomies }}
    {{ range $term, $_ := . }}
      {{ if findRE `[^\pL\pN\s_\-\.@~]` $term }}
        {{ errorf `The term %q in taxonomy %q is invalid. Taxonomy terms may contain Unicode letters, Unicode numbers, spaces, and any of the following characters: "_", "-", ".", "@", and "~".` $term $taxonomy }}
      {{ end }}
    {{ end }}
  {{ end }}
{{ end }}
```

Call it from your base template; it runs once for each language.
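To try the pattern outside a Hugo build, the same regular expression can be exercised in a small Go program (Hugo's findRE uses Go's regexp syntax). This is only a sketch: the function name `invalidTerms` and the sample terms are illustrative assumptions, with the terms borrowed from this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidChar is the same pattern used in the partial above.
var invalidChar = regexp.MustCompile(`[^\pL\pN\s_\-\.@~]`)

// invalidTerms returns, in input order, the terms that contain at least one
// character outside the inclusion list. Hypothetical helper, not a Hugo API.
func invalidTerms(terms []string) []string {
	var bad []string
	for _, t := range terms {
		if invalidChar.MatchString(t) {
			bad = append(bad, t)
		}
	}
	return bad
}

func main() {
	terms := []string{
		"Short Days Big Benefits",
		"Shapiro's Promises",
		"Unproven, Unsafe",
		"r\u00e9sum\u00e9s", // accented letters are still letters
	}
	for _, t := range invalidTerms(terms) {
		fmt.Printf("invalid term: %q\n", t)
	}
}
```

Only the terms containing an apostrophe or a comma should be reported; the accented term passes because every character in it matches \pL.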


See this tips & tricks topic: https://discourse.gohugo.io/t/limit-use-of-punctuation-within-taxonomy-terms/48638

earthboundkid commented 7 months ago

From my perspective, this is a regression because there was no problem using commas and hyphens before. If taxonomy terms must be "letters" does this mean users who don't use the Roman alphabet can't use taxonomies? ISTM it would be a shame if in spite of all the work in supporting localized sites, you couldn't use one of Hugo's major features with non-English languages or even "résumés" and "cafés".

jmooring commented 7 months ago

@earthboundkid Your issue title is "Taxonomy page with duplicate entries." May I re-title this? Because I don't see any duplicate entries.

jmooring commented 7 months ago

@earthboundkid Everything in the string "résumés" is a letter. So this works as it always has:

---
tags: ['résumés']
---

content/
├── tags/
│   └── résumés/
│       └── _index.md
└── _index.md

earthboundkid commented 7 months ago

There are duplicate entries for the three terms with punctuation in them:

[Screenshots omitted]

This is from a list of $.Pages. It's not doing some fancy lookup of site.Taxonomy and having weird results. It's just looping through .Pages and getting the same entry twice, once with and once without the associated metadata.

The workaround fix is to rename the term folder in the series folder from the urlized version to the full version (e.g., rename /series/shapiros-promises to "/series/Shapiro's Promises").

Term canonicalization has always been a sort of sticky issue. I seem to remember early versions of Hugo would just lowercase all tag names, even ones that were names of people. That's been fixed for a long time though. Since the URL for the term page is at the urlized version anyway, I don't see why Hugo is using the lowercase name as the canonical name for the term.

A trickier question is let's say there's some language where "Aa" means "hat" and "Äa" means "coat" and you want to have terms for both of them. In that case, yes, Hugo should probably distinguish a folder named "Aa" from a folder named "Äa", but I think the algorithm should be something like:

earthboundkid commented 7 months ago

The workaround fix is to rename the term folder in the series folder from the urlized version to the full version (e.g., rename /series/shapiros-promises to "/series/Shapiro's Promises").

This workaround isn't compatible with Hugo 122 and creates duplicates in that version. :-/

jmooring commented 7 months ago

Regarding v0.122.0 compatibility... we've had this discussion. This was an intentional breaking change. While you may not agree with it, the change stands.

Also, while you may not agree with it, I am closing this issue. Please create a separate issue if you wish to propose new functionality.

github-actions[bot] commented 6 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.