gohugoio / hugo

The world’s fastest framework for building websites.
https://gohugo.io
Apache License 2.0

Expose `hasCJKLanguage` as a Site Variable to enable multilingual search #6374

Closed gcushen closed 4 years ago

gcushen commented 5 years ago

Background

Currently, Hugo's config has a hasCJKLanguage setting, and Hugo pages can set isCJKLanguage, to mark a site or page as using Chinese, Japanese, or Korean characters.

The Problem

To implement multilingual frontend search in Hugo, we must use a different search configuration for CJK and non-CJK content. However, these parameters are currently not exposed to themes, so there is no variable or function that returns whether a site or page has CJK content.

The Solution

Expose the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables.

For site search, hasCJKLanguage will be the most important parameter to expose.

bep commented 5 years ago

> For site search, hasCJKLanguage will be the most important parameter to expose.

Do you understand that this is the "site owner" telling Hugo that the site may contain CJK language? It's a performance optimization which means we don't have to look at the content for non-CJK sites (e.g. my site in Norwegian).

What search engines are we talking about here? I would be surprised if the big ones do not have this as part of their indexing routine.

gcushen commented 5 years ago

@bep there's a misunderstanding here: I am not referring to external search providers.

I am referring to the fact that implementing multilingual JS-based search in Hugo, for example with a custom JS search algo or Fuse.js, requires knowing whether or not the site contains a CJK language.

A JS search algo for a CJK site requires different params to a non-CJK site since the text analysis part of the algo must treat the characters differently.

Hence, exposing the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables would resolve this :)
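To illustrate why the text analysis step differs, here is a minimal sketch in Go (chosen because it is Hugo's own language; a client-side version would follow the same idea in JS). It assumes a simple scheme in which non-CJK text is split on whitespace and CJK text is split into overlapping two-character bigrams. The `Tokenize` helper and the bigram strategy are assumptions made for illustration only; they are not part of Hugo or Fuse.js.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cjkRe matches a single Han, Hangul, Hiragana, or Katakana character.
var cjkRe = regexp.MustCompile(`\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}`)

// Tokenize is a hypothetical helper that splits text into search tokens.
// When hasCJK is false, text is split on whitespace only.
// When hasCJK is true, any whitespace-separated field that contains a CJK
// character is split into overlapping two-character bigrams, because CJK
// text does not put spaces between words.
func Tokenize(text string, hasCJK bool) []string {
	if !hasCJK {
		return strings.Fields(text)
	}
	var tokens []string
	for _, field := range strings.Fields(text) {
		runes := []rune(field)
		if len(runes) < 2 || !cjkRe.MatchString(field) {
			tokens = append(tokens, field)
			continue
		}
		for i := 0; i+1 < len(runes); i++ {
			tokens = append(tokens, string(runes[i:i+2]))
		}
	}
	return tokens
}

func main() {
	fmt.Println(Tokenize("static site generator", false))
	fmt.Println(Tokenize("静的サイト generator", true))
}
```

With the flag off, the Japanese phrase would stay a single opaque token, so a query for a word inside it would not match; that is the kind of difference a theme needs the flag for.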

gcushen commented 4 years ago

@bep any thoughts on exposing the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables in order to enable Hugo sites to have in-built multilingual search (via in-built JS algo) without requiring Hugo site admins to use external multilingual search providers?

Assume, for example, a site with content in both Chinese and English. Unless the Hugo template (and thus the in-built JS search algo) knows that a CJK language is present, the text analysis part of the search will fail or be inaccurate, because the characters must be treated very differently.

peaceiris commented 4 years ago

How about detecting your language by .Site.Language.Lang?

gcushen commented 4 years ago

> How about detecting your language by .Site.Language.Lang?

So you are implying that hasCJKLanguage is an unnecessary option that should be removed by the Hugo team, who should then determine if a site contains CJK by examining .Site.Language.Lang?

There are a couple of points here:

  1. Why hasn't the Hugo team implemented what you implied? I'm assuming because automatic CJK detection may not be trivial or performance-friendly to implement.
  2. Language logic ideally belongs in Hugo itself, where CJK logic is already implemented but not exposed to templates. All I'm asking is for it to be exposed as a param.
  3. A multilingual site can set its language to en but also set hasCJKLanguage = true due to the presence of CJK characters in content. From a search perspective, the key thing is to know that hasCJKLanguage = true, which we wouldn't know if we just fetched the 'lang' param.

peaceiris commented 4 years ago

Thank you @gcushen, I understand your explanation now and found the following lines.

https://github.com/gohugoio/hugo/blob/bd98182dbde893a8a809661c70633741bbf63911/hugolib/page__meta.go#L579-L587

If hasCJKLanguage is true, Hugo already detects CJK by cjkRe.Match(). As bep said, it's a performance optimization.
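As a rough illustration of the behaviour described here, the following Go sketch gates a regex scan behind a site-level flag. It is not a copy of the linked Hugo source: the `IsCJK` function and its signature are hypothetical, and the pattern is an assumption about what a check like cjkRe could look like (Go's regexp package does support these Unicode script classes).

```go
package main

import (
	"fmt"
	"regexp"
)

// cjkRe is an assumed pattern: any Han, Hangul, Hiragana, or Katakana character.
var cjkRe = regexp.MustCompile(`\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}`)

// IsCJK is a hypothetical stand-in for the per-page check.
// siteHasCJK models the hasCJKLanguage site setting: when it is false,
// the content is never scanned, which is the performance optimization.
func IsCJK(siteHasCJK bool, content []byte) bool {
	if !siteHasCJK {
		return false
	}
	return cjkRe.Match(content)
}

func main() {
	fmt.Println(IsCJK(false, []byte("你好"))) // false: site flag off, no scan
	fmt.Println(IsCJK(true, []byte("你好")))  // true: Han characters found
	fmt.Println(IsCJK(true, []byte("Hei")))  // false: scanned, nothing found
}
```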

If we could get .IsCJKLanguage as a page parameter, like .IsHome, that would be convenient.

bep commented 4 years ago

@peaceiris IsCJKLanguage is not directly connected to .Site.Language.Lang.

I'm hesitant about exposing these settings without some thought, because that would put a restriction on us in the future if we found a smarter way to do this. The site setting is just a performance thing (if we know there is no CJK in the site, we don't need to look), so the relevant setting to expose would be .Page.IsCJKLanguage. But before doing that, I would like to be pointed to a search engine that does not support this out of the box without being told so. I assume that any search indexing would mean tokenizing text, so you would have plenty of time to determine if it's CJK or not.

gcushen commented 4 years ago

@bep I mentioned a use case (Fuse.js, or custom client-side JS search engine) for this in a previous comment above (https://github.com/gohugoio/hugo/issues/6374#issuecomment-537154042). Essentially, parameters to client-side JS search can vary depending on the presence of CJK content in the index.

Sure, we could perform the same CJK regex from the Hugo Go code in the client-side JS, but (a) it's duplicating work rather than reusing the awesome functionality which you have already implemented in Hugo 😄, and (b) for performance reasons, it would be preferable to determine CJK in the Hugo template at build time rather than later on the client side 🚀.
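To make point (b) concrete, here is a hedged Go sketch of the build-time idea: the CJK check runs once while the search index is generated, and the result is stored as a flag on each entry, so the client only has to read it. The `Page` and `IndexEntry` types, the `isCJK` JSON field, and both functions are hypothetical names for illustration; Hugo does not provide them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// cjkRe is an assumed pattern: any Han, Hangul, Hiragana, or Katakana character.
var cjkRe = regexp.MustCompile(`\p{Han}|\p{Hangul}|\p{Hiragana}|\p{Katakana}`)

// Page is a hypothetical, simplified stand-in for a site page.
type Page struct {
	Title   string
	Content string
}

// IndexEntry is a hypothetical search index record carrying a precomputed flag.
type IndexEntry struct {
	Title   string `json:"title"`
	Content string `json:"content"`
	IsCJK   bool   `json:"isCJK"`
}

// BuildEntries runs the CJK check once per page at build time.
func BuildEntries(pages []Page) []IndexEntry {
	entries := make([]IndexEntry, 0, len(pages))
	for _, p := range pages {
		entries = append(entries, IndexEntry{
			Title:   p.Title,
			Content: p.Content,
			IsCJK:   cjkRe.MatchString(p.Content),
		})
	}
	return entries
}

// BuildIndex serializes the entries to JSON for a client-side search script.
func BuildIndex(pages []Page) ([]byte, error) {
	return json.Marshal(BuildEntries(pages))
}

func main() {
	out, err := BuildIndex([]Page{
		{Title: "About", Content: "A site about Hugo"},
		{Title: "关于", Content: "一个关于 Hugo 的网站"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

A client-side script could then choose its search parameters per entry, or for the whole index if any entry has the flag set, without running its own regex.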

Most multilingual users are perhaps using themes with no search or an external search like Algolia so probably haven't really considered the above points. Currently, Academic has integrated client-side search and this is my main reason for raising this issue to offer better multilingual search results. As Hugo moves more into the JS world, I assume others will follow and there will be more momentum with this kind of thing...

bep commented 4 years ago

OK, but we agree that this is limited to Page.IsCJKLanguage?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. The resources of the Hugo team are limited, and so we are asking for your help. If this is a bug and you can still reproduce this error on the master branch, please reply with all of the information you have about it in order to keep the issue open. If this is a feature request, and you feel that it is still relevant and valuable, please tell us why. This issue will automatically be closed in the near future if no further activity occurs. Thank you for all your contributions.
