Closed gcushen closed 4 years ago
For site search, hasCJKLanguage will be the most important parameter to expose.
Do you understand that this is the "site owner" telling Hugo that the site may contain CJK language? It's a performance optimization that saves us from having to look at the content of non-CJK sites (e.g. my site in Norwegian).
What search engines are we talking about here? I would be surprised if the big ones do not have this as part of their indexing routine.
@bep there's a misunderstanding there, I am not referring to external search providers.
I am referring to the fact that implementing multilingual JS based search, for example a custom JS search algo or Fuse.js, in Hugo requires knowing whether or not the site contains a CJK language.
A JS search algo for a CJK site requires different params to a non-CJK site since the text analysis part of the algo must treat the characters differently.
Hence, exposing the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables would resolve this :)
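To illustrate the point about differing parameters, here is a minimal sketch of how a theme might pick search settings once it knows whether CJK content is present. The `SearchOptions` shape and the `searchOptionsFor` helper are hypothetical; the field names mirror options that Fuse.js documents, but the values are illustrative assumptions rather than tuned recommendations.

```typescript
// Hypothetical option shape for a client-side fuzzy search.
// Field names follow Fuse.js option names; values are illustrative only.
interface SearchOptions {
  threshold: number;          // lower = stricter matching
  minMatchCharLength: number; // shortest match that counts as a hit
  ignoreLocation: boolean;    // whether match position in the text is ignored
}

// Pick a configuration based on a build-time CJK flag (assumed to be
// passed in from the template, e.g. the value this issue asks to expose).
function searchOptionsFor(hasCJK: boolean): SearchOptions {
  if (hasCJK) {
    // CJK words are often one or two characters long, so very short
    // matches must be allowed and matching should be stricter.
    return { threshold: 0.1, minMatchCharLength: 1, ignoreLocation: true };
  }
  // Space-delimited languages tolerate fuzzier matching on longer tokens.
  return { threshold: 0.3, minMatchCharLength: 2, ignoreLocation: false };
}

console.log(searchOptionsFor(true));
console.log(searchOptionsFor(false));
```

The only point of the sketch is that the branch needs a boolean that the template does not currently have access to.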
@bep any thoughts on exposing the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables in order to enable Hugo sites to have in-built multilingual search (via in-built JS algo) without requiring Hugo site admins to use external multilingual search providers?
Assume a site with content in both Chinese and English, for example. Without the Hugo template (and thus the in-built JS search algo) knowing that a CJK language is present, the text analysis part of the search will fail or be inaccurate, because the characters must be treated very differently.
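As a rough illustration of why the text analysis differs, the sketch below tokenizes on whitespace for non-CJK text and falls back to overlapping character bigrams for CJK text, which has no spaces between words. The `tokenize` function and its `isCJK` parameter are hypothetical names, and bigrams are just one common approach, not what any particular theme or library necessarily does.

```typescript
// Tokenize text for a search index.
// isCJK is assumed to come from a build-time flag such as the one requested here.
function tokenize(text: string, isCJK: boolean): string[] {
  // Split on whitespace first in both modes and drop empty chunks.
  const chunks = text.toLowerCase().split(/\s+/).filter((c) => c.length > 0);
  if (!isCJK) {
    return chunks;
  }
  const tokens: string[] = [];
  for (const chunk of chunks) {
    // split("") works per UTF-16 code unit; characters outside the
    // Basic Multilingual Plane would need extra care.
    const chars = chunk.split("");
    if (chars.length < 2) {
      tokens.push(chunk);
      continue;
    }
    // Emit overlapping two-character windows.
    for (let i = 0; i < chars.length - 1; i++) {
      tokens.push(chars[i] + chars[i + 1]);
    }
  }
  return tokens;
}

console.log(tokenize("Static site generator", false)); // ["static", "site", "generator"]
console.log(tokenize("静态网站", false));               // ["静态网站"]
console.log(tokenize("静态网站", true));                // ["静态", "态网", "网站"]
```

With whitespace tokenization the Chinese phrase stays a single token, so a query for 网站 would only match if the engine does substring or fuzzy matching; with bigrams it becomes a direct token hit.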
How about detecting your language by .Site.Language.Lang?
So you are implying that hasCJKLanguage is an unnecessary option that should be removed by the Hugo team, who should then determine if a site contains CJK by examining .Site.Language.Lang?
There are a couple of points here:
en but also set hasCJKLanguage = true due to the presence of CJK characters in content. From a search perspective, one of the key things here is to know hasCJKLanguage = true, which otherwise wouldn't be known if we just fetched the 'lang' param.
Thank you @gcushen, I was able to understand your explanation and found the following lines.
If hasCJKLanguage is true, Hugo already detects CJK by cjkRe.Match(). As bep said, it's a performance optimization.
If we could get .IsCJKLanguage as a page variable, like .IsHome, that would be convenient.
@peaceiris IsCJKLanguage is not directly connected to .Site.Language.Lang.
I'm hesitant about exposing these settings without some thought, because that would put a restriction on us in the future if we found a smarter way to do this. The site setting is just a performance thing (if we know there is no CJK in the site, we don't need to look), so the relevant setting to expose would be .Page.IsCJKLanguage. But before doing that, I would like to be pointed to a search engine that does not support this out of the box without being told so. I assume that any search indexing means tokenizing text, so you would have plenty of time to determine if it's CJK or not.
@bep I mentioned a use case (Fuse.js, or custom client-side JS search engine) for this in a previous comment above (https://github.com/gohugoio/hugo/issues/6374#issuecomment-537154042 ). Essentially, parameters to client-side JS search can vary depending on the presence of CJK content in the index.
Sure, we could perform the same CJK regex from the Hugo Go code in the client-side JS, but (a) it's duplicating work rather than reusing the awesome functionality which you have already implemented in Hugo 😄, and (b) for performance reasons, it would be preferable to determine CJK in the Hugo template at build time rather than later on the client side 🚀.
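For reference, the duplicated client-side check would look roughly like the sketch below. It assumes the relevant scripts are Han, Hangul, Hiragana and Katakana; the exact pattern Hugo uses in cjkRe should be checked against the Hugo source, and `containsCJK` is a hypothetical helper name.

```typescript
// Assumed set of scripts that count as CJK; verify against Hugo's cjkRe.
// Built with the RegExp constructor and the "u" flag so that Unicode
// script property escapes are available.
const cjkPattern = new RegExp(
  "[\\p{Script=Han}\\p{Script=Hangul}\\p{Script=Hiragana}\\p{Script=Katakana}]",
  "u"
);

// Returns true if the text contains at least one character from those scripts.
function containsCJK(text: string): boolean {
  return cjkPattern.test(text);
}

console.log(containsCJK("Hugo er rask"));  // false
console.log(containsCJK("Hugo は速い"));    // true
console.log(containsCJK("한국어 문서"));     // true
```

Running this over every indexed page in the browser is the repeated work that a build-time flag would avoid.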
Most multilingual users are perhaps using themes with no search or an external search like Algolia so probably haven't really considered the above points. Currently, Academic has integrated client-side search and this is my main reason for raising this issue to offer better multilingual search results. As Hugo moves more into the JS world, I assume others will follow and there will be more momentum with this kind of thing...
OK, but we agree that this is limited to Page.IsCJKLanguage?
Background
Currently, Hugo config has a hasCJKLanguage setting and Hugo pages can have isCJKLanguage to define a site or page as using Chinese, Japanese, or Korean characters.
The Problem
To implement multilingual frontend search in Hugo, we must use a different search configuration for CJK and non-CJK content. However, currently these parameters are not exposed to themes so there is no variable or function to return whether a site or page has CJK content.
The Solution
Expose the site (hasCJKLanguage) and page (isCJKLanguage) parameters as site and page variables.
For site search, hasCJKLanguage will be the most important parameter to expose.