quarkusio / search.quarkus.io

Search backend for Quarkus websites
Apache License 2.0

Localized sites support #33

Closed ynojima closed 11 months ago

ynojima commented 1 year ago

quarkus.io has localized sites (es.quarkus.io, pt.quarkus.io, cn.quarkus.io, and ja.quarkus.io). If search.quarkus.io supported queries against the localized sites, it would be really helpful for users visiting them.

For ja.quarkus.io, the https://github.com/quarkusio/ja.quarkus.io repository contains .adoc.po files under the https://github.com/quarkusio/ja.quarkus.io/tree/main/l10n/po/ja_JP directory. The path of each .adoc.po file corresponds to the path of the original .adoc file in the upstream https://github.com/quarkusio/quarkusio.github.io repository. The .po files contain entries pairing the original text with the localized text. If an entry is marked "fuzzy", it has not been reviewed by a human and is not published to the localized site, so the original English text should be indexed instead.

For example, since the following entry is marked "fuzzy", https://github.com/quarkusio/ja.quarkus.io/blob/main/l10n/po/ja_JP/_versions/3.2/guides/telemetry-micrometer.adoc.po#L15-L20 the original text is published to the localized site: https://ja.quarkus.io/version/3.2/guides/telemetry-micrometer

To load .po files, "org.fedorahosted.tennera:jgettext" may be a good library candidate. https://central.sonatype.com/artifact/org.fedorahosted.tennera/jgettext
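
A minimal sketch of the fuzzy-fallback rule described above, in Java. It uses a hand-rolled, deliberately incomplete .po reader rather than jgettext, so none of the names below (PoFuzzyFallback, Entry, textToIndex) come from that library or from search.quarkus.io. Plural forms and message contexts are ignored, and treating an empty msgstr as "not translated" is an extra assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (Java 16+ for records): not the jgettext API, not a full gettext parser. */
final class PoFuzzyFallback {

    /** One .po entry: original text, localized text, and whether it is flagged "fuzzy". */
    record Entry(String msgid, String msgstr, boolean fuzzy) {
        /** Index the translation only if it is reviewed (not fuzzy) and non-empty. */
        String textToIndex() {
            return (fuzzy || msgstr.isEmpty()) ? msgid : msgstr;
        }
    }

    /** Parses entries separated by blank lines; the header entry (empty msgid) is skipped. */
    static List<Entry> parse(String po) {
        List<Entry> entries = new ArrayList<>();
        StringBuilder msgid = null;
        StringBuilder msgstr = null;
        StringBuilder current = null; // target of "..." continuation lines
        boolean fuzzy = false;
        for (String raw : po.split("\n")) {
            String line = raw.strip();
            if (line.isEmpty()) {
                flush(entries, msgid, msgstr, fuzzy);
                msgid = null;
                msgstr = null;
                current = null;
                fuzzy = false;
            } else if (line.startsWith("#,")) {
                fuzzy |= line.contains("fuzzy"); // flag line, e.g. "#, fuzzy"
            } else if (line.startsWith("#")) {
                // other comments (translator notes, source references) are ignored
            } else if (line.startsWith("msgid ")) {
                msgid = new StringBuilder(unquote(line.substring(6)));
                current = msgid;
            } else if (line.startsWith("msgstr ")) {
                msgstr = new StringBuilder(unquote(line.substring(7)));
                current = msgstr;
            } else if (line.startsWith("\"") && current != null) {
                current.append(unquote(line));
            } else {
                current = null; // unsupported keyword (msgctxt, msgid_plural, ...)
            }
        }
        flush(entries, msgid, msgstr, fuzzy); // last entry may lack a trailing blank line
        return entries;
    }

    private static void flush(List<Entry> out, StringBuilder msgid, StringBuilder msgstr, boolean fuzzy) {
        if (msgid != null && msgstr != null && msgid.length() > 0) {
            out.add(new Entry(msgid.toString(), msgstr.toString(), fuzzy));
        }
    }

    /** Strips the surrounding quotes and resolves backslash escapes (\n, \t, \", \\). */
    private static String unquote(String s) {
        String t = s.strip();
        if (t.length() >= 2 && t.startsWith("\"") && t.endsWith("\"")) {
            t = t.substring(1, t.length() - 1);
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < t.length(); i++) {
            char c = t.charAt(i);
            if (c == '\\' && i + 1 < t.length()) {
                char next = t.charAt(++i);
                out.append(next == 'n' ? '\n' : next == 't' ? '\t' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

Calling PoFuzzyFallback.parse(...) on the content of an .adoc.po file and then textToIndex() on each entry would yield the English text for fuzzy entries and the localized text otherwise.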

yrodiere commented 1 year ago

Hey, thanks for all the pointers.

Considering the relative complexity of loading the translations, I'm wondering if we shouldn't handle #30 first... Then we might not even have to care about what's translated and what is not, we'd just index the rendered guide?

The main issue would be with indexing metadata typically found in the Asciidoc source but not necessarily in the rendered content, such as :topics: or :categories:. But I'm not sure these get translated?

Am I right to believe the docs branch is where I should look for rendered pages?

ynojima commented 1 year ago

Oh, if you plan to index the rendered HTML, I agree with you that the switch should be done first.

:topics: and :categories:

Do you mean those in the YAML front-matter? It is not translated.

Am I right to believe the docs branch is where I should look for rendered pages?

Yes, the rendered site is saved in the docs directory of the docs branch. https://github.com/quarkusio/ja.quarkus.io/tree/docs/docs

yrodiere commented 1 year ago

Do you mean those in the YAML front-matter? It is not translated.

Yes. Alright, then indexing the HTML should work fine.

Yes, the rendered site is saved in the docs directory of the docs branch. https://github.com/quarkusio/ja.quarkus.io/tree/docs/docs

Gotcha, thanks.

yrodiere commented 11 months ago

@marko-bekhta some comments if you're going to work on this:

  1. Not all localized websites actually translate guides, so you'll probably be forced to treat each language differently. Some might simply rely on English data, some might have their own data.
  2. We'll now need to deal with multiple git repositories... please make sure to adapt QuarkusIOSample and the readme (in particular for development environments) accordingly. It will probably be a pain though :/
  3. I'm not sure quarkus.yaml gets translated, so... handling titles and summaries might be challenging
  4. Regarding the model, you'll need to make sure to index data for different languages in different fields, with different analyzers in particular.
    1. Your analyzer definitions will need to account for some parts of guides not being translated and staying in English...
    2. Regarding the entity Java code and mapping:
      1. The first question is whether you'll create one Guide instance per language or put everything in the same entity, using a dedicated data structure to account for internationalization (e.g. I18nData<T> with properties public T en; public T es; public T jp;). I don't know what's best.
      2. The second question is how you'll map that to different fields per language.
      3. You can do it all by hand: duplicate properties for each language, and map each property to a different field using dedicated annotations. Wouldn't recommend that.
      4. If you have one entity instance per language, you can use the AlternativeBinder.
      5. If you store the data of all languages in the same entity instance, you can try something more fun with a generic embeddable and some custom annotation, e.g.:
        @Embedded
        @I18nFullTextField(
            en = @Localization(analyzer = ...),
            es = @Localization(analyzer = ...),
            jp = @Localization(analyzer = ...)
        )
        I18nData<String> summary;
        @Embedded
        @I18nFullTextField(
            name = "fullContent",
            bridge = ...,
            en = @Localization(analyzer = ..., searchAnalyzer = ...),
            es = @Localization(analyzer = ..., searchAnalyzer = ...),
            jp = @Localization(analyzer = ..., searchAnalyzer = ...)
        )
        I18nData<InputProvider> fullContentUrl;

        The annotation processor would take care of explicitly mapping every sub-property of I18nData (en, es, jp, ...) to a dedicated per-language field, applying the localization metadata (per-language analyzer) and adding some prefix or suffix to the field name (summary_fr or fr.summary or whatever).
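
To make the last option a bit more concrete, here is a rough, self-contained sketch of what the annotations and the per-language field naming could look like. It is only an illustration under assumptions: it uses plain reflection instead of Hibernate Search's annotation processing, the analyzer names ("english", "spanish", "japanese") and the "_<lang>" suffix are made up, the bridge attribute is left out, and analyzersByField is a hypothetical helper, not an existing API.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of the I18nData / @I18nFullTextField idea; no Hibernate Search involved. */
final class I18nMappingSketch {

    /** Per-language settings; only usable as a member of another annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target({})
    @interface Localization {
        String analyzer();
        String searchAnalyzer() default "";
    }

    /** Declares one full-text field per language for an I18nData property. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface I18nFullTextField {
        String name() default ""; // defaults to the property name
        Localization en();
        Localization es();
        Localization jp();
    }

    /** Generic holder with one value per language, as suggested in the comment above. */
    static final class I18nData<T> {
        public T en;
        public T es;
        public T jp;
    }

    /** Example entity; the analyzer names are placeholders. */
    static final class Guide {
        @I18nFullTextField(
            en = @Localization(analyzer = "english"),
            es = @Localization(analyzer = "spanish"),
            jp = @Localization(analyzer = "japanese")
        )
        I18nData<String> summary = new I18nData<>();

        @I18nFullTextField(
            name = "fullContent",
            en = @Localization(analyzer = "english", searchAnalyzer = "english"),
            es = @Localization(analyzer = "spanish", searchAnalyzer = "spanish"),
            jp = @Localization(analyzer = "japanese", searchAnalyzer = "japanese")
        )
        I18nData<String> fullContentUrl = new I18nData<>();
    }

    /** Returns index field name -> analyzer, one entry per language, for the given property. */
    static Map<String, String> analyzersByField(Class<?> entity, String property) {
        Field field;
        try {
            field = entity.getDeclaredField(property);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No property '" + property + "' on " + entity, e);
        }
        I18nFullTextField mapping = field.getAnnotation(I18nFullTextField.class);
        if (mapping == null) {
            throw new IllegalArgumentException("'" + property + "' is not an internationalized field");
        }
        String base = mapping.name().isEmpty() ? property : mapping.name();
        Map<String, String> result = new LinkedHashMap<>();
        result.put(base + "_en", mapping.en().analyzer());
        result.put(base + "_es", mapping.es().analyzer());
        result.put(base + "_jp", mapping.jp().analyzer());
        return result;
    }
}
```

With this sketch, analyzersByField(Guide.class, "summary") yields the fields summary_en, summary_es and summary_jp, each paired with its own analyzer, while fullContentUrl is exposed as fullContent_en and so on because of the explicit name. A real implementation would register those fields with the search mapper rather than return a map.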