ultrabug / mkdocs-static-i18n

MkDocs i18n plugin using static translation markdown files
https://ultrabug.github.io/mkdocs-static-i18n/
MIT License
228 stars 38 forks source link

fix sitemap entries #205

Closed kamilkrzyskow closed 1 year ago

kamilkrzyskow commented 1 year ago

Fixes #194 Fixed the formatting of the sitemap file together with the corrections to abide by Google's rules.

ultrabug commented 1 year ago

@kamilkrzyskow thanks a lot for your contribution

I'm under the impression that the <loc> is actually wrong after applying your PR, the default URL <loc> is gone and only the localized ones are present. If I read @jonaharagon's issue comment, we should get the three of them in a separate <url> block.

I get this diff:

--- site/sitemap.xml    2023-03-20 18:46:14.867879295 +0100
+++ /tmp/sitemap.xml    2023-03-20 18:45:18.755969331 +0100
@@ -1,27 +1,45 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
     <url>
-         <loc>https://ultrabug.github.io/mkdocs-static-i18n/</loc>
-         <lastmod>2023-03-20</lastmod>
-         <changefreq>daily</changefreq>
-         <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
-         <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
-         
-    </url>
-    <url>
-         <loc>https://ultrabug.github.io/mkdocs-static-i18n/topic1/named_file/</loc>
-         <lastmod>2023-03-20</lastmod>
-         <changefreq>daily</changefreq>
-         <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic1/named_file/"/>
-         <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic1/named_file/"/>
-         
-    </url>
-    <url>
-         <loc>https://ultrabug.github.io/mkdocs-static-i18n/topic2/</loc>
-         <lastmod>2023-03-20</lastmod>
-         <changefreq>daily</changefreq>
-         <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic2/"/>
-         <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic2/"/>
-         
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/en/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
+    </url>
+    <url>
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/fr/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
+    </url>
+    <url>
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/en/topic1/named_file/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic1/named_file/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic1/named_file/"/>
+    </url>
+    <url>
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/fr/topic1/named_file/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic1/named_file/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic1/named_file/"/>
+    </url>
+    <url>
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/en/topic2/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic2/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic2/"/>
+    </url>
+    <url>
+        <loc>https://ultrabug.github.io/mkdocs-static-i18n/fr/topic2/</loc>
+        <lastmod>2023-03-20</lastmod>
+        <changefreq>daily</changefreq>
+        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/topic2/"/>
+        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/topic2/"/>
     </url>
 </urlset>
kamilkrzyskow commented 1 year ago

Interesting that we got different insight on it 🤔I get what you mean. I'll look into that. Based on the Google's docs I assumed that each of the hreflang links just needs to have their own <url>

Also my own page uses the default language and I got a default path in the sitemap.xml too 🤔

<url>
    <loc>https://gothic-modding-community.github.io/gmc/</loc>
    <lastmod>2023-03-20</lastmod>
    <changefreq>daily</changefreq>
    <xhtml:link rel="alternate" hreflang="en" href="https://gothic-modding-community.github.io/gmc/"/>
    <xhtml:link rel="alternate" hreflang="pl" href="https://gothic-modding-community.github.io/gmc/pl/"/>
    <xhtml:link rel="alternate" hreflang="cs" href="https://gothic-modding-community.github.io/gmc/cs/"/>
</url>
<url>
    <loc>https://gothic-modding-community.github.io/gmc/pl/</loc>
    <lastmod>2023-03-20</lastmod>
    <changefreq>daily</changefreq>
    <xhtml:link rel="alternate" hreflang="en" href="https://gothic-modding-community.github.io/gmc/"/>
    <xhtml:link rel="alternate" hreflang="pl" href="https://gothic-modding-community.github.io/gmc/pl/"/>
    <xhtml:link rel="alternate" hreflang="cs" href="https://gothic-modding-community.github.io/gmc/cs/"/>
</url>
<url>
    <loc>https://gothic-modding-community.github.io/gmc/cs/</loc>
    <lastmod>2023-03-20</lastmod>
    <changefreq>daily</changefreq>
    <xhtml:link rel="alternate" hreflang="en" href="https://gothic-modding-community.github.io/gmc/"/>
    <xhtml:link rel="alternate" hreflang="pl" href="https://gothic-modding-community.github.io/gmc/pl/"/>
    <xhtml:link rel="alternate" hreflang="cs" href="https://gothic-modding-community.github.io/gmc/cs/"/>
</url>

image

kamilkrzyskow commented 1 year ago

OK, the docs I'm working with use file.md, file.pl.md and file.cs.md which are all 3 unique cases for each of the languages. And the example docs use the default setting and reuse the en files, so you have en and fr but 3 languages selections. Therefore I guess my issue is that I unified it to use the for loc_file in file.alternates.values() method. In my case there are 3 alternates to a file, and 3 languages. In your case there are 2 alternates and 3 languages.

At least that is my assumption after comparing the config. I'll look deeper into a solution a little bit later.

kamilkrzyskow commented 1 year ago

Wait, if there are 2 versions for en, / and /en/ does it have to appear 2 times in hreflang alternates? Because if I'm understanding you correctly @ultrabug you want something like this:

I temporary created it manually

<url>
<loc>https://ultrabug.github.io/mkdocs-static-i18n/</loc>
<lastmod>2023-03-20</lastmod>
<changefreq>daily</changefreq>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/"/>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
<xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
</url>
<url>
<loc>https://ultrabug.github.io/mkdocs-static-i18n/en/</loc>
<lastmod>2023-03-20</lastmod>
<changefreq>daily</changefreq>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/"/>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
<xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
</url>
<url>
<loc>https://ultrabug.github.io/mkdocs-static-i18n/fr/</loc>
<lastmod>2023-03-20</lastmod>
<changefreq>daily</changefreq>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/"/>
<xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/"/>
<xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/"/>
</url>

This doesn't look right to me 🤔 EDIT: This is the result from one of the validators from Google's docs image


EDIT2: OK, I think I finally understood it this time. I have to keep the default base <loc> with the alternates, but additionally create an <url> for each of the alternates 😮‍💨. image


EDIT3: I think I got the structure right, but the validator still has ❌😞For now I'm out of ideas sitemap.xml.txt image

kamilkrzyskow commented 1 year ago

I think it works. Based on this https://developers.google.com/search/docs/specialty/international/localized-versions#xdefault and this https://developers.google.com/search/docs/specialty/international/localized-versions#common-mistakes I figured out a hack. For the default main link it creates the x-default hreflang link. It's not completely up to the standard, as it seems that the x-default should be used for some kind of language selection URL, but here I'm using the link as a backup for the duplicate en / default locale.

The validators is also happy for my page where I'm using en with the / path. Same link with multiple locale tags seems to be fine. image

That is it from me on the matter today, I'm waiting for feedback 👋

jonaharagon commented 1 year ago

I don't think there should be a duplicate English site being built if the plugin is configured correctly? Maybe I'm wrong but I believe the default language should be set to en and then (importantly) the en language should be set to build: false so that the English site gets built only on / and not /en/ right?

Regardless, I don't think it's ideal to use x-default in that manner. I would think using en for both would be at least more accurate although probably would cause duplicate content issues (which is why I simply wouldn't build /en/ in the first place), unless you already tried that?

kamilkrzyskow commented 1 year ago

Thanks for your input @jonaharagon. Since a few months, our team stopped building the duplicate /en/ branch of the website to slightly improve build times. Like you said the default language is now only built to the / path. My initial solution was built around that in mind and the result was correct here https://github.com/ultrabug/mkdocs-static-i18n/pull/205#issuecomment-1476726282

However, by default the plugin doesn't prevent (and even encourages) using both the / and /lang_code/ paths for the default language. I've created duplicate entries manually and the validator marked it as invalid (first image here https://github.com/ultrabug/mkdocs-static-i18n/pull/205#issuecomment-1476760412)

I'm unsure now, whether to trust the validator or not😱, because it likes the x-default approach more, than the duplicated one.

EDIT: I've used the other tool recommended on the Google's docs page. The Sitemap generator generated duplicate entries just like you said:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://ultrabug.github.io/mkdocs-static-i18n/</loc>
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/" />
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/" />
        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/" />
    </url>
    <url>
        <loc>https://ultrabug.github.io/mkdocs-static-i18n/en/</loc>
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/" />
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/" />
        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/" />
    </url>
    <url>
        <loc>https://ultrabug.github.io/mkdocs-static-i18n/fr/</loc>
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/" />
        <xhtml:link rel="alternate" hreflang="en" href="https://ultrabug.github.io/mkdocs-static-i18n/en/" />
        <xhtml:link rel="alternate" hreflang="fr" href="https://ultrabug.github.io/mkdocs-static-i18n/fr/" />
    </url>
</urlset>

EDIT2: But multiple sources claim that it's not completely wrong to use x-default for every page either 🤷. I guess this is one of those catch-22 situations 😞.


Anyways, I've made it so it creates duplicate hreflang entries instead of the x-default. This happens only when there is actually a difference between the default and alternate languages / and /en/. There are no duplicates for one path /. I've also updated the naming of variables yet again. The triple for loop could probably be simplified via some Jinja2 macro, but I never worked with such thing and I do not want to learn it right now. At least the formatting is cleaner than the initial one ✌️ Also since it creates duplicate entries now, maybe it's worth mentioning about it somewhere in the README or as a INFO Warning, not sure which one would be the best🤔

ultrabug commented 1 year ago

Maybe I'm wrong but I believe the default language should be set to en and then (importantly) the en language should be set to build: false so that the English site gets built only on / and not /en/ right?

However, by default the plugin doesn't prevent (and even encourages) using both the / and /lang_code/ paths for the default language. I've created duplicate entries manually and the validator marked it as invalid (first image here https://github.com/ultrabug/mkdocs-static-i18n/pull/205#issuecomment-1476760412)

Kamil is right, the plugin defaults to building everything but exposes an option to not build the /default_language/ version.

I'd do without the x-default as well to keep the structure as simple as possible.

Adding 3 alternate for each <url> seems the way to go. Can I ask that we only add the /default_language url or any other only if it's built (config build = true) @kamilkrzyskow ? (unless you already tested it and it works, I didnt check it yet)

kamilkrzyskow commented 1 year ago

:0 I didn't actually test it. It works when I run it now. But the working part is a side effect, I'm not sure why it works, because the pages.alternates should still be filled out in the suffix_structure on_files event, so the sitemap loops should still loop over the alternates, but they do not🤔

EDIT: https://github.com/ultrabug/mkdocs-static-i18n/blob/2d1f3c6330e92bbe9ce3c67f1e9e9657a0b19091/mkdocs_static_i18n/plugin.py#L264-L266

It skips a part of the on_config event, therefore it doesn't insert the new sitemap template anyways @ultrabug

EDIT2: 🤦I see that I misread. I did test the ability to turn off single languages throughout the whole process. So yes, if I set build: false it doesn't get added to the sitemap, BUT if instead of default_language_only the user sets all of the alternate languages to False, then the default language alternate is still being added.

- i18n:
  default_language: !ENV [GMC_DEFAULT_LANG, en]
  default_language_only: !ENV [GMC_ONLY_DEFAULT_LANG, False]
  languages:
    en:
      name: en - English
      # Disable the /en/ path, a duplicate of the `default_language` build.
      build: false
    pl:
      name: pl - Polski
      build: false
    cs:
      name: cs - Čeština
      build: false
  material_alternate: true
<url>
        <loc>https://gothic-modding-community.github.io/gmc/</loc>
        <lastmod>2023-03-21</lastmod>
        <changefreq>daily</changefreq>
        <xhtml:link rel="alternate" hreflang="en" href="https://gothic-modding-community.github.io/gmc/"/>
    </url>

So I'm not sure if that's supposed to be also cleaned up 🤔

kamilkrzyskow commented 1 year ago

I fixed the issue with the "hanging" alternate if there is only 1 language built, there will be no alternates hreflangs. However, when checking for different edge-cases I found a "bug" that has been there since before the PR. This config:

- i18n:
    default_language: !ENV [DEFAULT_LANGUAGE, "en"]
    default_language_only: !ENV [DEFAULT_LANGUAGE_ONLY, false]
    docs_structure: suffix
    languages:
      default:
        name: Default (en)
        build: false
      en:
        name: English
        build: true
        site_name: "MkDocs static i18n plugin demo (en)"
      fr:
        name: Français
        build: true
        site_name: "Démo du plugin MkDocs static i18n (fr)"

Creates an empty sitemap.xml, but then again the default language was set to false, so maybe MkDocs itself returns this sitemap, but there is no error etc. Anyways, I think it's not worth looking into more in-depth 👌.

ultrabug commented 1 year ago

Thanks for digging into this

We'll have to cover those scenarios with tests @kamilkrzyskow

kamilkrzyskow commented 1 year ago

Sure, I agree, since I stumbled across many traps at every step 😆. However, I'm unavailable for the rest of the day, so I'll get back to it in the evening or tomorrow. Additionally, I'll try to come up with a more generic less copy-paste approach, than the one I've seen in the search tests ✌️

ultrabug commented 1 year ago

Thank you.

Yes I had plans to rework the tests suite; I don't like it either.

ultrabug commented 1 year ago

The more I read about and test sitemaps the more I think that we should target something near what @jonaharagon was surprised about.

I'm also starting to wonder if my initial choice of building the default language version by default is sane for SEO wrt duplicated content...

I think we should strive to get a sitemap where we get a number of <link> equal to the number of languages configured; no more.

Which means that:

IF in the future we want to allow to build sites with language suffixes only (and NO default urls) then we'll have to adapt our sitemap logic to that, but that's a topic for later.

Thoughts?

ultrabug commented 1 year ago

2023-03-22-114356_2920x895_scrot

To be perfectly clear about what I mean

jonaharagon commented 1 year ago

If I were you what I would probably do is make it so that if build: true isn't explicitly set on the default language, then the default language is only built on .../; and if build: true is explicitly set on the default language, then the default language is only built on .../en/ and not .../ (i.e. nothing gets built on the root directory at all).

That way people can have a custom language switcher or language-based redirect on the root (configured separately with their webserver) with language paths for all languages if they choose, but otherwise it would work the way you describe.

And also that way, either way you configure it the default language is only ever being built once, which I think is ideal for SEO, and the plugin would only ever be aware of one version per language which should make the sitemap logic easier?

kamilkrzyskow commented 1 year ago

I turned off the /en/ path via build: false, because the language switcher didn't even allow to select it 😅, and all paths pointed to the / path, and the /en/ was just adding to the built time... So sure we could omit the /lang_code/ hreflang links if the default language has the same locale. This would avoid duplicate entries and the default language duplicate sub-path is "inaccessible" anyways from the site.

However, since I still have to figure out the tests (since I watched fitness band reviews for 6h instead of writing them 😓...) I will turn the PR to a draft.

kamilkrzyskow commented 1 year ago

I finally managed to glue some tests together @ultrabug. By default both suffix and folder configs are built (can be overridden). It's possible to override the default settings from the initial config. It's then possible to add more settings if the default one is lacking. I imagined a builder pattern design:

raw_config = base.create_config()
# Example function if the need arises to use such approach
base.add_plugin_to_config(raw_config, plugin_name, plugin_options)
base.add_new_language(raw_config, lang_code, lang_options)

for structure, mkdocs_config in base.load_configs(raw_config):
    # Add files to the current test based on the structure
    base.add_stuff_pre_build(structure, mkdocs_config)
    mkdocs_config.on_env(...)

    build(mkdocs_config)

    # Validate

I'm not sure how much cleaner it will be in the end when all of the functionality is in place 🤔, but I think it will be at least this much🤏better ✌️. The downside of the approach with both configs being built is the lower performance. However, thanks to the fact that all of such Temporary directory tests are separated you could run pytest-xdist with -n 4 or -n auto to make it run across different threads.


Anyways, let me know what you think if it's a valid approach then I will squash the PR into 3 commits.


Oh and also apart of the buggy default: build: false which creates an empty sitemap file, I also noticed that the link doesn't influence the build/sitemap. The language switcher changes to a different path, but the site is still built in the ./language_code path.

kamilkrzyskow commented 1 year ago

@ultrabug 🙋 Hi, just a reminder ping. I'm still waiting for feedback to my tests etc. ✌️

ultrabug commented 1 year ago

@kamilkrzyskow sorry I'm off for a few days, back on Monday.

I've actually been thinking about you inputs and @jonaharagon as well.

The current code base is hard to grasp, full of duplicate code and could be made simpler / better. You mentioned being able to run multiple mkdocs.commands.build and this idea was also on my mind at the start of the project but I failed to find a way to make it happen back then.

MkDocs 1.4+ has actually made things simpler so I took this idea for a new spin... and succeeded. I'm at the last bits of a major rewrite of the plugin, with simpler and yet more powerful configuration as well. This can actually become a v1.0 I think...

Would you be up to test it when I'm done? Should be next week for sure. I'll do a PR ofc.

kamilkrzyskow commented 1 year ago

:eyes: Complete rewrite sounds awesome, and almost done too 😮. Sure, I can test it. I'll still be relatively active/accessible during this month, in May a little bit less, and sine June very limited.

ultrabug commented 1 year ago

I'm happy to announce release v1.0.0 of the mkdocs-static-i18n plugin!

I'll soon add a CHANGELOG with other mentions, thanks a lot @kamilkrzyskow

kamilkrzyskow commented 1 year ago

I appreciate the words of gratitude @ultrabug, but I would have preferred you keeping your word from this comment: https://github.com/ultrabug/mkdocs-static-i18n/pull/216#issuecomment-1495455932

Even more importantly, I absolutely have no intention of merging this in place of anyone's contributions. Being a long time OSS author and contributor, I'm also very strict about that.

Yet you merged https://github.com/ultrabug/mkdocs-static-i18n/pull/216 into main without merging my changes to the sitemap... So my work will be forgotten (by me as well, since it took only a couple of hours), and even the git history won't remember.

I shall end the matter here and not mention in again, as an act of appreciation for your continued support of the plugin, but in the future, I will be surely more reluctant when it comes to PRs to this repository. :v: