jonasmerlin / astro-seo

Makes it easy to add information that is relevant for SEO to your Astro app.
MIT License

Title with unicode doesn't work #90

Open mrcnski opened 5 months ago

mrcnski commented 5 months ago

Thanks for the utility! I think I've found a bug. It seems that setting the title prop of <SEO> to a string containing non-ASCII (Unicode) characters breaks the generated HTML:

[Image: Screenshot 2024-02-22 at 15 26 40]

ttmc commented 5 months ago

I think this might be because astro-seo currently puts the <meta charset="UTF-8" /> tag after the title tag. I'm going to create a separate issue for that.
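
One way to check whether ordering is involved for a given page is to look at where the charset declaration sits in the raw bytes. Below is a minimal sketch in Python, assuming the rendered HTML is available as bytes. `charset_position` is a hypothetical helper, not part of astro-seo, and the sample page is made up. The 1024-byte check reflects the HTML standard's requirement that the declaration appear within the first 1024 bytes of the document; the sketch approximates it by looking at where the charset value ends rather than where the element ends.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb'<meta\b[^>]*\bcharset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
TITLE = re.compile(rb'<title\b', re.IGNORECASE)

def charset_position(html: bytes) -> dict:
    """Report where the charset declaration sits in the raw document bytes."""
    meta = META_CHARSET.search(html)
    title = TITLE.search(html)
    return {
        "declared": meta.group(1).decode("ascii").lower() if meta else None,
        "before_title": bool(meta and title and meta.start() < title.start()),
        # Approximation: end of the charset value, not end of the element.
        "within_first_1024_bytes": bool(meta and meta.end() <= 1024),
    }

# Made-up sample with the charset declared after the title.
page = '<head><title>日本語</title><meta charset="UTF-8" /></head>'.encode("utf-8")
report = charset_position(page)
print(report)
```

If `before_title` is false but `within_first_1024_bytes` is true, the declaration is late but should still be picked up by a browser's prescan, so ordering alone may not explain the breakage.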

jonasmerlin commented 5 months ago

Thank you for reporting this, @mrcnski, and really good catch, @ttmc. We'll discuss possible solutions in #91.

Just to confirm that this is actually the cause: do you set the charset via astro-seo, @mrcnski?

mrcnski commented 5 months ago

Hey @jonasmerlin, I didn't set charset because I thought that it would default to UTF-8 if not provided. If that's not true, maybe the docs could be clarified?

ttmc commented 5 months ago

@mrcnski While modern browsers like Chrome strongly prefer UTF-8 as the default charset for websites without explicit declaration, there isn't a single guaranteed assumption. It's a multi-step process with multiple fallback options... here's what Chrome does:

  1. Byte Order Mark (BOM): Chrome first checks if the website content starts with a Byte Order Mark, which is a sequence of bytes indicating the specific encoding used. If a UTF-8 BOM is present, the browser assumes UTF-8 encoding.
  2. HTTP Headers: If no BOM is found, Chrome looks for the Content-Type header in the HTTP response. This header can explicitly specify the charset used, and if it mentions UTF-8, that will be used.
  3. Meta Tag: If neither BOM nor the Content-Type header provides a clear answer, Chrome checks for a <meta charset="utf-8"> tag within the HTML document itself. If present, this explicitly declares UTF-8 as the encoding.
  4. Heuristic Detection: If none of the above methods provide a clear indication, Chrome attempts to "guess" the charset based on heuristics and statistical analysis of the content itself. This involves looking for patterns and similarities with known encodings, but it's not always accurate and can lead to misinterpretations, especially for content containing characters from multiple languages.
  5. Fallback Default: If all attempts to identify the charset fail, Chrome resorts to a fallback default encoding. This is implementation-dependent and can vary across browsers and browser versions. In practice it is generally derived from the user's locale or system configuration, and for many Western locales it is Windows-1252 (which browsers also use for content labeled ISO-8859-1) rather than UTF-8.
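
The lookup order described in this list can be sketched roughly as follows. This is a simplified illustration of the steps as written here, not Chrome's actual implementation. `sniff_encoding` and its parameters are hypothetical names, the heuristic step is left out, the fallback is simply passed in by the caller, and the 1024-byte limit on the meta scan is an assumption based on the prescan limit in the HTML standard.

```python
import codecs
import re
from typing import Optional

META_CHARSET = re.compile(rb'<meta\b[^>]*\bcharset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([\w-]+)', re.IGNORECASE)

def sniff_encoding(body: bytes, content_type: Optional[str] = None,
                   fallback: str = "windows-1252") -> str:
    """Return the encoding name picked by a simplified version of the steps above."""
    # 1. A UTF-8 byte order mark takes precedence.
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # 2. charset parameter of the Content-Type HTTP header.
    if content_type:
        match = HEADER_CHARSET.search(content_type)
        if match:
            return match.group(1).lower()
    # 3. <meta charset> declaration, searched only in the first 1024 bytes.
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    # 4. Heuristic detection is omitted in this sketch.
    # 5. Fallback default (assumed value; the real one depends on browser and locale).
    return fallback

# Made-up page with no charset declaration at all.
html = "<head><title>日本語</title></head>".encode("utf-8")
no_hint = sniff_encoding(html)
with_header = sniff_encoding(html, "text/html; charset=utf-8")
print(no_hint, with_header)
```

With no BOM, header, or meta tag, the sketch falls through to the legacy fallback, which is the situation in which a Unicode title gets garbled.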

However, relying on browser guessing and fallback defaults is strongly discouraged for several reasons:

Therefore, it's essential for website developers to explicitly declare the character encoding using either the Content-Type header or the <meta charset> tag, preferably using UTF-8 due to its widespread adoption and compatibility.
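
To make the failure mode concrete, here is a small sketch of what happens when UTF-8 bytes are decoded with a legacy fallback. It assumes the fallback is Windows-1252 (`cp1252` in Python), and the title is a made-up sample, not the one from the original report.

```python
title = "Café"

# Bytes as they would be served for a UTF-8 encoded page.
raw = title.encode("utf-8")

# What a reader sees if those bytes are decoded as Windows-1252 instead.
garbled = raw.decode("cp1252")
print(garbled)  # CafÃ©

# The bytes themselves are intact; only the interpretation was wrong.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered == title)  # True
```

The two-byte UTF-8 sequence for "é" is shown as two separate Windows-1252 characters, which is the typical look of a title rendered without a charset declaration.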

mrcnski commented 5 months ago

Thanks for the detailed explanation, @ttmc. For some reason I thought that astro-seo would set a default of UTF-8 if this field was omitted. I can see why we wouldn't want to set any default values, as they may already be set outside of the component. My mistake, but maybe this part of the README could be clarified slightly?

Set the charset of the document. In almost all cases this should be UTF-8.

ttmc commented 5 months ago

@mrcnski Over on issue #91, one suggestion has been to make the astro-seo integration check if the charset gets set to UTF-8 (by something or someone other than the astro-seo integration), and if it hasn't been set, then the integration will inject the charset declaration at the top of the head.
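
The idea could look roughly like this. It is a sketch in Python for illustration only, operating on an HTML string; a real integration would be written against Astro's own APIs, which this does not attempt to model. `ensure_charset` is a hypothetical name, and as a simplification it checks for any existing charset declaration rather than specifically UTF-8.

```python
import re

HAS_CHARSET = re.compile(r'<meta\b[^>]*\bcharset\s*=', re.IGNORECASE)
HEAD_OPEN = re.compile(r'<head\b[^>]*>', re.IGNORECASE)

def ensure_charset(html: str) -> str:
    """Insert <meta charset="UTF-8"> right after <head> unless a charset is already declared."""
    if HAS_CHARSET.search(html):
        return html  # something else already declared a charset; leave it alone
    head = HEAD_OPEN.search(html)
    if head is None:
        return html  # no <head> element to inject into
    return html[:head.end()] + '<meta charset="UTF-8">' + html[head.end():]

# Made-up input without any charset declaration.
fixed = ensure_charset("<head><title>日本語</title></head>")
print(fixed)
```

Injecting directly after the opening head tag keeps the declaration ahead of the title, which also addresses the ordering concern raised earlier in the thread.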