Open bruce133 opened 1 year ago
@bruce133: yes, the language code variants may cause problems when matching website metadata languages and browser language detection routines against voices' built-in language codes. Norwegian may be a trickier case than usual, depending on how important the differences between the `no`/`nb`/`nn` language codes are.
Thank you for providing three different example URLs. It would certainly help usability if Talkie "just works" on "all" Norwegian websites.
What if `no`/`nb`/`nn` voices were used interchangeably? Which of `no`/`nb`/`nn`? Perhaps `no`?

One approach for Talkie is to attempt to expand all Norwegian language codes to all three variants, and then find voices which match any of them. A perfect match (`nn` voice for `nn` website) could be prioritized, but then configuring and applying a single preferred Norwegian voice (if it was not also `nn`) would not work. Adhering to the user's preferred Norwegian voice seems more important than a "perfect" match, would you agree?
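To make the trade-off concrete, here is a minimal sketch of the expansion approach in which the user's preferred voice wins over a "perfect" match. The `Voice` shape, `expandVariants`, and `pickVoice` are hypothetical names for illustration and are not taken from Talkie's code; the sketch also assumes voices carry only a primary language code.

```typescript
// Hypothetical voice shape; Talkie's real voice objects may differ.
interface Voice {
  name: string;
  lang: string; // primary language code only in this sketch, e.g. "nb"
}

// Assumption: all three Norwegian codes are treated as one group.
const NORWEGIAN_VARIANTS: readonly string[] = ["no", "nb", "nn"];

// Expand a language code to every variant it may be matched against.
function expandVariants(lang: string): readonly string[] {
  const code = lang.toLowerCase();
  return NORWEGIAN_VARIANTS.includes(code) ? NORWEGIAN_VARIANTS : [code];
}

// Order of priority: the preferred voice (if it matches any variant),
// then an exact language match, then any voice matching a variant.
function pickVoice(
  pageLang: string,
  voices: readonly Voice[],
  preferredVoiceName?: string,
): Voice | undefined {
  const wanted = pageLang.toLowerCase();
  const variants = expandVariants(wanted);
  const candidates = voices.filter((v) => variants.includes(v.lang.toLowerCase()));
  const preferred = candidates.find((v) => v.name === preferredVoiceName);
  const exact = candidates.find((v) => v.lang.toLowerCase() === wanted);
  return preferred ?? exact ?? candidates[0];
}
```

With this ordering, a preferred `nb` voice would still be used on an `nn` website, and the exact match only applies when no preference is configured.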
Thus all Norwegian voices need to be treated under a single language code, meaning that rather than expanding to all language code variants it is easier to reduce to a single "canonical" code, `no`. Internal voice objects could store both the original and parsed language codes. Websites which use `nb`/`nn` would have log output indicating that they have been normalized to `no`.
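A rough sketch of that reduction, assuming a small lookup table and an object that keeps both codes. The names `CANONICAL_LANGUAGES`, `ParsedLanguage`, and `canonicalizeLanguage`, as well as the log message, are made up for illustration and do not reflect Talkie's actual internals.

```typescript
// Hypothetical mapping table; only the Norwegian entries come from this discussion.
const CANONICAL_LANGUAGES = new Map<string, string>([
  ["nb", "no"],
  ["nn", "no"],
]);

// Keeps the code as found on the website/voice next to the code used for matching.
interface ParsedLanguage {
  original: string;
  canonical: string;
}

function canonicalizeLanguage(original: string): ParsedLanguage {
  const code = original.trim().toLowerCase();
  const canonical = CANONICAL_LANGUAGES.get(code) ?? code;

  if (canonical !== code) {
    // Illustrative log line only, so normalization is visible when debugging.
    console.debug(`Normalized language code "${original}" to "${canonical}".`);
  }

  return { original, canonical };
}
```

Matching would then compare only the `canonical` values, while `original` stays available for display and troubleshooting.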
Dialects and regions may also need parsing and special handling, but the example `nb-no` would still work as `no-no`. Perhaps there are other combinations which do not.
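As a sketch of how the `nb-no` example could be handled, the function below replaces only the primary subtag and passes the remaining subtags through. It is a naive split on `-`/`_`, not a full BCP 47 parser, and `PRIMARY_SUBTAG_ALIASES` and `normalizeLanguageTag` are hypothetical names.

```typescript
// Hypothetical alias table for the primary language subtag only.
const PRIMARY_SUBTAG_ALIASES = new Map<string, string>([
  ["nb", "no"],
  ["nn", "no"],
]);

// Lowercases the tag, maps the primary subtag to its canonical code,
// and keeps any region/variant subtags unchanged.
function normalizeLanguageTag(tag: string): string {
  const parts = tag.trim().toLowerCase().split(/[-_]/);
  const primary = parts[0] ?? "";
  const canonical = PRIMARY_SUBTAG_ALIASES.get(primary) ?? primary;
  return [canonical, ...parts.slice(1)].join("-");
}

console.log(normalizeLanguageTag("nb-no")); // "no-no"
```

Tags where the variant is carried in a later subtag, such as `no-bok`, pass through unchanged here and would need separate handling.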
The same should be applied to other languages where mappings are needed, when parsing both website and voice languages.
You are right in that Talkie's user interface uses `nb` for the Norwegian user interface locale/translation. Browser extension locales are a separate system which has to map to system locale expectations through Google Chrome, Mozilla Firefox, et cetera. It should not affect voice language mapping.
IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49.
Note that voices' self-specified language does not use "plain" ISO 639 but IETF BCP 47, which complicates things further. Talkie itself is fairly neutral with regard to matching website/voice language codes. Most matches are made against the major language code, since relatively few websites specify a dialect.
Talkie already has some rudimentary language code 1:1 mapping for issues I stumbled upon myself during testing. For Norwegian variants a single 1:1 mapping would not be enough, but perhaps two mappings would.
I have previously looked at using wooorm's BCP-47 library to parse language tags, at least to reliably extract language/region for some mislabeled voices. It includes limited mapping for Norwegian: `no-bok` → `nb` and `no-nyn` → `nn`. This is also not enough for Talkie's use-case.
Have a look at https://www.synthesia.io/features/languages; they only list the variant "Norwegian - Natural", which is really a typical Oslo dialect, associated with Bokmål. Normalizing in this manner would probably be the most efficient.
Expected behavior
Should use an available Norwegian voice when trying to read Norwegian text.
Actual behavior
Plays the error message "Sorry, no available voice language detected for the selected text."
Steps to reproduce behavior
Examples of Norwegian websites
Text and language
System information
Additional information
The problem is likely caused by the language being detected as "no", while the ISO language code used by Talkie for Norwegian is "nb". Note that "nb" is actually the ISO code for Norwegian Bokmål, which is one of two official written standards in Norway.
Here are the three Norwegian ISO codes, sourced from https://www.w3schools.com/tags/ref_language_codes.asp:
| Language | ISO Code |
| -- | -- |
| Norwegian | no |
| Norwegian bokmål | nb |
| Norwegian nynorsk | nn |

It might also be worth mentioning that you'll likely see different ISO codes being used on Norwegian sites; on the three examples provided, different HTML `lang` attribute values are being used:

| URL | HTML lang |
| -- | -- |
| https://no.wikipedia.org/wiki/Norsk | nb |
| https://snl.no/norsk | no |
| https://www.dagbladet.no/ | nb-no |