Fails when attempting to read Norwegian

bruce133 commented 1 year ago

Expected behavior

Should use an available Norwegian voice when trying to read Norwegian text.

Actual behavior

Plays the error message "Sorry, no available voice language detected for the selected text."

Steps to reproduce behavior

Install a Norwegian voice.
Select any piece of text on any Norwegian site and have Talkie read it.

Examples of Norwegian websites

Text and language

Which part of the website did you want spoken: Any
If possible, include the full text: ...
Language the text was expected to be spoken in: Norwegian

System information

Your browser: Chrome
Your browser version: 109.0.5414.120 (Official Build) (64-bit)
Your operating system: Windows
Your operating system version: 10

Additional information

The problem is likely caused by the language being detected as "no", while the ISO language code used by Talkie for Norwegian is "nb". Note that "nb" is actually the ISO code for Norwegian Bokmål, which is one of two official written standards in Norway.

Here are the three Norwegian ISO codes, sourced from https://www.w3schools.com/tags/ref_language_codes.asp:

no: Norwegian
nb: Norwegian Bokmål
nn: Norwegian Nynorsk

It might also be worth mentioning that you'll likely see different ISO codes being used on Norwegian sites; on the three examples provided, different HTML lang attribute values are being used:

joelpurra commented 10 months ago

@bruce133: yes, the language code variants may cause problems when matching website metadata languages and browser language detection routines against voices' built-in language codes. Norwegian may be a trickier case than usual, depending on how important the differences between no/nb/nn language codes are.

Thank you for providing three different example URLs. It would certainly help usability if Talkie "just works" on "all" Norwegian websites.

Would voices from different Norwegian variants have an audible difference on these three example websites?
Do you think it would be considered alright, in practical terms, for Norwegian readers/speakers if no/nb/nn voices were used interchangeably?
Which language code is "canonical" among no/nb/nn? Perhaps no?

One approach for Talkie is to expand every Norwegian language code to all three variants, and then find voices which match any of them. A perfect match (nn voice for nn website) could be prioritized, but then configuring and applying a single preferred Norwegian voice (if it was not also nn) would not work. Adhering to the user's preferred Norwegian voice seems more important than a "perfect" match, would you agree?
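To make the priority order concrete, here is a rough sketch of the "expand, then prefer the user's voice" idea. It is not Talkie's actual code: `VoiceLike`, `expandLanguage`, `pickVoice` and the voice names are hypothetical, and only the primary language subtag is compared.

```typescript
// Hypothetical minimal voice shape; real browser voices carry more fields.
interface VoiceLike {
  name: string;
  lang: string;
}

// Assumption: no/nb/nn are treated as one interchangeable family.
const NORWEGIAN_CODES: ReadonlySet<string> = new Set(["no", "nb", "nn"]);

// Extract the lowercase primary subtag, for example "nb" from "nb-NO".
function primarySubtag(tag: string): string {
  return (tag.split(/[-_]/)[0] ?? "").toLowerCase();
}

// Expand a code to every code considered equivalent to it.
function expandLanguage(code: string): ReadonlySet<string> {
  return NORWEGIAN_CODES.has(code) ? NORWEGIAN_CODES : new Set([code]);
}

// Order: the user's preferred voice, then an exact code match, then any family member.
function pickVoice(
  voices: readonly VoiceLike[],
  pageLanguage: string,
  preferredVoiceName?: string,
): VoiceLike | undefined {
  const pageCode = primarySubtag(pageLanguage);
  const accepted = expandLanguage(pageCode);
  const candidates = voices.filter((voice) => accepted.has(primarySubtag(voice.lang)));

  const preferred =
    preferredVoiceName === undefined
      ? undefined
      : candidates.find((voice) => voice.name === preferredVoiceName);

  return (
    preferred ??
    candidates.find((voice) => primarySubtag(voice.lang) === pageCode) ??
    candidates[0]
  );
}
```

With this ordering, a user who configured an nb voice keeps hearing it on an nn page, and the exact match only decides when no preference is set.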

Thus all Norwegian voices need to be treated under a single language code, meaning that rather than expanding to all language code variants it is easier to reduce to a single "canonical" code no. Internal voice objects could store both the original and parsed language codes. Websites which use nb/nn would have log output indicating that they have been normalized to no.

Dialects and regions may also need parsing and special handling, but the example nb-no would still work as no-no. Perhaps there are other combinations which do not.
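The normalization described above could look roughly like the sketch below. The mapping table, `ParsedLanguage` and `normalizeLanguageTag` are illustrative assumptions, not Talkie's existing API; only the primary subtag is rewritten and any region part is passed through unchanged.

```typescript
// Hypothetical mapping from variant codes to a single "canonical" code.
const CANONICAL_LANGUAGE: ReadonlyMap<string, string> = new Map([
  ["nb", "no"],
  ["nn", "no"],
]);

// Keep both values so log output can show what was normalized.
interface ParsedLanguage {
  original: string; // as found on the website or voice
  normalized: string; // after mapping the primary subtag
}

function normalizeLanguageTag(tag: string): ParsedLanguage {
  // Accept "nb_NO" as well as "nb-NO"; some sources use underscores.
  const [language = "", ...rest] = tag.trim().replace(/_/g, "-").split("-");
  const lower = language.toLowerCase();
  const canonical = CANONICAL_LANGUAGE.get(lower) ?? lower;

  return {
    original: tag,
    normalized: [canonical, ...rest].join("-"),
  };
}

// The example from the discussion: nb-no becomes no-no.
const example = normalizeLanguageTag("nb-no");
console.log(`${example.original} normalized to ${example.normalized}`);
```

Applying the same function to both website languages and voice languages means the two sides are compared in the same "canonical" space.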

The same should be applied to other languages where mappings are needed, both for parsing website and voice languages.

https://github.com/joelpurra/talkie/tree/v7.0.1/code/packages/shared-locales/src/data/_locales/nb

You are right that Talkie's user interface uses nb for the Norwegian user interface locale/translation. Browser extension locales are a separate system, which has to map to system locale expectations through Google Chrome, Mozilla Firefox, etcetera. It should not affect voice language mapping.

https://developer.mozilla.org/en-US/docs/Web/API/SpeechSynthesisVoice/lang
https://en.wikipedia.org/wiki/IETF_language_tag

IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49.

Note that voices' self-specified language does not use "plain" ISO 639 but IETF BCP 47, which complicates things further. Talkie itself is fairly neutral with regard to matching website/voice language codes. Most matches are made against the major language code, since relatively few websites specify a dialect.
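As an illustration of why matching on the major language code alone fails here, consider this simplified sketch. It is an assumption about the general shape of such matching, not a copy of Talkie's language helper; `VoiceLike`, `primaryLanguage` and `voicesForLanguage` are made-up names.

```typescript
// Hypothetical minimal voice shape with a BCP 47 style language tag.
interface VoiceLike {
  name: string;
  lang: string;
}

// Extract the lowercase major language code, for example "nb" from "nb-NO".
function primaryLanguage(tag: string): string {
  return (tag.split(/[-_]/)[0] ?? "").toLowerCase();
}

// Keep only voices whose major language code equals the detected one.
function voicesForLanguage(
  voices: readonly VoiceLike[],
  detectedLanguage: string,
): VoiceLike[] {
  const wanted = primaryLanguage(detectedLanguage);
  return voices.filter((voice) => primaryLanguage(voice.lang) === wanted);
}

const installed: VoiceLike[] = [
  { name: "hypothetical-norwegian", lang: "nb-NO" },
  { name: "hypothetical-english", lang: "en-US" },
];

// Text detected as "no" finds nothing, although a Norwegian voice is installed.
console.log(voicesForLanguage(installed, "no").length); // 0
// The same voice is found when the detected code happens to be "nb".
console.log(voicesForLanguage(installed, "nb").length); // 1
```

The mismatch is purely between "no" and "nb", which is why a mapping step before the comparison would resolve it.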

https://github.com/joelpurra/talkie/blob/v7.0.1/code/packages/browser-background/src/language-helper.mts#L63-L72

Talkie already has some rudimentary language code 1:1 mapping for issues I stumbled upon myself during testing. For Norwegian variants a single 1:1 mapping would not be enough, but perhaps two mappings would.

I have previously looked at using wooorm's BCP-47 library to parse language tags, at least to reliably extract language/region for some mislabeled voices. It includes limited mapping for Norwegian: no-bok → nb and no-nyn → nn. This is also not enough for Talkie's use-case.

bruce133 commented 10 months ago

Would voices from different Norwegian variants have an audible difference on these three example websites?
- Short answer: I would say yes, most likely.
- Longer answer: Although the variants are written standards, I believe Norwegian dialects are normally associated with one over the other. Therefore, the dialect that you choose for a voice will decide which written standard is associated.
Do you think it would be considered alright, in practical terms, for Norwegian readers/speakers if no/nb/nn voices were used interchangeably?
- While most Norwegians would likely understand the speech, I believe it would be odd if e.g. a typical Oslo dialect was used to read Nynorsk. However, as long as the user can choose their preferred voice, then it would probably be fine. Edit: Also, most Norwegian written content on the web is probably in Bokmål, so this would likely be more of a niche occurrence.
Which language code is "canonical" among no/nb/nn? Perhaps no?
- Well, the language code would be 'no', but if you mean which written standard is most used, then the answer would be Bokmål ('nb'). According to the national statistical institute, as many as 87.3% of school children had Bokmål as their chosen written standard in 2021 (source: https://www.ssb.no/utdanning/grunnskoler/statistikk/elevar-i-grunnskolen/artikler/1-av-10-har-nynorsk-som-hovudmal-i-skolen). Therefore, if you were to choose just one, I'd recommend Bokmål.

Have a look at https://www.synthesia.io/features/languages; they only list the variant "Norwegian - Natural", which is really a typical Oslo dialect, associated with Bokmål. Normalizing in this manner would probably be the most efficient.
