joelpurra / talkie

Text-to-speech browser extension button. Select text on any web page, and have the computer read it out loud for you by simply clicking the Talkie button.
https://joelpurra.com/projects/talkie/
GNU General Public License v3.0

Fails when attempting to read Norwegian #42

Open bruce133 opened 1 year ago

bruce133 commented 1 year ago

Expected behavior

Should use an available Norwegian voice when trying to read Norwegian text.

Actual behavior

Plays the error message "Sorry, no available voice language detected for the selected text."

Steps to reproduce behavior

  1. Install a Norwegian voice.
  2. Select any piece of text on any Norwegian site and have Talkie read it.

Examples of Norwegian websites

Text and language

System information

Additional information

The problem is likely caused by the language being detected as "no", while the ISO language code used by Talkie for Norwegian is "nb". Note that "nb" is actually the ISO code for Norwegian Bokmål, which is one of two official written standards in Norway.

Here are the three Norwegian ISO codes, sourced from https://www.w3schools.com/tags/ref_language_codes.asp:

| Language | ISO code |
| -- | -- |
| Norwegian | no |
| Norwegian bokmål | nb |
| Norwegian nynorsk | nn |

It might also be worth mentioning that you'll likely see different ISO codes being used on Norwegian sites; on the three examples provided, different HTML lang attribute values are being used:

| URL | HTML lang |
| -- | -- |
| https://no.wikipedia.org/wiki/Norsk | nb |
| https://snl.no/norsk | no |
| https://www.dagbladet.no/ | nb-no |

joelpurra commented 10 months ago

@bruce133: yes, the language code variants may cause problems when matching website metadata languages and browser language detection routines against voices' built-in language codes. Norwegian may be a trickier case than usual, depending on how important the differences between no/nb/nn language codes are.

Thank you for providing three different example URLs. It would certainly help usability if Talkie "just works" on "all" Norwegian websites.


One approach for Talkie is to expand every Norwegian language code to all three variants, and then find voices which match any of them. A perfect match (nn voice for nn website) could be prioritized, but then a single preferred Norwegian voice configured by the user would not be applied unless it also happened to be nn. Adhering to the user's preferred Norwegian voice seems more important than a "perfect" match, would you agree?

Thus all Norwegian voices need to be treated under a single language code, meaning that rather than expanding to all language code variants it is easier to reduce them to a single canonical code, "no". Internal voice objects could store both the original and parsed language codes. Websites which use "nb" or "nn" would have log output indicating that they have been normalized to "no".
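
A minimal sketch of that reduction, assuming a small hand-maintained mapping table. The names `CANONICAL_LANGUAGE_CODES`, `NormalizedLanguage`, and `normalizeLanguageCode` are hypothetical and not taken from Talkie's source, and the sketch only handles bare codes such as nb, not tags with a region.

```typescript
// Hypothetical mapping table; each entry reduces a variant to one canonical code.
const CANONICAL_LANGUAGE_CODES = new Map<string, string>([
  ["nb", "no"],
  ["nn", "no"],
]);

// Keeps the original code next to the parsed one, as suggested for voice objects.
interface NormalizedLanguage {
  original: string;
  canonical: string;
}

function normalizeLanguageCode(code: string): NormalizedLanguage {
  const lower = code.trim().toLowerCase();
  // Codes without a mapping are passed through unchanged (lowercased).
  const canonical = CANONICAL_LANGUAGE_CODES.get(lower) ?? lower;

  if (canonical !== lower) {
    // Log output so that the normalization is visible when debugging.
    console.log(`Language code "${code}" normalized to "${canonical}".`);
  }

  return { original: code, canonical };
}
```

With this, both `normalizeLanguageCode("nb")` and `normalizeLanguageCode("nn")` yield the canonical code no, while the original value stays available for logging or display.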

Dialects and regions may also need parsing and special handling, but the example nb-no would still work as no-no. Perhaps there are other combinations which do not.
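
As a rough illustration of the region case, the sketch below replaces only the first subtag of a tag and keeps the rest, so nb-no becomes no-no. It assumes subtags are separated by hyphens and compares them case-insensitively; `CANONICAL_PRIMARY_SUBTAGS` and `normalizeLanguageTag` are hypothetical names, not existing Talkie functions.

```typescript
// Hypothetical mapping for the primary (first) subtag only.
const CANONICAL_PRIMARY_SUBTAGS = new Map<string, string>([
  ["nb", "no"],
  ["nn", "no"],
]);

function normalizeLanguageTag(tag: string): string {
  // Split "nb-no" into the primary subtag and any remaining subtags.
  const [primary = "", ...rest] = tag.trim().toLowerCase().split("-");
  const canonical = CANONICAL_PRIMARY_SUBTAGS.get(primary) ?? primary;

  // Region and other subtags are kept as-is (lowercased), only the language changes.
  return [canonical, ...rest].join("-");
}
```

Combinations where the remaining subtags carry meaning that depends on the original language code would need extra handling that this sketch does not attempt.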

The same should be applied to other languages where mappings are needed, both for parsing website and voice languages.


You are right in that Talkie's user interface uses nb for the Norwegian user interface locale/translation. Browser extension locales are a separate system which has to map to system locale expectations through Google Chrome, Mozilla Firefox, etcetera. It should not affect voice language mapping.

Note that voices' self-specified language does not use "plain" ISO 639 but IETF BCP 47, which complicates things further. Talkie itself is fairly neutral with regard to matching website/voice language codes. Most matches are made against the major language code, since relatively few websites specify a dialect.
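
To make the matching on the major language code concrete, here is a hedged sketch of how a page language could be matched against voice languages once both are reduced to a canonical primary subtag, with the user's preferred voice taking priority over a "perfect" variant match. Everything in it (`CandidateVoice`, `CANONICAL_PRIMARY`, `primaryLanguage`, `pickVoice`, and the voice names in the usage note) is an assumption for illustration and does not describe Talkie's actual matching code.

```typescript
// Hypothetical minimal voice shape: a display name and a BCP 47 language tag.
interface CandidateVoice {
  name: string;
  lang: string;
}

const CANONICAL_PRIMARY = new Map<string, string>([
  ["nb", "no"],
  ["nn", "no"],
]);

// Extracts the first subtag ("nb" from "nb-NO") and reduces it to its canonical code.
function primaryLanguage(tag: string): string {
  const primary = tag.trim().toLowerCase().split("-")[0] ?? "";
  return CANONICAL_PRIMARY.get(primary) ?? primary;
}

function pickVoice(
  pageLanguage: string,
  voices: readonly CandidateVoice[],
  preferredVoiceName?: string,
): CandidateVoice | undefined {
  const wanted = primaryLanguage(pageLanguage);
  const candidates = voices.filter((voice) => primaryLanguage(voice.lang) === wanted);

  // The user's preferred voice wins if it is among the candidates;
  // otherwise fall back to the first candidate, or undefined if there is none.
  return candidates.find((voice) => voice.name === preferredVoiceName) ?? candidates[0];
}
```

For example, given voices `{ name: "Voice A", lang: "nb-NO" }` and `{ name: "Voice B", lang: "nn-NO" }`, a page marked as nb-no with "Voice B" configured as the preferred voice would be read by "Voice B", even though "Voice A" is the closer variant match.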

Talkie already has some rudimentary language code 1:1 mapping for issues I stumbled upon myself during testing. For Norwegian variants a single 1:1 mapping would not be enough, but perhaps two mappings would.

I have previously looked at using wooorm's BCP-47 library to parse language tags, at least to reliably extract language/region for some mislabeled voices. It includes limited mapping for Norwegian: no-bok → nb and no-nyn → nn. This is also not enough for Talkie's use-case.

bruce133 commented 10 months ago

Have a look at https://www.synthesia.io/features/languages; they only list the variant "Norwegian - Natural", which is really a typical Oslo dialect, associated with Bokmål. Normalizing in this manner would probably be the most efficient.