thunderdrop / IBMTTSDictionaries

A large, community-driven pronunciation dictionary for the IBMTTS speech synthesizer in American English
Creative Commons Zero v1.0 Universal

ENUMain: Regarding incorrect abbreviations and domain names #13

Closed ultrasound1372 closed 1 year ago

ultrasound1372 commented 4 years ago

I've noticed a trend developing in ENUMain of giving special pronunciations to improperly cased versions of many website names and a few acronyms. As the domain name of a website is generally not what you see as the title of the page, I vote for these pronunciations to be removed. As for acronyms, I'm not totally sure on that one, even for some lowercase versions. I believe adding pronunciations for domain names just makes the dictionary unnecessarily large and puts an undue burden on the contributors, as virtually every website in existence with a multi-word name would have to be added. Since that goal is unattainable, which sites get added ends up heavily biased toward what the contributor happens to visit. An argument can be made that for certain sites, such as news organizations, the bias is that of the populace rather than the contributor, but I believe the domain names should be removed, and we should instead focus on actual words one will encounter in general text. If ECI has broken handling of all-caps acronyms, their existence is justified.
cc @amirsol81 @thunderdrop

amirsol81 commented 4 years ago

@ultrasound1372 As someone who has done almost all of that, I agree wholeheartedly. However, I suggest that we maintain corrections for important and internationally recognized news/technology websites. And for some of them the uppercase version is included because Eloquence mispronounces them, too.

ultrasound1372 commented 4 years ago

Perhaps we should begin purging the main dictionary of many of these domain names?

amirsol81 commented 4 years ago

Interestingly, @thunderdrop added a domain name today. I think at least major news/tech/reference/educational ones should be retained.

thunderdrop commented 4 years ago

Hmm, not me. I added some terms I often hear in Linux which are uncommon enough that they wouldn't conflict with dictionary words. No domains.

As for removing them, I'm not sure. Yes, adding them does create a bias, but that's only because there are so few people on the project at present. If we had more people, we'd have a bigger sample. After all, our whole job here is tracking down things Eloquence can't pronounce. Sorry I don't have any useful input; perhaps we need to chuck this around a bit more.

ultrasound1372 commented 4 years ago

I just don't see it as necessary, as no synth will pronounce these. These are spellings that exist only because the DNS is case-insensitive, after all. As an example, take howtogeek: when you go to the website, the page title is How-To Geek. Or thefreedictionary for The Free Dictionary. And re chmod, do we know if Linux people say ch mod, ch mode, or chmode?
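
To make the point about collapsed spellings concrete, here is a rough Python sketch of how a page title turns into the run-together form that shows up as a dictionary key. The function names (`domain_label`, `flag_domain_entries`) and the sample key list are made up for illustration, and stripping everything except letters and digits is an assumption that happens to match the two examples above. It is not how the dictionary is actually maintained.

```python
import re


def domain_label(title: str) -> str:
    """Collapse a page title into the run-together form used in a domain name.

    Assumption: case is dropped and every character that is not a lowercase
    letter or digit is removed (spaces, hyphens, punctuation).
    """
    return re.sub(r"[^a-z0-9]", "", title.lower())


def flag_domain_entries(keys, titles):
    """Map each dictionary key that matches a collapsed title to that title."""
    labels = {domain_label(t): t for t in titles}
    # Compare case-insensitively, since the casing of such keys carries no meaning.
    return {k: labels[k.lower()] for k in keys if k.lower() in labels}


# Hypothetical inputs: two site titles from this thread and three dictionary keys.
titles = ["How-To Geek", "The Free Dictionary"]
keys = ["howtogeek", "thefreedictionary", "chmod"]
flagged = flag_domain_entries(keys, titles)

for key, title in flagged.items():
    print(f"{key} -> {title}")
```

With these inputs, howtogeek and thefreedictionary are flagged as collapsed titles, while chmod is left alone because it is a term in its own right and not a squashed site name.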

amirsol81 commented 4 years ago

@ultrasound1372 For the record, it is not that no other synth pronounces them properly. For instance, both MS SAPI 5 and MS OneCore voices pronounce "howtogeek" correctly. In fact, OneCore voices handle most of them quite gracefully.

amirsol81 commented 1 year ago

@ultrasound1372 I finally managed to remove all of these, so the issue is being closed.