p-groarke / wsay

Windows "say"
BSD 3-Clause "New" or "Revised" License

wsay ignoring some local Slovak characters #2

Closed sovcik closed 4 years ago

sovcik commented 4 years ago

Hi, I recently found your nice tool and tried to use it for my mini project. It works in general, but it ignores some characters. E.g. the character "á", which should sound like the "aa" in the word "naan", is completely ignored. I tried it via the command line and also using your GUI tool.

I assume it might be a character encoding issue, but I couldn't figure it out. When I used MS Word to read the text aloud, this character was pronounced properly.

I even tried to pass it as speech XML, but the language tags were ignored completely.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">
  <s xml:lang="en-US">Naan</s>
  <s xml:lang="sk-SK">Nán</s>
</speak>
p-groarke commented 4 years ago

Hi sovcik, glad you find it useful.

For now, the command line doesn't support unicode characters. It is the next "big thing" I have to fix. In v1.3, I added unicode character support for text files.

Are you using text from a text file or typing it directly in the command line?

edit: Also, are you using a Slovak voice to read the text?

sovcik commented 4 years ago

Oh, I didn't notice that. Only now do I see that comment in your source code :-) https://github.com/p-groarke/wsay/blob/master/src_cmd/main.cpp#L21 I tried both the command line and a UTF-8 encoded text file. Attaching the file I used for reference.

Jozef.

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="sk-SK">

This is in English: naan
</s>
<s xml:lang="sk-SK">
    <voice name="Filip">
        Toto je po Slovensky: nán
    </voice>
</s>

sovcik commented 4 years ago

yes, I was using -v 12, which is voice Filip in Windows 10

sovcik commented 4 years ago

btw. what environment/stack are you using to develop this project? I was considering helping you with that UTF8, but this does not look like a Visual Studio project :-) I have quite extensive C++ experience, but for microcontrollers, so I was curious what you use for Windows.

p-groarke commented 4 years ago

Haha you found the comment :)

I think I should be able to support the unicode space from utf pretty easily without having to rewrite my argument parsing lib (which I really don't want to do). What is strange is it should work with text file input, so I'll investigate what is going on there.

Thanks for offering help, I use cmake to generate the VS solution stuff. I wouldn't waste any time on this though, I'll fix it soon enough.

I'll let you know when I have a tentative fix with a build.

Cheers

sovcik commented 4 years ago

Thanks!

Found this: https://github.com/huangqinjin/wmain From what I read, it should just "decode" the UTF8 command line and pass it to the original main.c as wstring, so no arg parsing changes are needed :-) Maybe it will be of some help.

p-groarke commented 4 years ago

Yep, I was looking at that as well 😊 If I use that, I have to adapt all the argument parsing to template out char and wchar_t.

Another option I’ve used in the past is setting a UTF8 mode in the command tool. Since utf8 is represented as multiple chars, I could also not change anything and just use a utf8 parsing lib or function to convert that to proper Unicode wchar_t.

TBD what I do. I’m guessing there is another issue somewhere since the text file isn’t working either.
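
For readers following along, the last option above (leave `char` argument parsing as-is and convert the UTF-8 bytes afterwards) can be sketched roughly as below. This is a minimal illustration, not wsay's actual code: `utf8_to_utf32` is a hypothetical helper name, and it decodes to UTF-32 code points rather than Windows' UTF-16 `wchar_t` to stay portable. It checks lead and continuation bytes but, for brevity, does not reject overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from wsay): decode UTF-8 bytes into code points.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::runtime_error("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes 6 more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this, the three bytes of "Nán" encoded as UTF-8 (`N`, `0xC3 0xA1`, `n`) decode to three code points, the middle one being U+00E1. On Windows the result would still need converting to UTF-16 before being handed to a wide-character API.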

p-groarke commented 4 years ago

Alright @sovcik , I have a test build for you if you wish. No pressure of course :)

This is a "quick fix" so you can at least continue using the tool.

  • File input should now truly support utf8. It also tries to read the bom and supports utf16 and utf32 (big and little endian).
  • The command line sentence character-set has been widened, but it isn't full utf8 yet. It uses the current code page to parse the text. This does support basic accents like áàéè etc.
  • Interactive mode wants none of it and is pure ASCII because std::wcin reasons ;)

I hope this is enough in the short term for your use case. Full support of utf8 will come, but will take more time as I have to refactor a lot of things.

Let me know if you have any issues with the build and thank you for taking the time to report this!

wsay_utf8_beta.zip
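
The BOM handling mentioned in the first bullet above could look something like the following. This is only a sketch under assumptions: `Encoding`, `BomResult` and `detect_bom` are made-up names, not wsay's API, and falling back to UTF-8 when no BOM is present is an assumption rather than something the thread states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical types and names, for illustration only.
enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE };

struct BomResult {
    Encoding encoding;
    std::size_t bom_size; // number of leading bytes to skip (0 if no BOM)
};

BomResult detect_bom(const std::string& bytes) {
    auto starts_with = [&bytes](const char* sig, std::size_t n) {
        return bytes.size() >= n && bytes.compare(0, n, sig, n) == 0;
    };
    // UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    if (starts_with("\xFF\xFE\x00\x00", 4)) return {Encoding::Utf32LE, 4};
    if (starts_with("\x00\x00\xFE\xFF", 4)) return {Encoding::Utf32BE, 4};
    if (starts_with("\xEF\xBB\xBF", 3))     return {Encoding::Utf8, 3};
    if (starts_with("\xFF\xFE", 2))         return {Encoding::Utf16LE, 2};
    if (starts_with("\xFE\xFF", 2))         return {Encoding::Utf16BE, 2};
    // Assumption: no BOM means the file is treated as UTF-8.
    return {Encoding::Utf8, 0};
}
```

One known ambiguity of this approach: a UTF-16 LE file whose first character after the BOM is U+0000 is indistinguishable from UTF-32 LE, which is rarely a problem for text files.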

sovcik commented 4 years ago

Wow! Thanks! Works much better now. It is absolutely enough for what I need.

I tested it a bit and you are right: some accents work, while others do not. Examples of ones that don't work: ľ, ť, ň.

But as said earlier - works for me for now! Thanks a lot!
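
One plausible reading of the split between working and non-working accents, not confirmed anywhere in this thread, is that the test build parsed the command line with a Western single-byte code page. Such code pages (Latin-1, and Windows-1252 for the U+00A0 to U+00FF range) only reach code points up to U+00FF. The sketch below uses a hypothetical `fits_latin1` helper to show which of the characters mentioned fall inside that range.

```cpp
#include <cassert>

// Hypothetical helper: true if the code point is representable in Latin-1
// (ISO-8859-1), i.e. it fits in a single byte.
// Assumption: the active code page was a Western one; the thread does not say.
constexpr bool fits_latin1(char32_t cp) {
    return cp <= 0xFF;
}

// á (U+00E1) and é (U+00E9) fit; ľ (U+013E), ť (U+0165), ň (U+0148) do not.
static_assert(fits_latin1(0x00E1), "a-acute is in Latin-1");
static_assert(fits_latin1(0x00E9), "e-acute is in Latin-1");
static_assert(!fits_latin1(0x013E), "l-caron is outside Latin-1");
static_assert(!fits_latin1(0x0165), "t-caron is outside Latin-1");
static_assert(!fits_latin1(0x0148), "n-caron is outside Latin-1");
```

If that reading is right, a Central European code page would cover ľ, ť and ň but lose à and è instead, which is why decoding the input as Unicode is the more robust route.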

p-groarke commented 4 years ago

:) I'll use those characters as a unit test for the full utf8 support I'm working on. I'll keep this ticket open until that is ready. cheers

sovcik commented 4 years ago

Thanks

p-groarke commented 4 years ago

I released 1.4 with much better utf character support. Let me know if you find any issues!

Funny side effect, text input supports emojis lol.

sovcik commented 4 years ago

@p-groarke Works like a charm! Thanks!