tuzz opened 2 weeks ago
hi Chris,
thank you for the pull request. Give me some time to review. I'll come back to you as soon as possible.
regards,
Adam
No problem, happy to answer any questions. If it's useful, I also tried to handle the text_offset=None case
in my application. Here's my parsing code, although it probably doesn't belong in the SDK.
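(For reference, a hypothetical sketch of the approach, not the exact snippet from my application: decode the SSML's XML entities while remembering, for every decoded character, the byte offset it started at in the raw SSML, then search the decoded text for the event's word. Function names here are illustrative.)

```rust
/// Decode a minimal set of XML entities, pairing each decoded char with
/// the byte offset it started at in the raw SSML.
fn decode_with_offsets(ssml: &str) -> Vec<(char, usize)> {
    const ENTITIES: [(&str, char); 5] = [
        ("&apos;", '\''),
        ("&quot;", '"'),
        ("&amp;", '&'),
        ("&lt;", '<'),
        ("&gt;", '>'),
    ];
    let mut out = Vec::new();
    let mut i = 0;
    while i < ssml.len() {
        let rest = &ssml[i..];
        if let Some(&(ent, ch)) = ENTITIES.iter().find(|&&(e, _)| rest.starts_with(e)) {
            // An entity: emit its decoded char, but record where the raw
            // entity began.
            out.push((ch, i));
            i += ent.len();
        } else {
            let ch = rest.chars().next().unwrap();
            out.push((ch, i));
            i += ch.len_utf8();
        }
    }
    out
}

/// Return the raw-SSML byte offset of the first occurrence of `word`,
/// where `word` is the entity-decoded text reported by the event.
fn find_raw_offset(ssml: &str, word: &str) -> Option<usize> {
    let decoded = decode_with_offsets(ssml);
    let text: String = decoded.iter().map(|(c, _)| *c).collect();
    text.find(word).map(|byte_pos| {
        // Convert the byte position in the decoded text to a char index,
        // then map it back to the raw offset we recorded for that char.
        let char_idx = text[..byte_pos].chars().count();
        decoded[char_idx].1
    })
}
```

For example, given the SSML fragment `<voice>my cat&apos;s tail</voice>` and the decoded event text "cat's", this recovers offset 10 (the `c` of `cat&apos;s` in the raw string) rather than u32::MAX.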
hi again,
sorry for the delayed responses, I am pretty busy with my current projects right now :( It would really help if you could prepare an example demonstrating this feature. Have a look at the examples/synthesizer folder of this project; it would be really great if we had a meaningful example for this.
I must admit it has been a long time since I actively worked with this library. I really need to see a concrete example demonstrating this to understand what we achieve by merging this feature.
thanks!
hi Chris,
weekend is here and I finally got some time to have a look at this. Please note this lib is a port of the Go library.
I made it work today, and with some minor tweaks of this example I was able to synthesize this string into a wav file: my cat's tail is rather long
result below:
Enter some text that you want to speak, or enter empty text to exit.
> my cat's tail is rather long
Synthesis started.
{handle:0x7fa3bc000bd0 AudioOffset:500000 Duration:200ms TextOffset:0 WordLength:2 Text:my BoundaryType:0}
{handle:0x7fa3bc000bd0 AudioOffset:2625000 Duration:387.5ms TextOffset:3 WordLength:5 Text:cat's BoundaryType:0}
{handle:0x7fa3bc000bd0 AudioOffset:6500000 Duration:425ms TextOffset:9 WordLength:4 Text:tail BoundaryType:0}
{handle:0x7fa3bc000bd0 AudioOffset:11500000 Duration:225ms TextOffset:14 WordLength:2 Text:is BoundaryType:0}
{handle:0x7fa3bc000bd0 AudioOffset:13750000 Duration:325ms TextOffset:17 WordLength:6 Text:rather BoundaryType:0}
{handle:0x7fa3bc000bd0 AudioOffset:17125000 Duration:437.5ms TextOffset:24 WordLength:4 Text:long BoundaryType:0}
Synthesizing, audio chunk size 37134.
Synthesizing, audio chunk size 32804.
Synthesizing, audio chunk size 5506.
Synthesizing, audio chunk size 2740.
Read [78000] bytes from audio data stream.
Enter some text that you want to speak, or enter empty text to exit.
> Synthesized, audio length 78046.
^Csignal: interrupt
now back to the Rust stuff. I have extended the current examples, see here: https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/commit/4802be0e0f09dde424b7cc0ca6b83af793fdfb52
the current version does not have Text and produces something like this (when running audio_data_stream::run_example().await;):
C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>cargo run --example synthesizer
Compiling cognitive-services-speech-sdk-rs v1.0.4 (C:\Users\adamb\dev\cognitive-services-speech-sdk-rs)
Finished dev [unoptimized + debuginfo] target(s) in 1.17s
Running `target\debug\examples\synthesizer.exe`
[2024-10-05T19:30:11Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:30:11Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: 0, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: 3, word_length: 5, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8395b80, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: 9, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c83962d0, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: 14, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: 17, word_length: 6, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: 24, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO synthesizer::audio_data_stream] got result!
[2024-10-05T19:30:11Z INFO synthesizer::audio_data_stream] example finished!
C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>
Your version produces this:
[2024-10-05T19:32:09Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:09Z INFO synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:32:09Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a2474776b0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: Some(0), word_length: 2, boundary_type: WordBoundary, text: "my" }
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: Some(3), word_length: 5, boundary_type: WordBoundary, text: "cat's" }
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: Some(9), word_length: 4, boundary_type: WordBoundary, text: "tail" }
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d90820, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: Some(14), word_length: 2, boundary_type: WordBoundary, text: "is" }
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d904e0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: Some(17), word_length: 6, boundary_type: WordBoundary, text: "rather" }
[2024-10-05T19:32:10Z INFO synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d908f0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: Some(24), word_length: 4, boundary_type: WordBoundary, text: "long" }
[2024-10-05T19:32:10Z INFO synthesizer::audio_data_stream] got result!
[2024-10-05T19:32:10Z INFO synthesizer::audio_data_stream] example finished!
having Text in the event is definitely beneficial (and consistent with the latest Go version, I like that), but somehow I cannot get it to work with an SSML string. When I do something like this (i.e. use an SSML string and replace speak_text_async with speak_ssml_async):
use super::helpers;
use log::*;

/// demonstrates how to store synthesized data easily via the Audio Data Stream abstraction
#[allow(dead_code)]
pub async fn run_example() {
    info!("---------------------------------------------------");
    info!("running audio_data_stream example...");
    info!("---------------------------------------------------");

    //let text = "my cat's tail is rather long";
    let text = "<speak xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat's tail is rather long</voice></speak>";

    let (mut speech_synthesize, _) = helpers::speech_synthesizer();
    helpers::set_callbacks(&mut speech_synthesize);

    match speech_synthesize.speak_ssml_async(text).await {
        Err(err) => error!("speak_ssml_async error {:?}", err),
        Ok(result) => {
            info!("got result!");
            helpers::save_wav("c:/tmp/output2.wav", result).await;
        }
    }

    info!("example finished!");
}
I get an empty file and no events:
[2024-10-05T19:39:33Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:39:33Z INFO synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO synthesizer::audio_data_stream] got result!
[2024-10-05T19:39:33Z INFO synthesizer::audio_data_stream] example finished!
which brings me to my original question: could you prepare a simple example (ideally an analogy of synthesizer/speak_text_async.rs) demonstrating the problem you describe in your PR? thanks
Hi @adambezecny, thanks for looking into this. Sorry I haven't been responsive as I'm on annual leave at the moment - I'll take a look properly as soon as I'm back.
To quickly summarise the problem: it happens when the SSML contains escape sequences like &apos;. The text associated with the event is correctly returned as ' which is great, but the text_offset field is incorrect. I think the Azure SDK is trying to convey that the timestamp doesn't relate to an exact substring of the SSML, and it signals that by returning -1 (which is cast to an unsigned int, hence it is set to u32::MAX).
I attempted to recover the correct offset (rather than u32::MAX) in the code snippet in my comment above. I wasn't sure whether to add this to the cognitive-services-speech-sdk-rs repository because it's custom code that I wrote that probably isn't implemented in the other SDK wrappers (I haven't checked the Go one, but Python just returns -1).
I think my code correctly figures out the right text_offset and word_length for this edge case, but I haven't tested it extensively, so it might not be ready for inclusion yet if we did want to go down that route.
ok, please provide a working example where this issue is demonstrated. As stated above, I was not able to replicate it; I'm probably just doing something wrong. Ideally add a new example to the existing examples. I would like to test it with the code in main, then in your branch, and see the difference.
in general:
After digging into this issue I found that text_offset is set to -1 (u32::MAX) when the text doesn't exactly match a substring in the SSML. This also means we can't reliably extract the text by character indexes, so we call the C API to do that instead.
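A tiny illustration of why character indexes into the raw SSML can't be trusted once entities are involved (hypothetical values mirroring the cat's example above): the event text is entity-decoded ("cat's", 5 chars), but at that position the raw SSML contains "cat&apos;s", so a naive slice grabs the wrong substring.

```rust
/// Naively slice the raw SSML by the event's reported offset and length.
/// With entity-decoded event text this returns garbage, e.g. "cat&a"
/// instead of "cat's".
fn naive_slice(ssml: &str, offset: usize, len: usize) -> &str {
    &ssml[offset..offset + len]
}
```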