jabber-tools / cognitive-services-speech-sdk-rs


Small improvements to word boundary events #25

Open tuzz opened 2 weeks ago

tuzz commented 2 weeks ago

After digging into this issue, I found that text_offset is set to -1 (which wraps to u32::MAX) when the text doesn't exactly match a substring in the SSML. This also means we can't reliably extract the text by character indexes, so we call the C API to do that instead.
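To make the failure mode concrete: the C layer reports a missing offset as (uint32_t)-1, which arrives in Rust as u32::MAX. Here's a minimal sketch of the normalisation this implies (the helper name is made up for illustration, not part of the PR):

```rust
/// Hypothetical helper: the C API reports "no offset" as (uint32_t)-1,
/// which is u32::MAX once it crosses the FFI boundary into Rust.
fn normalize_text_offset(raw: u32) -> Option<u32> {
    if raw == u32::MAX { None } else { Some(raw) }
}

fn main() {
    assert_eq!(normalize_text_offset(3), Some(3));     // a real offset into the SSML
    assert_eq!(normalize_text_offset(u32::MAX), None); // -1 cast to unsigned
}
```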

adambezecny commented 2 weeks ago

hi Chris,

thank you for the pull request. Give me some time to review. I'll come back to you as soon as possible.

regards,

Adam

tuzz commented 2 weeks ago

No problem. Happy to answer any questions. If it's useful, I also tried to handle the case when text_offset=None in my application. Here's my parsing code, although it probably doesn't belong in the SDK.

timestamp.rs

```rust
use serde::Serialize;
use cognitive_services_speech_sdk_rs as azure;

#[derive(Serialize, Debug)]
pub struct Timestamp {
    start_index: u32,
    end_index: u32,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    start_time: u64,
    #[serde(serialize_with = "Timestamp::to_seconds_f64")]
    duration: u64,
    text: String,
    ssml: String,
}

impl Timestamp {
    pub fn parse_azure_word_boundary_events(events: &[azure::speech::SpeechSynthesisWordBoundaryEvent], ssml: &str) -> Vec<Timestamp> {
        let mut timestamps: Vec<Timestamp> = vec![];

        for (i, event) in events.iter().enumerate() {
            let azure::speech::SpeechSynthesisWordBoundaryEvent {
                text_offset, word_length, audio_offset, duration_ms, boundary_type, text, ..
            } = event;

            // If the event's text is "cat's" but the SSML contained "cat&apos;s" then the event's text_offset will be None.
            let (start_index, end_index) = match text_offset {
                Some(offset) => (*offset, offset + word_length),
                None => Self::next_word_indexes(&timestamps, ssml, events[i + 1..].iter().find_map(|e| e.text_offset)),
            };

            let timestamp = Timestamp {
                start_index,
                end_index,
                start_time: *audio_offset,
                duration: *duration_ms,
                text: text.to_string(),
                ssml: ssml.chars().skip(start_index as usize).take((end_index - start_index) as usize).collect(),
            };

            match boundary_type {
                azure::common::SpeechSynthesisBoundaryType::PunctuationBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::SentenceBoundary => Self::amend(timestamp, &mut timestamps),
                azure::common::SpeechSynthesisBoundaryType::WordBoundary => timestamps.push(timestamp),
            }
        }

        timestamps
    }

    // Scan the SSML for the trimmed sequence after the end of the previous word and before the start of the next word.
    fn next_word_indexes(timestamps: &[Timestamp], ssml: &str, start_of_next_word: Option<u32>) -> (u32, u32) {
        let end_of_prev_word = timestamps.last().map_or(0, |previous| previous.end_index);

        // NB: the search literal was garbled in the original comment; "</" (the next
        // closing tag) is an assumed reconstruction for the no-next-word fallback.
        let start_of_next_word = start_of_next_word
            .or_else(|| ssml.find("</").map(|i| i as u32))
            .unwrap_or_else(|| ssml.chars().count() as u32);

        let in_between_len = start_of_next_word - end_of_prev_word;
        let mut in_between = ssml.chars().enumerate().skip(end_of_prev_word as usize).take(in_between_len as usize);

        let start_of_this_word = in_between.find(|(_, c)| !c.is_whitespace()).map_or(end_of_prev_word, |(i, _)| i as u32);
        let end_of_this_word = in_between.filter(|(_, c)| !c.is_whitespace()).last().map_or(start_of_next_word, |(i, _)| i as u32 + 1);

        (start_of_this_word, end_of_this_word)
    }

    fn amend(timestamp: Timestamp, timestamps: &mut Vec<Timestamp>) {
        if let Some(previous) = timestamps.last_mut() {
            previous.end_index = timestamp.end_index;
            previous.duration += timestamp.duration;
            previous.text.push_str(&timestamp.text);
            previous.ssml.push_str(&timestamp.ssml);
        } else {
            timestamps.push(timestamp);
        }
    }

    // Keep start_time and duration as u64 to avoid floating point addition. The values are
    // 100-nanosecond ticks, so serialize to seconds (divide by 10^7) at the end.
    fn to_seconds_f64<S>(seconds: &u64, serializer: S) -> Result<S::Ok, S::Error> where S: serde::Serializer {
        serializer.serialize_f64(*seconds as f64 / 10_000_000.0)
    }
}
```
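For context, a hypothetical call site for the parser above; in a real program `events` would be collected from the synthesizer's word boundary callback while synthesising `ssml`:

```rust
// Hypothetical usage; `events` is empty here only to keep the sketch self-contained.
fn main() {
    let ssml = "<speak version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat&apos;s tail is rather long</voice></speak>";
    let events: Vec<azure::speech::SpeechSynthesisWordBoundaryEvent> = vec![];

    for timestamp in Timestamp::parse_azure_word_boundary_events(&events, ssml) {
        println!("{:?}", timestamp);
    }
}
```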
adambezecny commented 1 week ago

hi again,

sorry for the delayed responses, I am pretty busy with my current projects right now :( It would really help if you could prepare an example demonstrating this feature. Have a look into the examples/synthesizer folder of this project; it would be really great if we had a meaningful example for this.

I must admit it has been a long time since I actively worked with this library. I really need to see a concrete example demonstrating this to understand what we achieve by merging this feature.

thanks!

adambezecny commented 5 days ago

hi Chris,

weekend is here and I finally got some time to have a look at this. Please note this lib is a port of the Go library.

I made it work today, and with some minor tweaks of this example I was able to synthesize this string into a wav file: my cat's tail is rather long

result below:

Enter some text that you want to speak, or enter empty text to exit.
> my cat's tail is rather long
Synthesis started.

{handle:0x7fa3bc000bd0 AudioOffset:500000 Duration:200ms TextOffset:0 WordLength:2 Text:my BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:2625000 Duration:387.5ms TextOffset:3 WordLength:5 Text:cat's BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:6500000 Duration:425ms TextOffset:9 WordLength:4 Text:tail BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:11500000 Duration:225ms TextOffset:14 WordLength:2 Text:is BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:13750000 Duration:325ms TextOffset:17 WordLength:6 Text:rather BoundaryType:0}

{handle:0x7fa3bc000bd0 AudioOffset:17125000 Duration:437.5ms TextOffset:24 WordLength:4 Text:long BoundaryType:0}
Synthesizing, audio chunk size 37134.
Synthesizing, audio chunk size 32804.
Synthesizing, audio chunk size 5506.
Synthesizing, audio chunk size 2740.
Read [78000] bytes from audio data stream.
Enter some text that you want to speak, or enter empty text to exit.
> Synthesized, audio length 78046.
^Csignal: interrupt

Now back to the Rust stuff. I have extended the current examples, see here: https://github.com/jabber-tools/cognitive-services-speech-sdk-rs/commit/4802be0e0f09dde424b7cc0ca6b83af793fdfb52

The current version does not have Text and produces something like this (when running audio_data_stream::run_example().await;):

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>cargo run --example synthesizer                                                                                                                       
   Compiling cognitive-services-speech-sdk-rs v1.0.4 (C:\Users\adamb\dev\cognitive-services-speech-sdk-rs)
    Finished dev [unoptimized + debuginfo] target(s) in 1.17s                                                                                                                                            
     Running `target\debug\examples\synthesizer.exe`
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: 0, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: 3, word_length: 5, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8395b80, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: 9, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c83962d0, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: 14, word_length: 2, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: 17, word_length: 6, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x231c8396200, release_fn: 0x7ff60da59ba8, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: 24, word_length: 4, boundary_type: WordBoundary }
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:30:11Z INFO  synthesizer::audio_data_stream] example finished!

C:\Users\adamb\dev\cognitive-services-speech-sdk-rs>

Your version produces this:

[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:32:09Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a2474776b0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 500000, duration_ms: 2000000, text_offset: Some(0), word_length: 2, boundary_type: WordBoundary, text: "my" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 2625000, duration_ms: 3875000, text_offset: Some(3), word_length: 5, boundary_type: WordBoundary, text: "cat's" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a247476f20, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 6500000, duration_ms: 4250000, text_offset: Some(9), word_length: 4, boundary_type: WordBoundary, text: "tail" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d90820, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 11500000, duration_ms: 2250000, text_offset: Some(14), word_length: 2, boundary_type: WordBoundary, text: "is" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d904e0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 13750000, duration_ms: 3250000, text_offset: Some(17), word_length: 6, boundary_type: WordBoundary, text: "rather" }
[2024-10-05T19:32:10Z INFO  synthesizer::helpers] >set_synthesizer_word_boundary_cb SpeechSynthesisWordBoundaryEvent { handle: SmartHandle { inner: 0x2a248d908f0, release_fn: 0x7ff76828a2be, name: "SpeechSynthesisWordBoundaryEvent" }, audio_offset: 17125000, duration_ms: 4375000, text_offset: Some(24), word_length: 4, boundary_type: WordBoundary, text: "long" }
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:32:10Z INFO  synthesizer::audio_data_stream] example finished!

Having Text in the event is definitely beneficial (and consistent with the latest Go version, I like that), but somehow I cannot succeed with an SSML string. When I do something like this (i.e. use an SSML string and replace speak_text_async with speak_ssml_async):

```rust
use super::helpers;
use log::*;

/// demonstrates how to store synthesized data easily via the Audio Data Stream abstraction
#[allow(dead_code)]
pub async fn run_example() {
    info!("---------------------------------------------------");
    info!("running audio_data_stream example...");
    info!("---------------------------------------------------");

    //let text = "my cat's tail is rather long";
    let text = "<speak xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' version='1.0' xml:lang='en-US'><voice name='en-GB-George'>my cat&apos;s tail is rather long</voice></speak>";

    let (mut speech_synthesize, _) = helpers::speech_synthesizer();

    helpers::set_callbacks(&mut speech_synthesize);

    match speech_synthesize.speak_ssml_async(text).await {
        Err(err) => error!("speak_ssml_async error {:?}", err),
        Ok(result) => {
            info!("got result!");
            helpers::save_wav("c:/tmp/output2.wav", result).await;
        }
    }

    info!("example finished!");
}
```

I get an empty file and no events:

[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] running audio_data_stream example...
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] ---------------------------------------------------
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] got result!
[2024-10-05T19:39:33Z INFO  synthesizer::audio_data_stream] example finished!

Which brings me to my original question: could you prepare a simple example (ideally an analogue of synthesizer/speak_text_async.rs) that demonstrates the problem you are describing in your PR? thanks

tuzz commented 5 days ago

Hi @adambezecny, thanks for looking into this. Sorry I haven't been responsive as I'm on annual leave at the moment - I'll take a look properly as soon as I'm back.

To quickly summarise the problem: it happens when the SSML contains escape sequences like &apos;. The text associated with the event is correctly returned as ' which is great, but the text_offset field is incorrect. I think the Azure SDK is trying to convey that the timestamp doesn't relate to an exact substring of the SSML, and it signals that by returning -1 (which is cast to an unsigned int, hence it arrives as u32::MAX).
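A tiny self-contained sketch of the mismatch (the strings are illustrative only):

```rust
// Once the SDK has unescaped &apos; to an apostrophe, the reported word is no
// longer a literal substring of the SSML, so no valid offset exists.
fn main() {
    let ssml_fragment = "my cat&apos;s tail";
    let event_text = "cat's";

    assert_eq!(ssml_fragment.find(event_text), None); // no match in the raw SSML
    // ...which the C layer signals by returning -1, i.e. u32::MAX in Rust.
}
```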

I attempted to recover the correct offset (rather than u32::MAX) in the code snippet in my comment above. I wasn't sure whether to add this to the cognitive-services-speech-sdk-rs repository because it's custom code that I wrote and it probably isn't implemented in the other SDK wrappers (I haven't checked the Go one, but Python just returns -1).

I think my code correctly figures out the right text_offset and word_length for this edge case, but I haven't tested it extensively, so it might not be ready for inclusion yet if we do want to go down that route.

adambezecny commented 2 days ago

OK, please provide a working example where this issue is demonstrated. As stated above, I was not able to replicate it; I am probably just doing something wrong. Ideally add a new example alongside the existing examples. I would like to test it with the code in main and then in your branch and see the difference.

in general:

  • adding text into the event is great and I will definitely merge it
  • the other change: not sure yet, but probably not. I want to keep it consistent with the other SDKs