lumaku / ctc-segmentation

Segment an audio file and obtain utterance alignments. (Python package)
Apache License 2.0

Question: Is text sentence segmentation required? #26

Closed: Slyne closed this issue 1 year ago

Slyne commented 1 year ago

Hi, I have really long audio files (1 hour) and a subtitle file with thousands of words. However, there's no punctuation in the subtitle file. I've read the NeMo implementation, and it requires the text/transcript to be separated by punctuation symbols such as '.'.

  1. I was wondering: can I just use the long subtitle file as it is? (The paper doesn't seem to mention this issue. Please correct me if I'm wrong.)
  2. Why should the text be split into sentences if we can't use a long subtitle file?

Thanks!

lumaku commented 1 year ago

Hey there! If you have a long text, then you probably want to split it into smaller utterances. How you partition your text into utterances is entirely up to you; using punctuation marks as separators is only one of many ways to partition the text. If you already have a subtitle file with timings, don't you already have a text that is split into utterances? Can you post an excerpt of your subtitle file (the first 100 characters or the first 10 lines)?

It is possible to run the full text through the algorithm and then use the token timings (see the output in the variable timings). However, CTC segmentation is usually employed to generate utterance timings, and these utterances are usually only one sentence long. Utterance alignments can be used for training a neural network, or the alignments can serve as subtitle timings. You would not want a very long text as a single subtitle, would you?
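
For reference, the typical call sequence with this package looks roughly like the sketch below. The names lpz, char_list, index_duration and the align wrapper are placeholders describing your ASR setup, not fixed by this package:

    import numpy as np
    from ctc_segmentation import (CtcSegmentationParameters, ctc_segmentation,
                                  determine_utterance_segments, prepare_text)

    def align(lpz: np.ndarray, char_list: list, text: list, index_duration: float):
        """Align a list of utterance strings against CTC log-posteriors.

        lpz:            (frames, vocab) CTC log-posterior matrix from the ASR model
        char_list:      the model's output token list
        text:           list of utterance strings, one per utterance
        index_duration: seconds of audio covered by one CTC output frame
        Returns one (start_s, end_s, confidence) tuple per utterance.
        """
        config = CtcSegmentationParameters()
        config.char_list = char_list
        config.index_duration = index_duration
        ground_truth_mat, utt_begin_indices = prepare_text(config, text)
        timings, char_probs, _ = ctc_segmentation(config, lpz, ground_truth_mat)
        return determine_utterance_segments(config, utt_begin_indices,
                                            char_probs, timings, text)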

The CTC segmentation algorithm itself should not have issues with longer audio files; I have tested it with audio files of up to 8 hours of speech. Usually the ASR model is the limiting factor rather than CTC segmentation; however, there are workarounds that split the audio into smaller parts. You could take a look at JTubeSpeech, which aligns YouTube subtitles. Hope that helps. Regards

Slyne commented 1 year ago

Really appreciate your explanation. Here is the subtitle excerpt:

1
00:00:17,940 --> 00:00:22,900
so you went on holiday. I went on a beach
holiday and I went jogging every single

2
00:00:22,900 --> 00:00:26,230
morning and this poor fisherman I just
saw him sitting there catching his one

3
00:00:26,230 --> 00:00:31,120
little fish and then he goes home to feed
his family day after day. I thought to myself I

4
00:00:31,120 --> 00:00:35,230
can enrich this guy's life, he's leading this simple simple life and I'm obviously

5
00:00:35,230 --> 00:00:40,030
used to all the luxuries and I suggested
why don't you catch a few fish every single

6
00:00:40,030 --> 00:00:48,190
day, why only one?  So the
fisherman I think replied, but why should

7
00:00:48,190 --> 00:00:52,570
I catch more than two fish? Well then
you can eat as much as your family

8
00:00:52,570 --> 00:00:56,559
desires or until their tummies are full and
then you can sell the other fish to make

9
00:00:56,559 --> 00:01:03,220
money.  But why?
Well if you buy twice as many fish, or

10
00:01:03,220 --> 00:01:07,030
catch twice as many fish you will have
money and you will enrich your family'

11
00:01:07,030 --> 00:01:16,840
lives! Okay but why would I want to do
that? Oh then you can make more money to

12
00:01:16,840 --> 00:01:24,189
buy an extra fishing rod.  Okay so if I have
an one extra fishing rod but why do I

13
00:01:24,189 --> 00:01:28,479
want the one extra fishing rods you can
catch more fish than God feed your

14
00:01:28,479 --> 00:01:33,270
family what if I sleep over you can sell
it make some money

15
00:01:33,270 --> 00:01:38,950
and what do I want to do with money
well if you have more money and

16
00:01:38,950 --> 00:01:43,180
eventually then you can buy a boat I
mean you thought about empty fishing

17
00:01:43,180 --> 00:01:48,009
rods and you can get people to work for
you and you can start a big business and

18
00:01:48,009 --> 00:01:53,469
you start any easily create and then the
rest yourself to make more money and

19
00:01:53,469 --> 00:02:00,759
then why then you can employ lots more
people you can borrow fleet of ships and

20
00:02:00,759 --> 00:02:05,170
you some guy up every day and catch fans
and fans in times of fish and make loads

21
00:02:05,170 --> 00:02:11,350
of money tons of money that sounds ok
but why would I want to do that well

The subtitles are continuous. Luckily, there are some punctuation marks that may help with splitting, but I've got some other subtitle files with no punctuation at all. Anyway, I can use a neural model to insert punctuation. I just want to make sure that splitting the text is not a must when using CTC segmentation.

As for ASR OOM issues, I can use a streaming model to compute and store the output logits, so that should be alleviated.
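
To sketch what I mean, with infer_fn as a hypothetical callable that maps raw audio samples to a (frames, vocab) log-probability matrix:

    import numpy as np

    def chunked_logprobs(samples: np.ndarray, chunk_len: int, infer_fn) -> np.ndarray:
        """Run inference chunk by chunk and concatenate the CTC log-posteriors,
        so a one-hour recording never has to fit into a single forward pass.
        A real setup would overlap chunks or carry streaming state to avoid
        artifacts at the chunk borders."""
        parts = [infer_fn(samples[start:start + chunk_len])
                 for start in range(0, len(samples), chunk_len)]
        return np.concatenate(parts, axis=0)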

Please feel free to close this question if my understanding above is correct.

lumaku commented 1 year ago

In this case, you don't need to split your text into utterances, because you already have the partitioning in the subtitle file: one subtitle is one utterance. For you, that would be (one utterance per line):

so you went on holiday. I went on a beach holiday and I went jogging every single
morning and this poor fisherman I just saw him sitting there catching his one
little fish and then he goes home to feed his family day after day. I thought to myself I
can enrich this guy's life, he's leading this simple simple life and I'm obviously
used to all the luxuries and I suggested why don't you catch a few fish every single
day, why only one?  So the fisherman I think replied, but why should
...
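
If it helps, here is a small sketch that produces exactly this one-utterance-per-subtitle list from an SRT file (assuming standard SRT formatting; the function name is made up):

    import re

    def srt_to_utterances(path: str) -> list:
        """Turn an SRT subtitle file into a list of utterance strings,
        one per subtitle block; index and timing lines are dropped."""
        with open(path, encoding="utf-8") as f:
            blocks = re.split(r"\n\s*\n", f.read().strip())
        utterances = []
        for block in blocks:
            text_lines = [line.strip() for line in block.splitlines()
                          if "-->" not in line and not line.strip().isdigit()]
            if text_lines:
                utterances.append(" ".join(text_lines))
        return utterances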

Using a streaming model is a good idea. I have no experience with the alignment accuracy of streaming models. After aligning, you should check whether the timings are correct. Have fun!

Slyne commented 1 year ago

Now I'm confused about the definition of "utterance". I thought there should be some pause/silence between utterances. In the subtitles above, the start time of the next subtitle is the same as the end time of the previous one, which seems to mean they belong to the same utterance.

I just want to use the whole subtitle text, without splitting, to align with the audio. Then, based on the timings and the punctuation, I can split the long audio files into segments.


lumaku commented 1 year ago

If there is a pause between two spoken sentences, they should be separated into two utterances. However, if the pause between them is very short, the two sentences don't need to be separated.

In that regard, CTC segmentation differs from other forced alignment algorithms: it can skip pauses of arbitrary length between utterances. However, it is not designed to skip longer pauses within an utterance. Because the utterance annotation within the label sequence is done during text preprocessing, you should split your text into utterances before you align the audio to the text.
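
To make that concrete, this is roughly how the partitioning enters the label sequence during text preprocessing in this package (a sketch; char_list stands for your model's vocabulary and the wrapper function is made up):

    from ctc_segmentation import CtcSegmentationParameters, prepare_text

    def build_label_sequence(char_list: list, utterances: list):
        """Encode the utterance partitioning into the ground truth matrix.
        Each element of utterances becomes one utterance in the label
        sequence; the alignment may skip audio of arbitrary length between
        consecutive utterances, but not inside one, which is why the split
        has to happen here, before the alignment is run."""
        config = CtcSegmentationParameters()
        config.char_list = char_list
        return prepare_text(config, utterances)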

I can recommend the ESPnet2 alignment example as a small toy setup, so that you can try out various configurations and different ways of splitting the text into utterances.

Slyne commented 1 year ago

I understand the process of doing alignment with this tool from the ESPnet2 example.

However, it is not designed to skip longer pauses within an utterance. Because the utterance annotation within the label sequence is done during text preprocessing, you should split your text into utterances before you align the audio to the text.
  1. I just couldn't understand why we should split the text if there are longer pauses. Is it that, because of the longer pauses, the tool would start to align the next sentences? (Only a guess.) I see from the paper that this tool is meant for utterance-wise text and audio, but I can't figure out why.

  2. If I split the text into sentences (sent1, sent2, ..., sentN) and align them with the long audio, will there be duplicate computation? For example, after aligning sent1 with the audio, is sent2 aligned from the beginning of the long audio again, or does alignment continue from where sent1 ended in the audio?

Sorry for so many questions ...

lumaku commented 1 year ago

If you have multiple utterances in the audio, then include all of them in one alignment run.

To get a better understanding of the algorithm, I can recommend the PyTorch forced alignment tutorial: https://pytorch.org/audio/main/tutorials/forced_alignment_tutorial.html This CTC segmentation package is very similar to the algorithm in that tutorial, but it adds a preprocessing step and a feature that allows it to skip pauses between utterances; in the paper, this was used to skip unrelated preambles. For example, look at the intermediate variables of the alignment steps (from the CTCSegmentationTask object) in ESPnet: try to omit a part of a sentence and add an utterance separator instead, and you can investigate how that changes the label sequence and the timings.

Slyne commented 1 year ago

If you have multiple utterances in the audio, then include all of them in one alignment run.

Thank you. I believe not splitting the text should be ok.

Try to omit a part of a sentence and add an utterance separator instead, and you can investigate how that changes the label sequence and the timings.

Got it!

The PyTorch forced alignment tutorial seems to just do a forward pass (taking the maximum instead of summing) and then use Viterbi backtracking. I'm not quite sure how this relates to my question of why segmenting the text is necessary... Maybe I just didn't explain the question clearly enough.

I did get an answer from others: the CTC segmentation tool is designed for downstream tasks such as ASR training, which uses utterances (you also mentioned this). The tool gives a confidence score to each utterance, and we can remove the utterances with low confidence scores. Therefore, if I don't split the text first and just align the long text as a whole, it would probably be hard to do that filtering and segmenting later. This answer is convincing.
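
For example, a hypothetical filtering step over the segments returned by determine_utterance_segments, with one (start_s, end_s, confidence) tuple per utterance; the threshold value is made up and would need tuning on real data:

    def filter_by_confidence(utterances: list, segments: list, min_confidence: float = -2.0):
        """Keep only utterances whose alignment confidence clears the threshold.
        segments holds (start_s, end_s, confidence) tuples, one per utterance;
        the default threshold is an assumption, tune it on your data."""
        return [(utt, seg) for utt, seg in zip(utterances, segments)
                if seg[2] > min_confidence]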

Really appreciate your help!