readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

Recognising a Sentence and Returning Start & End Times #299

Open digiphd opened 1 year ago

digiphd commented 1 year ago

Hi guys,

Great tool! I am wondering if you could give me some pointers on improving this function. It works, but I find that the timing information is off significantly.

I am running this on a MacBook Pro M1 and a Mac mini M1, and both show the same issue.

Here is my problem:

Say I have a Python list of 3 sentences that I know exist in the audio file. Here is an example list item:

well start with planet arrakis the primary setting for the dune series arrakis is known for its desertlike terrain which is full of dangerous creatures well talk about some of the more dangerous and fascinating creatures including wormlike sandworms and the mysterious tleilaxu face dancers

I strip out all punctuation (commas, periods, apostrophes, hyphens) so that each item is a plain string of words.
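For reference, the cleanup step could look roughly like the sketch below. The helper name `normalize` is made up for illustration, and it assumes that lowercasing and collapsing whitespace are also acceptable, which matches the example item above.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    # Characters to delete: ASCII punctuation plus typographic apostrophes.
    drop = string.punctuation + "\u2018\u2019"
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return re.sub(r"\s+", " ", cleaned).strip()


print(normalize("We'll start with planet Arrakis, the primary setting."))
```

Running the same helper over both the phrases and the fragment text before comparing them keeps the two sides consistent.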

I then iterate over them and use the Levenshtein distance to find the sync map leaf fragments that match each phrase, like this:


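    import re
    import subprocess

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task
    from Levenshtein import distance  # assumed: distance() from the python-Levenshtein package

    # input_video, the absolute_path_* variables, interesting_points and
    # segments are not shown in this snippet.
    tmp_audio_file = './instance/media/tmp_audio.wav'
    sync_map = './instance/syncmap.srt'

    # extract the audio track from the video as 16-bit PCM WAV
    subprocess.run(["ffmpeg", "-y", "-i", input_video, "-vn", "-acodec", "pcm_s16le", tmp_audio_file])

    # create a Task object
    config_string = "task_language=eng|is_text_type=plain|os_task_file_format=srt"
    task = Task(config_string=config_string)
    task.audio_file_path_absolute = absolute_path_audio.encode('utf-8')
    task.text_file_path_absolute = absolute_path_script.encode('utf-8')
    task.sync_map_file_path_absolute = absolute_path_syncmap.encode('utf-8')
    task.sync_map_file_path = sync_map.encode('utf-8')

    # run the alignment
    ExecuteTask(task).execute()

    for phrase in interesting_points:
        # strip punctuation so the phrase is comparable to the fragment text
        phrase = re.sub(r'[,\.\'\-\’]', '', phrase)

        for fragment in task.sync_map_leaves():
            # skip leaves that carry no text
            if fragment.text.lower():
                # treat an edit distance of at most 40 as a match
                if distance(phrase.lower(), fragment.text.lower()) <= 40:
                    start_time = fragment.begin
                    end_time = fragment.end

                    # note: this also skips a fragment that begins at 0
                    if start_time:
                        segments.append([start_time, end_time])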

It gives an output with three arrays of start times and finish times:

    [[TimeValue('12.520'), TimeValue('37.040')], [TimeValue('59.920'), TimeValue('82.760')], [TimeValue('82.760'), TimeValue('82.760')]]

This looks roughly as I would expect, but on closer inspection the timing is off by a significant amount, and the last matched item has zero length, pinned to the very end of the audio file (82.760 to 82.760), even though it was found.
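To make the zero-length case easier to spot, something like the sketch below could separate usable segments from suspicious ones. The helper name `split_by_duration` and the `min_duration` default are made up for illustration, and `Decimal` stands in for `TimeValue` here on the assumption that `TimeValue` supports the same subtraction and comparison.

```python
from decimal import Decimal


def split_by_duration(segments, min_duration=Decimal("0.001")):
    """Split [begin, end] pairs into usable and (near) zero-length ones."""
    usable, suspicious = [], []
    for begin, end in segments:
        target = usable if end - begin >= min_duration else suspicious
        target.append([begin, end])
    return usable, suspicious


# Values copied from the output above, with Decimal standing in for TimeValue.
segments = [
    [Decimal("12.520"), Decimal("37.040")],
    [Decimal("59.920"), Decimal("82.760")],
    [Decimal("82.760"), Decimal("82.760")],
]
usable, suspicious = split_by_duration(segments)
print(len(usable), "usable,", len(suspicious), "suspicious")
```

With the values above, the first two segments come back as usable and the last one is flagged.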

So perhaps it is more a matter of how I am configuring the task.

Do you have any ideas? Or perhaps there is a better way to approach this?
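One alternative I have been considering for the matching step: a fixed threshold of 40 accepts any pair of strings that are both 40 characters or shorter, so a relative threshold combined with "closest fragment wins" might be more robust. Below is a rough, self-contained sketch in plain Python. The names `levenshtein` and `best_match` and the `max_ratio` default of 0.2 are my own choices for illustration, not part of aeneas or any library.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def best_match(phrase, candidates, max_ratio=0.2):
    """Return (index, ratio) of the closest candidate, or None.

    ratio is the edit distance divided by the length of the longer string,
    so 0.0 means identical and 1.0 means nothing in common.
    """
    best = None
    for index, text in enumerate(candidates):
        longest = max(len(phrase), len(text))
        if longest == 0:
            continue  # nothing to compare
        ratio = levenshtein(phrase, text) / longest
        if best is None or ratio < best[1]:
            best = (index, ratio)
    if best is not None and best[1] <= max_ratio:
        return best
    return None
```

In my case `candidates` would be the lowercased text of each sync map leaf, and the returned index would select the fragment whose begin and end times I keep, so each phrase yields at most one segment.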

I should also mention that the spoken audio is generated by text-to-speech (Amazon Polly) from a script in a .txt file.