readbeyond / aeneas

aeneas is a Python/C library and a set of tools to automagically synchronize audio and text (aka forced alignment)
http://www.readbeyond.it/aeneas/
GNU Affero General Public License v3.0

How to reduce memory usage of aeneas #224

Closed mustafaxfe closed 4 years ago

mustafaxfe commented 5 years ago

I have been trying to create a dataset for my speech recognition project. I started by creating text files in the aeneas format and cleaning them of special characters. But when I execute the Task object, it crashes in Google Colab. I also tried it on my local Ubuntu installation, and it uses more than 3 GB of RAM (after some time it was using 15 GB). Is it possible to reduce the memory usage of aeneas? My Python version is 3.6 and I am working on Google Colab. My Task object and its execution:

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# create a Task object: Turkish audio, plain-text input, JSON sync map output
config_string = u"task_language=tur|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = "Nutuk_sesli.mp3"
task.text_file_path_absolute = "nutuk_aeneas_data_all.txt"
task.sync_map_file_path_absolute = "syncmap.json"

# process the Task (this is where the alignment runs)
ExecuteTask(task).execute()
# write the sync map to file
task.output_sync_map_file()

My audio file is approximately 700 MB, and my aeneas text file is here: https://gist.github.com/mustafaxfe/a59485497bda74c5dbb4406f0c4a3f5c

Thanks

readbeyond commented 5 years ago

Hi,

what matters is the audio duration (in seconds), rather than the size of the file; but judging from the length of your text, the audio must be pretty long.

aeneas uses a Sakoe-Chiba banded DTW algorithm, whose RAM usage grows in proportion to the number of MFCC frames in the audio file multiplied by the width of the DTW band. See: https://github.com/readbeyond/aeneas/blob/master/wiki/HOWITWORKS.md and the docs: https://www.readbeyond.it/aeneas/docs/
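As a rough back-of-the-envelope estimate (a sketch only; the 0.040 s MFCC window shift and 60 s DTW margin are the documented defaults, and the 4-byte float size for the cost matrix is my assumption), you can see why a very long file exhausts RAM:

# Rough DTW cost-matrix size estimate (sketch; assumes aeneas defaults of
# mfcc_window_shift = 0.040 s and dtw_margin = 60 s, plus 4-byte floats).
audio_hours = 12                          # e.g. a ~700 MB MP3 at 128 kbps
frames = audio_hours * 3600 / 0.040      # ~1,080,000 MFCC frames
band = 2 * 60 / 0.040                    # ~3,000-frame-wide DTW band
ram_gb = frames * band * 4 / (1024 ** 3)
print("~%.0f GB" % ram_gb)               # ~12 GB for the cost matrix alone

Numbers in that ballpark are roughly in line with the ~15 GB you observed, and halving the audio length halves the matrix.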

In your case, I would suggest breaking the text and audio into segments of 1 hour (it should not take that long) and running aeneas on each segment separately, as in the sketch below. It should be easy to piece the timings back together in sequence; there is a tool in the aeneas package for that.
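For example, a minimal sketch of that workflow (assuming you split the audio with ffmpeg, prepare one matching text file per segment, and merge the JSON sync maps by hand; the file names and segment count are placeholders, and this is not the built-in aeneas tool mentioned above):

import json
import subprocess

from aeneas.executetask import ExecuteTask
from aeneas.task import Task

SEGMENT_S = 3600        # 1-hour segments
num_segments = 12       # placeholder: however many 1-hour chunks your audio has
fragments = []
for i in range(num_segments):
    audio_chunk = "chunk_%03d.mp3" % i
    # cut one hour out of the big file with ffmpeg (stream copy, no re-encode);
    # note: naive time-based cuts can split mid-sentence, so in practice
    # pick split points at silences
    subprocess.run(["ffmpeg", "-y", "-ss", str(i * SEGMENT_S), "-t", str(SEGMENT_S),
                    "-i", "Nutuk_sesli.mp3", "-c", "copy", audio_chunk], check=True)

    # align this chunk against its slice of the text
    task = Task(config_string=u"task_language=tur|is_text_type=plain|os_task_file_format=json")
    task.audio_file_path_absolute = audio_chunk
    task.text_file_path_absolute = "text_%03d.txt" % i
    task.sync_map_file_path_absolute = "syncmap_%03d.json" % i
    ExecuteTask(task).execute()
    task.output_sync_map_file()

    # shift the chunk's timings by its offset and collect the fragments
    with open("syncmap_%03d.json" % i) as f:
        for frag in json.load(f)["fragments"]:
            frag["begin"] = "%.3f" % (float(frag["begin"]) + i * SEGMENT_S)
            frag["end"] = "%.3f" % (float(frag["end"]) + i * SEGMENT_S)
            fragments.append(frag)

# write the merged sync map
with open("syncmap.json", "w") as f:
    json.dump({"fragments": fragments}, f, ensure_ascii=False, indent=2)

Each 1-hour chunk keeps the DTW cost matrix small, so peak memory stays roughly constant regardless of the total audio length.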

Best regards,

Alberto Pettarin
