octimot / StoryToolkitAI

An editing tool that uses AI to transcribe, understand content and search for anything in your footage, integrated with ChatGPT and other AI models
GNU General Public License v3.0
721 stars 60 forks

EDIT and Translate #7

Closed BaGRoS closed 10 months ago

BaGRoS commented 2 years ago

Hi

Once speech has been transcribed into text, it should be possible to edit the text directly in the window where it is displayed. There should then be two buttons at the bottom of the window: SAVE and TRANSLATE. Pressing the first saves the edited subtitles; pressing the second has Whisper translate them and opens the result in the window for further editing, again with a SAVE button. Saving could also be automatic, but that would cause unnecessary disk writes, particularly on SSDs.

BaGRoS

octimot commented 2 years ago

I'm currently working on the transcription editing feature, but we'll roll it out in multiple steps:

Unfortunately, the translation function cannot work the way you proposed out of the box, because Whisper translates from audio data (speech-to-text), not from text. I'm not aware of any text-to-text models that are as good as Whisper at translating the many languages you can throw at it, so I'd keep the translation process as it is for now. One route to consider is fine-tuning or training Whisper with your own text corrections, but I'm not sure how feasible that is for the average user.

BaGRoS commented 2 years ago

Take a quick look: https://github.com/openai/whisper/discussions/378#discussioncomment-3930225

octimot commented 2 years ago

Isn't that saying exactly what I was saying before?

BaGRoS commented 2 years ago

Yes, and text-to-speech should be the easier part, even with Google's or Microsoft's (Windows) voices. Translation from those generated files should be perfect: pure voice, no noise...

e.g. https://github.com/bryan-brancotte/subtitle_to_speech

octimot commented 2 years ago

Sure! But I don't think there's anything to gain from doing that, since:

  1. It would take about 2x the current processing time, and probably more, because the synthesis has to be done over the Internet (transcribe -> text-to-speech synthesis -> re-transcribe)
  2. Whisper is pretty good at detecting very low-quality speech
  3. The models you proposed for synthesis are neither free nor local (as far as I know)
  4. It would complicate the code, which might have unforeseen implications for future updates (speaker recognition etc.)

BaGRoS commented 2 years ago

Re 1. Sometimes this way can be quicker.
Re 2. Yes, but if it can make mistakes in the source language, it will certainly make mistakes in the translation too, so the corrections have to be done twice.
Re 3. Could be. Narrator inside Windows is an option, but more investigation is needed.
Re 4. For now, agreed.

BaGRoS commented 2 years ago

https://github.com/openai/whisper/discussions/378#discussioncomment-3934445

octimot commented 2 years ago

Another issue is the fact that Whisper splits the phrase segments based on its internal algorithm, so translating phrases while respecting the timings of the segments would be difficult.

For example:

Elena. how nice to see you. what a wonderful surprise to meet you here. you're looking 
wonderful. thank you. you're looking well too. this is my good friend Dr. Heywood Floyd. 

As you can see, the phrase "you're looking wonderful." was split between two segments, which means that we'd either need to detect the split in order to merge the phrase before translation, or risk inaccurate translations. And if we do manage to merge the phrase, how would the segment look in the translation?
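
To make the merge option concrete, here is a minimal sketch of how a split phrase could be detected and merged before translation. It is not code from StoryToolkitAI: the function names are hypothetical, the timings in the example are made up, and the only thing borrowed from Whisper is the segment shape (dicts with `start`, `end` and `text` keys). The check for a finished sentence is deliberately naive, so a segment that ends on an abbreviation such as "Dr." would be treated as complete.

```python
def ends_sentence(text):
    """Naive check: does the text end with sentence-final punctuation?"""
    # Ignore trailing closing quotes/brackets before looking at the last character.
    return text.rstrip("\"')]").endswith((".", "!", "?", "…"))


def merge_split_phrases(segments):
    """Merge consecutive segments until each merged text ends a sentence.

    Each segment is a dict with "start", "end" (seconds) and "text" keys.
    A merged segment keeps the start of its first part and the end of its last.
    """
    merged = []
    current = None
    for seg in segments:
        text = seg["text"].strip()
        if not text:
            continue  # nothing to merge from an empty segment
        if current is None:
            current = {"start": seg["start"], "end": seg["end"], "text": text}
        else:
            current["end"] = seg["end"]
            current["text"] = current["text"] + " " + text
        if ends_sentence(current["text"]):
            merged.append(current)
            current = None
    if current is not None:
        merged.append(current)  # keep a trailing unfinished phrase as-is
    return merged


# The two segments from the example above; the timings are assumed for illustration.
example = [
    {"start": 0.0, "end": 6.0,
     "text": "Elena. how nice to see you. what a wonderful surprise to meet you here. you're looking"},
    {"start": 6.0, "end": 11.0,
     "text": "wonderful. thank you. you're looking well too. this is my good friend Dr. Heywood Floyd."},
]
merged = merge_split_phrases(example)
for seg in merged:
    print(seg["start"], seg["end"], seg["text"])
```

With this approach the two segments collapse into a single, longer segment that spans both original timings, so the translation would be displayed as one longer subtitle unless it is split again afterwards.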

BaGRoS commented 2 years ago

Of course. I'm thinking of translating the Polish subtitles into English only after editing in the original language is finished, i.e. once everything has already been put together into decent Polish subtitles.