lukaszliniewicz / Pandrator

Turn PDFs and EPUBs into audiobooks, subtitles or videos into dubbed videos (including translation), and more. For free. Pandrator uses local models, notably XTTS, including voice-cloning (instant, RVC-enhanced, XTTS fine-tuning) and LLM processing. It aspires to be a user-friendly app with a GUI, an installer and all-in-one packages.
GNU Affero General Public License v3.0

Output TTS audio out of sync compared to the original video and .srt file #54

Closed luckyluca closed 1 month ago

luckyluca commented 1 month ago

Overall it works well, thank you for Pandrator!

I have a video with .srt subtitles and am using Pandrator (XTTS) to dub it, except that the audio gradually goes out of sync with the subtitle track. It happens on long as well as short videos: right after the second or third sentence, you can hear the audio coming in sooner and drifting further out of sync. Then there is a very long pause lasting minutes, and then the audio starts again in sync.

Screenshot 2024-10-18 221758 pandrator_20241018_191636.log

What are the recommended tips to keep the audio true to the .srt lines and timing? Is there a way to increase keyframes / milestones, so that Pandrator tries to sync to the .srt timing more frequently, let's say every 10 seconds?

lukaszliniewicz commented 1 month ago

I have not tested the video + existing .srt workflow very much; I focused on the full workflow (transcription, translation, TTS, synchronisation). I will have to look into it. In the meantime, please try the full workflow. The DeepL API is free, so you can test the translation with that. Even with DeepL, this should give you much better subtitles than what you have now (bad punctuation, generally subpar translation). I will also try to do more tests with video+srt. Could you perhaps send me the video and the srt file that you're working with now? You can upload them here: https://1drv.ms/f/s!AgSiDu9lV3iMnoc6z92bdIyqZ05kAQ?e=Mu6vKw.

JohnF51 commented 1 month ago

In my experience with dubbing videos, and I've done quite a few of them, you need to raise the speed to 1.2, set the appended silence values to 100 and the paragraph length to 200. You also need to pay attention to the advanced TTS settings, especially the length penalty: use negative values when you want to stretch the length of a sentence and positive values when you want to shorten it. All values need to be experimented with, because each language has different characteristics. I usually do this 2-3 times and I get good results. The important thing is to watch the original srt, clean the translated srt of errors, then have it generate the voice, and only after correcting the voice run "Add Dubbing to Video". Good luck.

lukaszliniewicz commented 1 month ago

Thanks for sharing your experiences! The normal/paragraph silence should not be appended at all in the dubbing workflow; I will look into that. I've spent most of my time validating the full workflow - from video to translated subtitles to synchronised dubbing. I will have to test the existing srt workflow more thoroughly.

For very fast-paced videos, it may be a good idea to increase the speed of the generated audio, yes. The synchronisation works like this: if the generated audio is shorter than the time between the start of one subtitle and the start of the following one, silence is used to synchronise it. If it is longer, the algorithm keeps track of by how much and tries to self-correct by reducing or eliminating the shift when later segments are shorter than their allotted time. I will incorporate an automatic speed increase for longer segments, probably something conservative like 1.1 or 1.15 at most, to help keep it natural.
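
To illustrate the idea, here is a rough sketch of that bookkeeping (this is not Subdub's actual code; the segment structure is made up for the example):

```python
# Rough sketch of the alignment logic described above. Not Subdub's actual
# code; the segment structure is hypothetical.

def align_segments(segments):
    """segments: list of dicts with 'start_ms' (subtitle start time),
    'slot_ms' (time until the next subtitle starts) and
    'audio_ms' (duration of the generated TTS audio)."""
    timeline = []   # (placement_ms, audio_ms, appended_silence_ms)
    shift_ms = 0    # how far we are currently running behind

    for seg in segments:
        placement = seg["start_ms"] + shift_ms
        overrun = seg["audio_ms"] - seg["slot_ms"]

        if overrun > 0:
            # Generated audio is longer than its slot: we fall behind.
            shift_ms += overrun
            silence = 0
        else:
            # Audio is shorter: use the slack first to claw back any shift,
            # then fill whatever remains with silence.
            slack = -overrun
            recovered = min(shift_ms, slack)
            shift_ms -= recovered
            silence = slack - recovered

        timeline.append((placement, seg["audio_ms"], silence))

    return timeline
```

So a long segment only shifts the following speech until enough shorter segments have absorbed the difference.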

JohnF51 commented 1 month ago

Thank you for doing this, I'm looking forward to the update. I also want to thank you for version 0.26 - I have tested it and it works perfectly. I'm doing research on other possibilities, and if I find another working project I'll let you know in case you want to implement it in your program. These would be great possibilities:

  1. For dubbing, the ability to choose different voices for different characters (this also applies to audiobooks).
  2. Mouth opening based synchronization for certain types of videos.
  3. Choice of emotion generation options: sad, angry, happy, bored, calm, whispering, etc.

I believe these options will appear soon, as many people are already working on them. If you haven't seen Jarod's project yet, I recommend it (Jarod git). Thanks again for your work; I already have various modifications of it, and if I improve anything I'll let you know.

lukaszliniewicz commented 1 month ago

Thank you for the suggestions. Implementing a robust speaker attribution workflow to easily produce "audio dramas", basically, is not trivial. I think Jarod is doing it manually, allowing the user to choose a different voice for different sentences, which would be relatively easy to do, but it requires logic to separate speech (dialogue - "I'm happy") from speaker attribution (..., she said). You could use NLP processing for that, and I believe there is an app that uses booknlp for that purpose, but I haven't tested it yet and don't know how well it works. Ultimately I want to train a small LLM on text-to-JSON processing, including dialogue and speaker attribution, and use that for this purpose, though it is not a priority right now as, personally, I prefer regular audiobooks with just a narrator.

As for emotions, XTTS does them reasonably well, though of course it will get confused by more complex utterances. But for basic emotions it's really rather good. A lot depends on the sample voice you're using - it needs to have some dynamism to work well for that purpose.

What do you mean by "mouth opening based synchronisation"?

JohnF51 commented 1 month ago

Eleven Labs and Kapwing offer video synchronisation based on this technology. So far I haven't been able to get further, but I think it uses deep-fake technology, or AI image analysis - I don't know, I'm still researching this area. I'm sending a sample video that was made through Eleven Labs; the original audio was English.

https://github.com/user-attachments/assets/a5a958b5-e625-4961-bbfe-fc73973353c4

Through AI I found a way to use the dlib library:

1. Detecting the face and lips. First, we need to detect the face and then the lips in each frame of the video. We can use the dlib library, which provides models for face and landmark detection.

2. Analyzing lip movement. After detecting the lip landmarks, we can analyze their movement. We can use the distances between certain points on the lips to determine when the lips are open or closed.

3. Synchronizing audio. To synchronize the audio with the lip movements, we can use timestamps of when the lips are open or closed and adjust the audio accordingly using the pydub library.
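
A minimal, untested sketch of steps 1 and 2 (it assumes OpenCV for reading frames and dlib's standard shape_predictor_68_face_landmarks.dat model; the open-mouth threshold is just a placeholder):

```python
# Untested sketch of steps 1 and 2: detect the face, locate the mouth
# landmarks (points 48-67 in dlib's 68-point model) and decide per frame
# whether the mouth is open. Paths and the threshold are placeholders.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_open_frames(video_path, open_threshold=0.05):
    cap = cv2.VideoCapture(video_path)
    results = []  # one bool per frame: is the mouth open?
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        is_open = False
        if faces:
            shape = predictor(gray, faces[0])
            # Vertical gap between the inner lips (points 62 and 66),
            # normalised by the face height.
            gap = abs(shape.part(66).y - shape.part(62).y)
            face_h = faces[0].bottom() - faces[0].top()
            is_open = face_h > 0 and gap / face_h > open_threshold
        results.append(is_open)
    cap.release()
    return results
```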

I haven't tried it yet, but I'm interested in it; if I find anything I'll let you know.

JohnF51 commented 1 month ago

I like XTTS very much. It would be great if your program added buttons for generating basic emotions to express a sentence; it would save time when creating an audiobook. I am creating an audiobook using your program, thank you. I'm still looking for a way to train the Slovak language, but unfortunately I haven't made any progress.

lukaszliniewicz commented 1 month ago

Thanks, this is interesting, but I don't have time to investigate lip-syncing at the moment. My goal was to provide narrator-type dubbing for now. Next, I want to focus on improving PDF parsing. As for emotions, there is no simple way to control them for XTTS. The model "understands" some of the context and it does render emotions to some degree. Here is a sample I made some time ago: https://sndup.net/3jvnf/. But a lot depends on the voice sample, and it probably works best for English (simply because there was more English data in the dataset). What could be done, perhaps (though I would like to begin with something simpler, like detecting hallucinations and artifacts), would be to train an audio classification model or something like that, which would run in parallel to TTS generation on the CPU and mark sentences for regeneration automatically.

Here is an example of creating a dubbing (from a video, but I checked, and synchronisation also works if you choose SRT as input - you need to provide a video as well and let it do the alignment; it won't work if you use "save output" and then try to add the file as another audio track!):

pandrator_dubbing_full_workflow.webm

lukaszliniewicz commented 1 month ago

PS. After you run "Add dubbing to video", there will be a wav file in the session folder with "aligned" in its name. You can use this if you want to do the mixing yourself - this is the synchronised speech.

luckyluca commented 1 month ago

The translation and dubbing of that video looks cool.

However, for me, the ability to use external .srt files is vital, because it allows me to edit them externally in Aegisub more quickly than I could in Pandrator.

Please find the long video + English subtitles below (the video is a random one I found on the net): https://we.tl/t-hBDfwtY3kN

Thanks for updating the software so quickly. However, updating from the .exe doesn't work; it returns an error. Screenshot 2024-10-19 214213

I ended up running E:\Pandrator\conda\Scripts\conda.exe run -n pandrator_installer and then pip install -r E:\Pandrator\Pandrator\requirements.txt

However, I decided to scrap it all and redownload the .exe and v26 (downloading now).

Are you sure the Pandrator ecosystem is portable and that I don't need any software/dependency/module installed?

lukaszliniewicz commented 1 month ago

It should be fully portable, yes, though it's quite possible that I forgot about something. Let me know if there is a problem.

Using srt files works for me. The thing is that you need to align them, and not just use the save output option. Here is how this works (the result is not great, because the subtitles weren't great):

https://github.com/user-attachments/assets/dfd19868-c6fb-4e79-a81c-35cab01e7396

You need to select the video and run "Add Dubbing to Video" after you edited your srt file. If you want to mix the audio yourself, there will be a wav file with "aligned" in the name in the session folder.

luckyluca commented 1 month ago

I need to resolve the fact that I'm getting errors when updating and when running the Subdub part. Do you have libraries or tools installed outside of Pandrator? What Python version do you have installed on your system? I currently have no Python installed and am running Windows 10 on an HP Z840 workstation, if that helps.

lukaszliniewicz commented 1 month ago

System Python should not matter at all; everything ought to be handled inside and between the conda environments set up by Pandrator. As for the transcription part, the error was related to Pascal GPUs (like the 1080), which don't fully support float16 compute. It should be fixed now (it defaults to int8 for these GPUs). If you're still encountering an error, please let me know what it is exactly and when it occurs.

Does updating work when you use version 0.26?

luckyluca commented 1 month ago

Yes, I'm afraid so. Updating doesn't work. Screenshot 2024-10-19 232838

luckyluca commented 1 month ago

I loaded a shorter .mp4 file in Italian and selected translation from Italian to English. Pandrator froze after a while. If you look at the log, I think it has to do with Subdub. The test folder contains the original mp4 and the .srt in Italian. Check out the log below: pandrator_20241019_233209.log

lukaszliniewicz commented 1 month ago

I've made a small change in the update code, please download the exe from the releases and replace the one you have now with it. Hopefully it will work.

As for the dubbing, I think the problem is that you haven't set the API key variable for Anthropic (it looks like you chose Sonnet). If you want to use DeepL, which is free up to 500,000 characters per month, please select it and set the API key under "Api Keys" in Pandrator.

JohnF51 commented 1 month ago

Hi

I would like to ask you about .srt files. I have videos with short sentences, and when generating the text the program tries to join these short sentences, which doesn't work for me. Isn't there a way to preserve the block structure that the .srt uses? For example:

01 Hi
02 my name is Jarda
03 Today we are going to solve this.
04 We will proceed as follows.

When I have an srt like this and generate from it, the program tries to concatenate the sentences, and this is what happens: 01 Hi, my name is Jarda, today we will solve this. Here's what we're going to do.

I don't know if you understand what I mean. I tried the General text processing settings but couldn't change it. Isn't it possible to modify it somewhere in the Pandrator code so that the text generator keeps the srt lines exactly as they are, even if a line is extremely long or extremely short?

lukaszliniewicz commented 1 month ago

I guess what is happening is that in this case subtitle n's end is the same as subtitle n+1's start. Subdub joins subtitles within 1 ms of each other because, in my experience, it makes the generated speech more natural. To be sure, could you send me an example, an srt file and a video? I should be able to add an option to turn this behaviour off, but I need to know exactly what is happening.
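
Roughly, the joining behaviour amounts to something like this (a simplified sketch, not Subdub's actual code; the subtitle tuples are just for illustration):

```python
# Sketch of the joining described above: merge consecutive subtitles whose
# gap is at most 1 ms. Not Subdub's actual code; subtitles are represented
# here as hypothetical (start_ms, end_ms, text) tuples.
def merge_adjacent(subs, max_gap_ms=1):
    if not subs:
        return []
    merged = [subs[0]]
    for start, end, text in subs[1:]:
        prev_start, prev_end, prev_text = merged[-1]
        if start - prev_end <= max_gap_ms:
            # Back-to-back subtitles: join the text and extend the block.
            merged[-1] = (prev_start, end, prev_text + " " + text)
        else:
            merged.append((start, end, text))
    return merged
```

An option to turn the behaviour off would simply skip the merge branch.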

JohnF51 commented 1 month ago

Thank you

luckyluca commented 1 month ago


Updating returns the same error as before (using v26 and the latest .exe). Screenshot 2024-10-20 171024

P.S. DeepL works! But I can see the same offsets where the audio is slower and goes out of sync with the displayed .srt. This happens both with an existing .srt and with an .srt created in Pandrator.

lukaszliniewicz commented 1 month ago

There will be an offset sometimes, yes, because the final subtitles (with "equalized" in the filename) are not exactly the subtitles produced by Whisper (those tend to be too long); they are post-processed to conform to subtitle standards (one line of text, etc.). To use the original Whisper subtitles, select the file without "equalized". So several consecutive subtitle lines may originally have been one subtitle, in which case we read them as one, and sometimes the audio will be faster or slower depending on how the TTS speech rate compares to the original speech rate. It will be corrected at the next subtitle (if running ahead) or at the next pause (if running behind).

JohnF51 commented 1 month ago

Thank you for the explanation. That's exactly right - I was working with the direct srt, not the equalized one; that was a mistake. I forgot to say that I created a program that edits the subtitles from Whisper and the translation from DeepL so that I have less work cleaning the generated text, because I discovered that in my language, if I add a dash after each sentence ending, the error rate of generated artifacts after the sentence drops significantly. Please check whether this also works for you. I'll try to look at the equalized .srt file, because so far I've only been using the .srt translated directly via DeepL.

I would also like to ask if you could change the pandrator.py code to make the generated or marked text wrap (sometimes I use quite long sentences, so I can't see the end of the line) and also the font size. I did change the code to add a horizontal offset, but I have to change it again after every update. Thanks again.

lukaszliniewicz commented 1 month ago

If you weren't using the equalized subtitles, then the misalignment simply results from the generated audio being longer or shorter than the original speech. Subdub self-corrects as soon as possible, but there will be some misalignment at times, because I think we should prioritise natural speech over exact word-to-word alignment. This is also the case in professional narration/dubbing with one narrator. And to achieve natural-sounding speech with TTS, we should try to give the model clauses, or logical sentence segments, not randomly chopped-up fragments, because if you put those together later, they will sound, well, choppy and very artificial.

Significant misalignment may occur when the original is very fast-paced and the TTS audio cannot keep up - if there are no pauses, or only very short ones, it may not "recover" fully. In that case it's necessary to regenerate with a higher speed setting or with a voice sample with fast speech (XTTS will generally take the speech rate from the sample). I will try to automate this to a certain degree. Some people just apply a speedup so that the generated audio always fits inside the subtitle timings, but in my experience this leads to bad results. I prefer to have some misalignment and have the algorithm self-correct it if possible. But we could do something like "if the generated audio exceeds the subtitle time, regenerate it with a 1.15 speed setting", or something similar. This would reduce the misalignment. Or I could introduce a "preview mode" to generate and align only the first x minutes of the video, to make it easier to fine-tune the speed settings.

I will try to add wrapping or a horizontal scrollbar. But I'm a bit surprised that your sentences are that long. Have you increased the 160 character limit? Pandrator should break down sentences over 160 characters - first at the punctuation mark or conjunction nearest to the middle, and then again for the parts if they are still longer than 160 characters - to try and preserve the natural flow of speech (which works very well if we feed the model clauses).
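
For reference, the splitting I describe amounts to something like this (a simplified sketch, not Pandrator's actual implementation; the break-point pattern is only illustrative):

```python
# Simplified sketch of the splitting described above: sentences over the
# limit are split at the punctuation mark or conjunction nearest to the
# middle, and each half is split again if it is still too long.
# Not Pandrator's actual code; the break-point pattern is illustrative.
import re

BREAK_POINTS = re.compile(r"[,;:]|\b(?:and|but|or|which)\b")

def split_long_sentence(sentence, limit=160):
    sentence = sentence.strip()
    if len(sentence) <= limit:
        return [sentence]
    # Only consider break points that leave text on both sides,
    # so the recursion always shrinks the input.
    candidates = [m.end() for m in BREAK_POINTS.finditer(sentence)
                  if 0 < m.end() < len(sentence)]
    if not candidates:
        return [sentence]
    middle = len(sentence) // 2
    split_at = min(candidates, key=lambda pos: abs(pos - middle))
    left, right = sentence[:split_at].strip(), sentence[split_at:].strip()
    return split_long_sentence(left, limit) + split_long_sentence(right, limit)
```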

JohnF51 commented 1 month ago

Yes, I also use longer sentences. I found (as I wrote) that if you add a comma after a sentence that ends with a period, you can generate much longer text. The only bug I remove manually is the occasional repetition of a word, but that doesn't bother me because I do post-processing in Adobe Audition after output. Sometimes it is better to use a long sentence, or even several sentences in a row, as far as audiobooks are concerned. And if I increase the font size, of course, 160 characters won't fit on the screen. I also changed the colour of the selected text to yellow, the generated text to a dark green background and the marked text to a dark red background, but that's just my adjustment. Thank you for the quick reply and for pointing me in the right direction for more accurate dubbing.

lukaszliniewicz commented 1 month ago

A few short sentences together work better, yes, which is why I have the "append short sentences" option, but in my experience the quality deteriorates quickly beyond 160 characters, and definitely beyond 200, though it may depend on the language and the reference voice. Have you tried training a voice? Pandrator automatically removes periods before sending the text for TTS (I should add it as an option, including replacing periods with commas, why not).

JohnF51 commented 1 month ago

Yes, I tried training a voice and the result is great. I'm looking forward to the next update, but even without it, it's already a great program. Thank you.

lukaszliniewicz commented 1 month ago

I'm working on a new feature, and for that I will have to change the listbox to another widget anyway, so I hope to solve this problem then.