
= 🗼CCAligner image:http://upload.wikimedia.org/wikipedia/commons/3/35/Tux.svg[Linux,25,25,link="https://travis-ci.org/saurabhshri/CCAligner"] / image:https://upload.wikimedia.org/wikipedia/commons/f/fa/Apple_logo_black.svg[macOS,25,25,link="https://travis-ci.org/saurabhshri/CCAligner"] image:https://travis-ci.org/saurabhshri/CCAligner.svg?branch=master["Linux/macOS Build Status", link="https://travis-ci.org/saurabhshri/CCAligner"] image:https://upload.wikimedia.org/wikipedia/commons/e/ee/Windows_logo_%E2%80%93_2012_%28dark_blue%29.svg[Windows,25,25,link="https://ci.appveyor.com/project/saurabhshri/ccaligner"] image:https://ci.appveyor.com/api/projects/status/pojryaxnykthuy9p/branch/master?svg=true["Windows Build Status", link="https://ci.appveyor.com/project/saurabhshri/ccaligner"]

Word by word audio subtitle synchronization (forced alignment) tool and API. Developed under Google Summer of Code 2017 with CCExtractor.

(https://saurabhshri.github.io/)

[link=https://www.youtube.com/watch?v=38_27E1PxXA] image::https://raw.githubusercontent.com/saurabhshri/CCAligner/master/docs/demo.gif[align="center"]


The project is in its very early stages and is constantly evolving. The available functions, usage instructions, et cetera are expected to change over time. It is not production ready, but you are welcome to play with it, or better, help improve it! :)


== Using CCAligner

CCAligner can be used both as a standalone tool and as a library in your own project.

=== Installing Dependencies ===

To automatically generate language models, dictionaries and grammars, the following dependencies need to be installed. The tool is capable of generating them without these dependencies, but the accuracy in that case is not guaranteed. It is highly recommended to work with the dependencies installed.

  1. cmuclmtk : to generate vocab and LM. (install/dependencies/cmuclmtk-0.7.tar.gz)
  2. g2p-seq2seq : to generate dictionary. (install/dependencies/g2p-seq2seq)

You will also need to install http://www.perl.org/get.html[Perl] and move install/quick_lm.pl to the same directory as the CCAligner executable, or to a directory that is listed in the PATH environment variable.

Steps :

==== Linux/macOS ====

To install cmuclmtk :

  1. Navigate to install/dependencies directory and uncompress cmuclmtk-0.7.tar.gz while preserving the permissions :

    tar xvpzf cmuclmtk-0.7.tar.gz

Original download link : https://sourceforge.net/projects/cmusphinx/files/cmuclmtk/0.7/cmuclmtk-0.7.tar.gz/download

  2. Navigate to the cmuclmtk-0.7 directory :

    cd cmuclmtk-0.7

  3. Install :

    ./configure
    make
    sudo make install

You may have to run sudo ldconfig to fix errors such as missing shared libraries.

To install g2p-seq2seq :

  1. The tool requires TensorFlow, version 1.0.0 at minimum. Please see the installation https://www.tensorflow.org/install[guide^] for details. If you are on Linux (x86_64), you may directly run the following :

    sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0-cp27-none-linux_x86_64.whl

Note : To use g2p with the latest versions of TensorFlow, download the up-to-date repository from https://github.com/cmusphinx/g2p-seq2seq. Please note that recent g2p versions bring changes that break a few things in CCAligner, so using the supplied version is recommended.

  2. Navigate to the install/dependencies directory and uncompress g2p-seq2seq-master.zip while preserving the permissions :

    unzip g2p-seq2seq-master.zip

  3. Navigate to the g2p-seq2seq-master directory and run :

    sudo python setup.py install

Alternate ways of generating language models, dictionaries and grammars are covered later in the docs.

==== Windows ====

To install cmuclmtk :

  1. Navigate to the install/dependencies directory and uncompress cmuclmtk-0.7.tar.gz. Build it with the provided cmuclmtk.sln.

  2. Copy the compiled files (wfreq2vocab.exe, text2wfreq.exe, text2idngram.exe, idngram2lm.exe) to the same directory as the CCAligner executable, or add the directory containing these files to the PATH environment variable.

To install g2p-seq2seq :

  1. First, install Python 3.5 (64-bit) and TensorFlow 1.0.0 using your preferred method.

  2. Navigate to install/dependencies directory and uncompress g2p-seq2seq-master.zip

  3. Navigate to g2p-seq2seq-master directory and run :

    python setup.py install

=== Before You Run ===

  1. Please make sure you have all the dependencies installed if you want to use the grammar tools. To stop CCAligner from generating grammar, pass --generate-grammar no.

  2. Make sure the model folder and g2p-seq2seq-cmudict are in the directory from which you are running CCAligner.

  3. Make sure the subtitles are clean and are in proper SRT format.

  4. The wav file should be 16-bit PCM mono, sampled at 16 kHz. To generate such a wav file from a video using FFmpeg, you may run :

    ./ffmpeg -i input.video -bits_per_raw_sample 16 -ar 16000 -ac 1 output.wav
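
If you want to confirm the result, ffprobe (shipped alongside FFmpeg) prints the stream details, including the codec, sample rate and channel count :

    ./ffprobe output.wav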

=== Installing ===

==== Linux/macOS ====

  1. Clone the repository from Github using :

    git clone https://www.github.com/saurabhshri/CCAligner.git

  2. Navigate to the install directory and run build.sh :

    cd install/
    ./build.sh

  3. Align!

    ./ccaligner

==== Windows ====

  1. Clone the repository from Github using :

    git clone https://www.github.com/saurabhshri/CCAligner.git

  2. Use CMake to generate project files, and then build it.

  3. Align!

    .\ccaligner

=== Quick Demo ===

The default output of CCAligner is an XML file. For example, the following command will generate file.xml :

    ./ccaligner -wav /path/to/file.wav -srt /path/to/file.srt

Generated Output Snippet :

.
.
<subtitle>
    <start>12780</start>
    <dialogue>I was offered a summer research      fellowship at Princeton.    </dialogue>
    <edited_dialogue>I was offered a summer research fellowship at Princeton</edited_dialogue>
        <words>
            <word>
                <recognised>0</recognised>
                <text>I</text>
                <start>12780</start>
                <end>12911</end>
                <duration>131</duration>
            </word>
            <word>
                <recognised>1</recognised>
                <text>was</text>
                <start>13030</start>
                <end>13330</end>
                <duration>300</duration>
            </word>
            <word>
                <recognised>1</recognised>
                <text>offered</text>
                <start>13400</start>
                <end>13770</end>
                <duration>370</duration>
            </word>
            .
            .
            .
        </words>
    <end>16382</end>
</subtitle>
.
.
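
If you consume this XML programmatically, a small parser is enough to pull out the per-word timings. Below is a minimal C++ sketch using the third-party tinyxml2 library (an assumed dependency; any XML parser works), with element names taken from the snippet above. Adjust the traversal if the actual layout of your output file differs.

[source,cpp]
----
#include <cstdio>

#include "tinyxml2.h" // assumed dependency; any XML library would do

int main() {
    tinyxml2::XMLDocument doc;
    if (doc.LoadFile("file.xml") != tinyxml2::XML_SUCCESS) {
        std::fprintf(stderr, "could not load file.xml\n");
        return 1;
    }

    // Assumption: the <subtitle> entries shown above sit directly under
    // the document's root element.
    for (tinyxml2::XMLElement *sub = doc.RootElement()->FirstChildElement("subtitle");
         sub != nullptr; sub = sub->NextSiblingElement("subtitle")) {
        tinyxml2::XMLElement *words = sub->FirstChildElement("words");
        if (words == nullptr)
            continue;

        for (tinyxml2::XMLElement *word = words->FirstChildElement("word");
             word != nullptr; word = word->NextSiblingElement("word")) {
            tinyxml2::XMLElement *text  = word->FirstChildElement("text");
            tinyxml2::XMLElement *start = word->FirstChildElement("start");
            tinyxml2::XMLElement *end   = word->FirstChildElement("end");
            if (text == nullptr || start == nullptr || end == nullptr)
                continue;

            // Timestamps are in milliseconds, as in the snippet above.
            std::printf("%-15s [%d:%d]\n",
                        text->GetText() ? text->GetText() : "",
                        start->IntText(), end->IntText());
        }
    }
    return 0;
}
----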

=== API or Library usage ===

  1. Clone the repository from Github :

    git clone https://github.com/saurabhshri/CCAligner.git

  2. Place the CCAligner folder in an appropriate directory in your project.

  3. In your project, simply include the directories and source files you wish to use. You may refer to the CMakeLists.txt in the src/ directory to get an idea. The CCAligner tool is itself built around the CCAligner API.

For example, if you want to use the audio based alignment in your project :


[source,cpp]
----
// Include the header file
#include "recognize_using_pocketsphinx.h"

// Declare the aligner (_parameters is a parameters object you have configured)
PocketsphinxAligner *aligner = new PocketsphinxAligner(_parameters);

// Align
aligner->align();

// Print the result
aligner->printAligned("Manual_Printing.json", json);

// Delete the aligner
delete aligner;
----

Complete documentation of the API will be written in the docs.

=== Some Previews ===

[cols="1,5"] |=== a| [link=https://www.youtube.com/watch?v=38_27E1PxXA] image::https://img.youtube.com/vi/38_27E1PxXA/0.jpg[height = "100px"] | Word by Word Audio Subtitle Synchronization - Karaoke Demo 1

_(https://www.youtube.com/watch?v=38_27E1PxXA)_

[Sitcom]

a| [link=https://www.youtube.com/watch?v=6VnhC8u_d40] image::https://img.youtube.com/vi/6VnhC8u_d40/0.jpg[height = "100px"] | Word by Word Audio Subtitle Synchronization - Karaoke Demo 2

_(https://www.youtube.com/watch?v=6VnhC8u_d40)_

[Ted Talk]

a| [link=https://www.youtube.com/watch?v=j_zeixo-zJY] image::https://img.youtube.com/vi/j_zeixo-zJY/0.jpg[height = "100px"] | Word by Word Audio Subtitle Synchronization - Karaoke Demo 3

_(https://www.youtube.com/watch?v=j_zeixo-zJY)_

[Cartoon Show]

a| [link=https://www.youtube.com/watch?v=8tTDX6NZGsU] image::https://img.youtube.com/vi/8tTDX6NZGsU/0.jpg[height = "100px"] | Word by Word Audio Subtitle Synchronization - Karaoke Demo 4

_(https://www.youtube.com/watch?v=8tTDX6NZGsU)_

[Discussion Video]

a| [link=https://www.youtube.com/watch?v=tFrf0TVnqIQ] image::https://img.youtube.com/vi/tFrf0TVnqIQ/0.jpg[height = "100px"] | Word by Word Audio Video Transcription Demo

_(https://www.youtube.com/watch?v=tFrf0TVnqIQ)_

[Reality Show]

a| [link=https://www.youtube.com/watch?v=km1iHe_mGuo] image::https://img.youtube.com/vi/km1iHe_mGuo/0.jpg[height = "100px"] | Approximate Word by Word Audio Subtitle Synchronization

_(https://www.youtube.com/watch?v=km1iHe_mGuo)_

|===

== Usage Parameters ==

The following is a complete list of available parameters that can be passed to CCAligner. Feel free to open a PR if you spot a missing parameter.

[cols="2,2,4"] |=== | Parameter | Accepted Values | Description

|-wav |/path/to/wav_file |Provide the path to the input audio wave file. The wave file must be 16-bit PCM mono, sampled at 16 kHz.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt

Required : yes.

|-srt |/path/to/subtitle_file |Provide path to subtitle file in SRT format. Please ensure that the subtitle file is clean and in proper format.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt

Required : yes.

|-stdin or - |Audio wave file from stdin or pipe. |Use this parameter to pass the wav file via stdin or a pipe.

E.g.: cat tbbt.wav \| ccaligner -stdin -srt tbbt.srt

|===

[cols="2,2,4"]
|===
| Parameter | Accepted Values | Description

|-out |/path/to/output_file |Provide the name and path of the output file. By default, the output name is derived from the input file name and the output is generated in the same location as the input file.

_E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -out my_output.xml_

|-oFormat |xml, json, srt, karaoke, stdout |Choose the output format. By default, the output format is XML.

_E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -out output_as_karaoke.srt -oFormat karaoke_

|-log |/path/to/aligner_log_file |Specify the path to the log file for the PocketSphinx decoder. By default the log is stored in tempFiles/{execution_timestamp}.log

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -log tbbt.log

|-phoneLog |/path/to/phoneme_log_file |Specify the path to the log file for the PocketSphinx phoneme decoder. By default the log is stored in tempFiles/phoneme_{execution_timestamp}.log

_E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -phoneLog tbbt_phoneme.log_

|===

[cols="2,2,4"]
|===
| Parameter | Accepted Values | Description

|-approx |yes, no |Use the approximate aligner instead of the audio based aligner. It calculates word timings from each word's weight; this is very fast and involves no audio analysis, but the results are approximate rather than accurate.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -approx yes

|--enable-phonemes |yes, no |Recognise phonemes and their timestamps along with words. SRT and Karaoke output cannot display phonemes.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --enable-phonemes yes

|-transcribe |yes, no |Transcribe the complete audio instead of searching using timestamps and subtitles. Use this when the timings in the subtitles are incorrect or when you want a YouTube-like transcription of the video.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -transcribe yes

|--use-fsg |yes, no |Instruct CCAligner to follow a Finite State Grammar while performing recognition.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --use-fsg yes

|-useBatchMode |yes, no |Instruct CCAligner to use PocketSphinx's batch mode. May improve accuracy by flushing CMN values.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -useBatchMode yes

|-experiment |yes, no |Use experimental parameters. May improve accuracy in some cases.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -experiment yes

|-searchWindow |An integer |Determines how far the currently recognised word is searched for within the respective subtitle dialogue. The default value is 3.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -searchWindow 6

|-audioWindow |An integer |Determines the window before and after the current subtitle's timing within which recognition is performed. The value is in milliseconds. The default value is 0.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -audioWindow 500

|-sampleWindow |An integer |Determines the window before and after the current subtitle's timing within which recognition is performed. The value is in number of samples. The default value is 0.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -sampleWindow 500

|===

[cols="2,2,4"]
|===
| Parameter | Accepted Values | Description

|--generate-grammar |yes, no, onlyCorpus, onlyDict, onlyFSG, onlyLM, onlyVocab |Decides whether, and which type of, grammar/LM files are generated. Once you have generated these files, there is no need to generate them again; they are stored in tempFiles/{respective_dir}. Also use this when supplying the files manually.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --generate-grammar no

|-model |path/to/acoustic/model |Enter the path of the acoustic model to be used by the aligner. Accuracy depends heavily on the acoustic model.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -model model/

|-lm |path/to/language/model |Enter the path of the language model to be used by the aligner.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -lm custom.lm

|-dict |path/to/dictionary |Enter the path of the dictionary to be used by the aligner.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -dict custom.dict

|-fsg |path/to/fsg/directory |Enter the path of the directory containing the FSGs, each named after the starting timestamp of its dialogue.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -fsg fsg/

|-phoneLM |path/to/phonetic/language/model |Enter the path of the phonetic language model to be used by the aligner.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -phoneLM custom_phonetic.lm

|--quick-dict |yes, no |Generate the dictionary quickly without using TensorFlow and seq2seq. The result may be less accurate.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --quick-dict yes

|--quick-lm |yes, no |Generate the language model quickly without using cmuclmtk. The result may be less accurate.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --quick-lm yes

|===

[cols="2,2,4"]
|===
| Parameter | Accepted Values | Description

|-verbose |yes, no |Turn verbosity on or off. Turn it off to suppress [info] logs.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt -verbose no

|--display-recognised |yes, no |Determine whether the recognised words and their matching status are displayed on stdout.

E.g.: ccaligner -wav tbbt.wav -srt tbbt.srt --display-recognised no

|===
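
The parameters above can be combined in a single invocation. For example, the following command (file names are placeholders) writes JSON output with phoneme recognition enabled and verbose logging off :

    ccaligner -wav tbbt.wav -srt tbbt.srt -out tbbt.json -oFormat json --enable-phonemes yes -verbose no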

== Project Details ==

The usual subtitle files (such as SubRip) are synchronized line by line, i.e. the subtitle containing a dialogue appears when the person starts talking and disappears when the dialogue finishes. This continues for the whole video. For example :

    1274
    01:55:48,484 --> 01:55:50,860
    The Force is strong with this one

In the above example, dialogue #1274, The Force is strong with this one, appears at 1:55:48, remains on the screen for about two seconds, and disappears at 1:55:50.

The aim of the project is to tag the word as it is spoken, similar to that in karaoke systems.

E.g.

    The           [6948484:6948500]
    Force         [6948501:6948633]
    is            [6948634:6948710]
    strong        [6948711:6948999]
    with          [6949100:6949313]

In the above example, each word from the subtitle is tagged with beginning and ending timestamps (in milliseconds), based on the audio.
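
To relate these millisecond tags to the familiar SRT clock format, a few integer divisions suffice. The following is a minimal C++ sketch (the helper name is illustrative, not part of the CCAligner API) :

[source,cpp]
----
#include <cstdio>

// Convert a millisecond offset, as used in the word tags above,
// into an SRT-style HH:MM:SS,mmm timestamp.
void printSrtTimestamp(long ms) {
    std::printf("%02ld:%02ld:%02ld,%03ld\n",
                ms / 3600000,            // hours
                (ms % 3600000) / 60000,  // minutes
                (ms % 60000) / 1000,     // seconds
                ms % 1000);              // milliseconds
}

int main() {
    printSrtTimestamp(6948484); // prints 01:55:48,484 -- the start of "The" above
    return 0;
}
----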


== Credits and Licensing ==

I haven't decided on a license for the tool yet, but all the individual licenses of the libraries and code used can be found in the license/ directory.

I have tried my best to ensure that credit and references are given in the source wherever due. In case I have missed any reference/license, please accept my apology. Feel free to reach out to me and I'll be happy to correct my mistake. 🤝

== Contributing ==

The project is under constant development and needs a lot of polishing and bug fixes. Feel free to contribute in any way. Your contribution will be highly appreciated! 🙂