srvk / eesen-transcriber

EESEN based offline transcriber VM using models trained on TEDLIUM and Cantab Research
Apache License 2.0

Alignment "doubling" #31

Open aolney opened 5 years ago

aolney commented 5 years ago

align.sh is doubling its output, and the times are way off. Here is the STM, which was generated from the SRT subtitles (CC) extracted with FFmpeg:

11  A   FakeSpeaker 3.103   5.606   and now a fireside chat
11  A   FakeSpeaker 5.606   6.607   with the creators of comedy central's south park,
11  A   FakeSpeaker 6.607   8.609   matt stone and trey parker.
11  A   FakeSpeaker 13.614  15.116  hi. i'm trey parker.
11  A   FakeSpeaker 15.115  16.617  and i'm matt stone.

Here is the resulting .ali file:

1-1-S0---0025.380-0032.830 1 25.38 0.06 now
1-1-S0---0025.380-0032.830 1 25.44 0.00 a
1-1-S0---0025.380-0032.830 1 25.44 0.00 fireside
1-1-S0---0025.380-0032.830 1 25.44 1.77 chat
1-1-S0---0025.380-0032.830 1 27.21 0.09 now
1-1-S0---0025.380-0032.830 1 27.30 0.36 a
1-1-S0---0025.380-0032.830 1 27.66 0.87 fireside
1-1-S0---0025.380-0032.830 1 28.53 0.24 chat
1-1-S0---0025.380-0032.830 1 28.77 0.06 now
1-1-S0---0025.380-0032.830 1 28.83 0.12 a
1-1-S0---0025.380-0032.830 1 28.95 1.08 fireside
1-1-S0---0025.380-0032.830 1 30.03 0.63 chat
1-1-S0---0025.380-0032.830 1 30.66 0.12 now
1-1-S0---0025.380-0032.830 1 30.78 0.45 a
1-1-S0---0025.380-0032.830 1 31.23 0.39 fireside
1-1-S0---0025.380-0032.830 1 31.62 1.20 chat
1-1-S0---0032.830-0051.960 1 32.83 0.03 the
1-1-S0---0032.830-0051.960 1 32.86 0.99 creators
1-1-S0---0032.830-0051.960 1 33.85 0.12 of
1-1-S0---0032.830-0051.960 1 33.97 1.29 comedy
1-1-S0---0032.830-0051.960 1 35.26 4.26 central's
1-1-S0---0032.830-0051.960 1 39.52 0.00 south
1-1-S0---0032.830-0051.960 1 39.52 3.51 <unk>
1-1-S0---0032.830-0051.960 1 43.03 0.24 the
1-1-S0---0032.830-0051.960 1 43.27 0.87 creators
1-1-S0---0032.830-0051.960 1 44.14 0.06 of
1-1-S0---0032.830-0051.960 1 44.20 1.08 comedy
1-1-S0---0032.830-0051.960 1 45.28 1.47 central's
1-1-S0---0032.830-0051.960 1 46.75 0.00 south
1-1-S0---0032.830-0051.960 1 46.75 3.09 <unk>
1-1-S0---0032.830-0051.960 1 49.84 0.18 the
1-1-S0---0032.830-0051.960 1 50.02 0.54 creators
1-1-S0---0032.830-0051.960 1 50.56 0.03 of
1-1-S0---0032.830-0051.960 1 50.59 0.00 comedy
1-1-S0---0032.830-0051.960 1 50.59 1.11 central's
1-1-S0---0051.960-0064.490 1 51.96 0.42 stone
1-1-S0---0051.960-0064.490 1 52.38 0.09 and
1-1-S0---0051.960-0064.490 1 52.47 0.33 trey
1-1-S0---0051.960-0064.490 1 52.80 0.00 <unk>
1-1-S0---0051.960-0064.490 1 52.80 0.00 stone
1-1-S0---0051.960-0064.490 1 52.80 0.27 and
1-1-S0---0051.960-0064.490 1 53.07 0.00 trey
1-1-S0---0051.960-0064.490 1 53.07 0.00 <unk>
1-1-S0---0051.960-0064.490 1 53.07 0.00 stone
1-1-S0---0051.960-0064.490 1 53.07 0.00 and
1-1-S0---0051.960-0064.490 1 53.07 0.00 trey
1-1-S0---0051.960-0064.490 1 53.07 0.00 <unk>
1-1-S0---0051.960-0064.490 1 53.07 0.00 stone

Any suggestions would be appreciated. Regular ASR functionality (with Kaldi) is working fine. FWIW, my steps and utils are linked to Kaldi and not to Eesen.
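The utterance IDs in the .ali output encode a time window (for example `0025.380-0032.830`), and none of these windows matches a start/end pair in the STM (3.103 to 5.606, 5.606 to 6.607, and so on). That suggests the segments came from somewhere other than the STM. A minimal sketch for listing those windows and the number of aligned words per utterance, assuming the ID layout shown in the listing above (pieces separated by `-`, with the last two being start and end in seconds); the function name `summarize_ali` and the file name in the usage note are hypothetical:

```shell
#!/bin/bash
# Summarize an .ali file: one line per utterance giving the start and end
# encoded in the utterance ID, followed by the number of aligned words.
# Assumes IDs look like 1-1-S0---0025.380-0032.830, as in the listing above.
summarize_ali() {
  awk '{
    id = $1
    if (!(id in words)) {
      n = split(id, parts, "-")     # last two pieces are start and end
      order[++k] = id               # remember first-seen order
      t0[id] = parts[n - 1] + 0
      t1[id] = parts[n] + 0
    }
    words[id]++
  }
  END {
    for (i = 1; i <= k; i++) {
      id = order[i]
      printf "%.3f %.3f %d\n", t0[id], t1[id], words[id]
    }
  }' "$1"
}
```

Running `summarize_ali build/output/sample.ali` and comparing the first two columns against columns 4 and 5 of the STM is a quick way to see whether the STM segments were actually used.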

aolney commented 5 years ago

The problem seemed to be in the Makefile. Instead of using the STM, it was running LIUM. Below is my align.sh that seems to have fixed this problem:

#!/bin/bash

# Copyright 2016  er1k
# Apache 2.0

# Prepare data for, and run align_ctc_utts.sh script that generates word-level alignments
# in an "Eesen Transcriber-centric" way; output is found in build/output/<basename>.ali

# Required inputs:
#
# * a 'hypothesis' text file for which to compute alignments, extension .txt
#   one utterance per line. If no hypothesis text is found, text
#   is obtained from the STM file below
# * an STM file with utterance/segment timings - 'perfect' transcription
# * an audio file, extension can vary (.mp3, .wav, .mp4 etc)

BASEDIR=$(dirname $0)
EESEN_ROOT=~/eesen

# Change these if you're using different models 
#GRAPH_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/data/lang_phn_test_test_newlm
GRAPH_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/data/lang_phn_test
MODEL_DIR=$EESEN_ROOT/asr_egs/tedlium/v2-30ms/exp/train_phn_l5_c320_v1s

# Defaults
frame_shift=0.03  # 30 ms frames
lm_weight=0.8     # same as best setting for 30ms eesen tedlium transcriber

. path.sh
. $BASEDIR/utils/parse_options.sh

filename=$(basename "$1")
basename="${filename%.*}"
dirname=$(dirname "$1")
extension="${filename##*.}"

cd $BASEDIR
echo "In $BASEDIR"

if [ $# -ne 1 ]; then
  echo "Usage: align.sh <basename>.{wav,mp3,mp4,sph}"
  echo " in same folder is test text named <basename>.txt"
  echo " and STM file named <basename>.stm (for segments)"
  echo " ./align.sh /vagrant/GaryFlake_2010.wav"
  echo " output is build/output/<basename>.ali"
  exit 1;
fi

mkdir -p $BASEDIR/build/audio/base $BASEDIR/build/output

# un-shorten-ify SPH files
#if [ $extension == "sph" ]; then
#    sph2pipe $1 > build/audio/base/$basename.unshorten
#    sox build/audio/base/$basename.unshorten -c 1 build/audio/base/$basename.wav rate -v 16k
#fi

mkdir -p $BASEDIR/src-audio
cp $1 $BASEDIR/src-audio
#prefixing with BASEDIR throws off make rule?
#make $BASEDIR/build/audio/base/$basename.wav
make build/audio/base/$basename.wav

# 8k
# sox $1 -c 1 -e signed-integer build/audio/base/$basename.wav rate -v 8k

mkdir -p $BASEDIR/build/diarization/$basename
# make STM from cha
if [ -f $dirname/$basename.cha -a ! -f $dirname/$basename.stm ]; then
  local/cha2stm.sh $dirname/$basename.cha | sed 's/xxx/\<unk\>/g' > build/output/$basename.stm
elif [ -f $dirname/$basename.stm ]; then
  cp $dirname/$basename.stm build/output/
elif [ ! -f $dirname/$basename.stm ]; then
  echo "Needs either a .cha or .stm file to get utterances"
  exit 1
fi

#if [ ! -f $dirname/$basename.txt ]; then
#  echo "Needs .txt file with utterance per line as reference text to align"
#  exit 1
#fi

# make segments from $1.stm
cat build/output/$basename.stm | grep -v ';;' | grep -v "inter_segment_gap" | grep -v "ignore_time_segment_in_scoring" | awk '{OFMT = "%.0f"; print $1,$2,$4*100,($5-$4)*100,"M S U",$2}' > build/diarization/$basename/show.seg

# Generate features
cd $BASEDIR
rm -rf build/trans/$basename

make SEGMENTS=show.seg build/trans/$basename/fbank

# Expect test text in format with utterance IDs per line
uttdata=build/trans/$basename
#if [ -f $dirname/$basename.txt ];
#  then
#    echo "Aligning text found at $dirname/$basename.txt"
#    cat $dirname/$basename.txt | awk '{print NR" "$0}' > $uttdata/text
#  else
    echo "Aligning text found in build/output/$basename.stm"
    cat build/output/$basename.stm | awk '{$1="";$2="";$3="";$4="";$5="";$6=""; print NR$0}' \
    | sed 's/ \+/ /' > $uttdata/text
#fi
cp build/diarization/$basename/show.seg $uttdata

#local/align_ctc_multi_utts.sh --acoustic_scale 0.8 $GRAPH_DIR $GRAPH_DIR $uttdata  $MODEL_DIR $uttdata/align
#                                                   <langdir>  <data>     <uttdata> <mdldir>   <dir>
local/align_ctc_multi_utts.sh --acoustic_scale $lm_weight $GRAPH_DIR $GRAPH_DIR $uttdata  $MODEL_DIR $uttdata/align

# Copy results to someplace useful
cp $uttdata/align/ali build/output/$basename.ali
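
For reference, the segment-building step in the script above turns each STM line into a LIUM-style segment line, with start and duration written in hundredths of a second. A sketch of that step as a standalone function, mirroring the script's awk one-liner; the function name `stm_to_seg` is hypothetical, and the behavior is meant to be the same as the original pipeline:

```shell
#!/bin/bash
# Convert STM lines to segment lines the way align.sh does:
# <file> <channel> <start*100> <duration*100> M S U <channel>
# Comment lines (;;) and scoring markers are skipped, as in the script.
stm_to_seg() {
  grep -v ';;' "$1" \
    | grep -v -e 'inter_segment_gap' -e 'ignore_time_segment_in_scoring' \
    | awk 'BEGIN { OFMT = "%.0f" }   # round non-integer values when printing
           { print $1, $2, $4 * 100, ($5 - $4) * 100, "M S U", $2 }'
}
```

For the first line of the STM posted at the top of the thread (start 3.103, end 5.606), this yields `11 A 310 250 M S U A`, so the segment should begin near 3.1 s rather than at 25.38 s as in the doubled output.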

fmetze commented 5 years ago

I will need to look into this some other time. Please let me know if you have other information or updates.

aolney commented 5 years ago

Only that once the STM was properly used, the doubling issue went away. However, the alignments still seemed off.
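
One thing that may be worth checking here: the text-extraction step in align.sh blanks the first six fields of every STM line, which fits STM files that carry a label field (such as `<o,f0,male>`) before the transcript. The STM posted at the top of this thread has only five metadata fields, so the first word of each utterance ("and", "with", "matt") would be dropped. That is consistent with the .ali listing starting at "now", "the", and "stone". Whether this explains the remaining timing offset is not established. A sketch of an extraction that handles both layouts, assuming a label is recognizable as `<...>` containing a comma; the function name `stm_to_text` is hypothetical and this is not the project's own fix:

```shell
#!/bin/bash
# Print "<line number> <transcript>" for each STM line.
# Assumption: field 6 is a label only if it looks like <a,b,c>;
# otherwise the transcript already starts at field 6.
# Line numbering follows NR over all lines, as in the original script.
stm_to_text() {
  awk '{
    first = ($6 ~ /^<.*,.*>$/) ? 7 : 6   # skip the optional label field
    line = NR
    for (i = first; i <= NF; i++) line = line " " $i
    print line
  }' "$1"
}
```

Usage would be `stm_to_text build/output/sample.stm > build/trans/sample/text`, with the paths adjusted to the actual basename.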