Open marcoippolito opened 4 years ago
@marcoippolito Ciao Marco. I have the same dnn.jconf configuration and it works fine, so that is not the problem. Your transcription process also seems to work fine, until it crashes. I suggest forcing the WAV conversion to 16000 Hz with ffmpeg (I use Linux Mint, Ubuntu 18.04 based):
ffmpeg -i your_file.wav -acodec pcm_s16le -ac 1 -ar 16000 your_16000_file.wav
just to make sure of the encoding style.
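If you want to double-check the result without reading the ffmpeg banner, here is a small sketch using Python's standard wave module (the file name mozilla_16000.wav is just the output of the command above; the expected format, 16 kHz mono 16-bit PCM, is my assumption based on the dnn.jconf settings in this thread):

```python
import wave

def check_wav(path, rate=16000, channels=1, sample_bytes=2):
    """Return True if the WAV file at `path` is uncompressed PCM with the
    given sample rate, channel count, and bytes per sample
    (defaults match the ffmpeg command above: 16 kHz, mono, 16-bit)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sample_bytes)

# e.g. check_wav("mozilla_16000.wav") should be True after the conversion
```

Note that the wave module only opens uncompressed PCM WAV files, which is itself a useful sanity check here.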
@zdomjus60 Ciao!!
Thanks for your kind suggestion. Unfortunately there must be something else to fix:
(base) marco@pc01:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ffmpeg -i mozilla.wav -acodec
pcm_s16le -ac 1 -ar 16000 mozilla_16000.wav
ffmpeg version 4.2.2-1ubuntu1~18.04.york0 Copyright (c) 2000-2019 the FFmpeg developers
built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
configuration: --prefix=/usr --extra-version='1ubuntu1~18.04.york0' --toolchain=hardened
--libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl
--disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls
--enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca
--enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype
--enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame
--enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
--enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr
--enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab
--enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265
--enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx
--enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm
--enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264
--enable-shared
libavutil 56. 31.100 / 56. 31.100
libavcodec 58. 54.100 / 58. 54.100
libavformat 58. 29.100 / 58. 29.100
libavdevice 58. 8.100 / 58. 8.100
libavfilter 7. 57.100 / 7. 57.100
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 5.100 / 5. 5.100
libswresample 3. 5.100 / 3. 5.100
libpostproc 55. 5.100 / 55. 5.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'mozilla.wav':
Duration: 00:01:00.00, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'mozilla_16000.wav':
Metadata:
ISFT : Lavf58.29.100
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Metadata:
encoder : Lavc58.54.100 pcm_s16le
size= 1875kB time=00:01:00.00 bitrate= 256.0kbits/s speed=6.98e+03x
video:0kB audio:1875kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead:
0.004062%
(base) marco@pc01:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ../julius/julius -C julius.jconf -dnnconf
dnn.jconf
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: parsing option string: "-htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic"
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
WARNING: m_chkparam: "-cmnstatic" was automatically enabled because you have specified
"-cmnload" at buffered input. To avoid confusion in the future release, please explicitly set
"-cmnstatic" for static CMN.
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15619
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 64681 nodes (32340 branch + 32341 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 7835 nodes (3917 branch + 3918 data)
Stat: init_phmm: logical names: 32341 in HMMList
Stat: init_phmm: base phones: 46 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: m_fusion: force multipath HMM handling by user request
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 0 pseudo phones are added to logical HMM list
Stat: dnn_init: use 1 threads for DNN computation (max 8 cores)
Stat: dnn_init: input: vec 48 * context 11 = 528 dim
Stat: dnn_init: input layer: 528 dim
Stat: dnn_init: 5 hidden layer(s): 1536 dim
Stat: dnn_init: output layer: 7461 dim
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_bias.npy
Stat: dnn_init: state prior loaded: ENVR-v5.3.prior
Stat: calc_dnn: FMA instructions built-in
Stat: calc_dnn: AVX instructions built-in
Stat: calc_dnn: SSE instructions built-in
Stat: clac_dnn: use FMA SIMD instruction (256bit)
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
Stat: init_voca: read 319354 words
Stat: init_ngram: reading in binary n-gram from ENVR-v5.3.lm
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 3290483 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT: 1-gram factoring values has been pre-computed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
Stat: wav2mfcc-pipe: reading initial cepstral mean/variance from file "ENVR-v5.3.norm"
Stat: wav2mfcc-pipe: reading HTK-format cepstral vectors
Stat: wav2mfcc-pipe: finished reading CMN/CVN parameter
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension : WordsInt
- Compiled by : gcc -O6 -fomit-frame-pointer -fPIC
Library configuration: version 4.5
- Audio input
primary A/D-in driver : alsa (Advanced Linux Sound Architecture)
available drivers : alsa oss pulseaudio
wavefile formats : RAW and WAV only
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : integer (4 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
SSE AVX FMA
FMA is available maximum on this cpu, use it
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=ENVR-v5.3.am
hmmmapfilename=ENVR-v5.3.phn
Language Model:
- LM00 "_default"
vocabulary filename=ENVR-v5.3.dct
n-gram filename=ENVR-v5.3.lm (binary format)
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_A_Z (48 dim. from 15 cepstrum + energy with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 26
cepst. lifter = 22
raw energy = True
energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
delta window = 2 frames (20.0 ms) around
acc window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = ON
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, with per-utterance self mean
cep. var. normalization = yes, with a static variance
static variance from file = ENVR-v5.3.norm
base setup from = HTK Config (and HTK defaults)
frame splicing = 11
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
15619 models, 7461 states, 7461 mpdfs, 119424 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_D_A_Z
vector length = 48
number of stream = 1
stream info = [0-47]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 32 Gaussians
max length of model = 5 states
logical base phones = 46
model skip trans. = exist, require multi-path handling
skippable models = sp (1 model(s))
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use max. prob. of same LC)
sp transition penalty = -1.0
DNN parameters:
DNN input dim. = 528 (48 x 11)
DNN output dim. = 7461
# of hidden layers = 5
hidden layer dim. = 1536
state prior factor = 1.000000
state prior log10nize = off
batch size = 1
number of threads = 1
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=n-gram
N-gram info:
spec = 3-gram, backward (right-to-left)
OOV word = <unk>(id=0)
wordset size = 262145
1-gram entries = 262145 ( 2.0 MB)
2-gram entries = 16380163 (213.2 MB) (63% are valid contexts)
3-gram entries = 51815890 (474.1 MB)
LR 2-gram entries= 16380163 ( 63.5 MB)
pass1 = given additional forward 2-gram
Vocabulary Info:
vocabulary size = 319354 words, 2161689 models
average word len = 6.8 models, 20.3 states
maximum state num = 90 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
(-silhead)head sil word = 1: "<s> @0.000000 [<s>] sil(sil)"
(-siltail)tail sil word = 0: "</s> @0.000000 [</s>] sil(sil)"
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 3290483
root node num = 1437
(149 hi-freq. words are separated from tree lexicon)
leaf node num = 319354
fact. node num = 319354
Inter-word N-gram cache:
root node to be cached = 263 / 1437 (isolated only)
word ends to be cached = 262145 (all)
max. allocation size = 275MB
(-lmp) pass1 LM weight = 12.0 ins. penalty = -6.0
(-lmp2) pass2 LM weight = 12.0 ins. penalty = -6.0
(-transp)trans. penalty = +0.0 per word
(-cmalpha)CM alpha coef = 0.050000
inter-word short pause = on (append "sp" for each word tail)
sp transition penalty = -1.0
Search parameters:
multi-path handling = yes, multi-path mode enabled
(-b) trellis beam width = 4000
(-bs)score pruning thres= disabled
(-n)search candidate num= 40
(-s) search stack size = 2000
(-m) search overflow = after 8000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 360
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 40 candidates found
(-output) and output = 1 candidates out of above
factoring score: 1-gram prob. (statically assigned beforehand)
output word alignments
short pause segmentation = on
sp duration length = 10 frames
fall back on search fail = on, adopt 1st pass result as final
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = (forced) buffered, batch
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = waveform file
input filelist = test.dbl
sampling freq. = 16000 Hz required
threaded A/D-in = supported, off
zero frames stripping = on
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = off
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean and variance norm. for batch decoding: *
* constant mean and variance was loaded from file. *
* they will be applied constantly for all input. *
*************************************************************
------
### read waveform input
Stat: adin_file: input speechfile: mozilla.wav
Warning: strip: sample 212-232 has zero value, stripped
Warning: strip: sample 312-327 has zero value, stripped
Warning: strip: sample 391-406 has zero value, stripped
Warning: strip: sample 914-930 has zero value, stripped
Warning: strip: sample 51221-51244 has zero value, stripped
Warning: strip: sample 112765-112783 has zero value, stripped
Warning: strip: sample 113264-113279 has zero value, stripped
Warning: strip: sample 113394-113409 has zero value, stripped
Warning: strip: sample 113701-113719 has zero value, stripped
Warning: strip: sample 114939-114959 has zero value, stripped
Warning: strip: sample 115667-115682 has zero value, stripped
Warning: strip: sample 115932-115948 has zero value, stripped
Warning: strip: sample 116475-116490 has zero value, stripped
Warning: strip: sample 116605-116623 has zero value, stripped
Warning: strip: sample 117040-117055 has zero value, stripped
Warning: strip: sample 117490-117507 has zero value, stripped
Warning: strip: sample 868-884 has zero value, stripped
STAT: 50800 samples (3.17 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: <s> without the data said the article was useless </s>
pass1_best_wordseq: <s> without the data said the article was useless </s>
pass1_best_phonemeseq: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w
ax z | y uw s l ah s | sil
pass1_best_score: 282.374390
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 42566 generated, 7034 pushed, 454 nodes popped in 306
ALIGN: === word alignment begin ===
sentence1: <s> without the data said the article was useless </s>
wseq1: <s> without the data said the article was useless </s>
phseq1: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s |
sil
cmscore1: 0.785 0.892 0.318 0.284 0.669 0.701 0.818 0.103 0.528 1.000
score1: 261.947144
=== begin forced alignment ===
-- word alignment --
id: from to n_score unit
----------------------------------------
[ 0 17] 0.684685 <s> [<s>]
[ 18 51] 2.635743 without [without]
[ 52 62] 1.305501 the [the]
[ 63 91] 2.258720 data [data]
[ 92 129] 2.324036 said [said]
[ 130 138] 2.702694 the [the]
[ 139 173] 2.214880 article [article]
[ 174 194] 1.476129 was [was]
[ 195 264] 2.325301 useless [useless]
[ 265 305] 0.749876 </s> [</s>]
re-computed AM score: 593.854553
=== end forced alignment ===
STAT: 24800 samples (1.55 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: <s> i've got go to him
pass1_best_wordseq: <s> i've got go to him
pass1_best_phonemeseq: sil | ah ih b | g aa t | g ow | t ah | hh ih m
pass1_best_score: 120.275803
### Recognition: 2nd pass (RL heuristic best-first)
Segmentation fault (core dumped)
@marcoippolito Uhm, ok. We'll have to look for another solution. I'm attaching my full dump for you: it's Boris Johnson's speech as Prime Minister (I'm not a fan, but it serves as a sample :-)). Think of it as a successful transcription: many words come out wrong ("god bless the green"), but Julius has its limits. I'm curious about the 2nd pass. speech.txt
https://www.youtube.com/watch?v=YypKBfFtovU (this looks like the audio content)
Unsure if this is related, but I also get
### Recognition: 2nd pass (RL heuristic best-first)
Segmentation fault
when building with -flto (-O2, gcc 13.2.1, -march=skylake; more build details upon request).
Otherwise it runs smooth as silk when testing with WAV files as described in README.md.
Just tested: -flto with -O1 works fine here. So testing the optimizations that -O2 adds on top of -O1 one by one could help narrow down where the problem originates.
I compiled and installed Julius on Ubuntu 18.04.4 Desktop on a laptop, and then modified /ENVR-v5.4.Dnn.Bin/dnn.jconf as follows:
The execution of the example leads to Segmentation fault (core dumped)
gcc version 9.3.0 (Ubuntu 9.3.0-11ubuntu0~18.04.1)
I also installed Julius on a PC with Ubuntu 18.04.4 and gcc version 9.3.0 (Ubuntu 9.3.0-11ubuntu0~18.04.1), but got the same problem: Segmentation fault (core dumped)
Stat: init_voca: read 319354 words
Stat: init_ngram: reading in binary n-gram from ENVR-v5.3.lm
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 3290483 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT: 1-gram factoring values has been pre-computed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
Stat: wav2mfcc-pipe: reading initial cepstral mean/variance from file "ENVR-v5.3.norm"
Stat: wav2mfcc-pipe: reading HTK-format cepstral vectors
Stat: wav2mfcc-pipe: finished reading CMN/CVN parameter
STAT: All init successfully done
How can I solve this problem? Looking forward to your kind help. Marco