julius-speech / julius

Open-Source Large Vocabulary Continuous Speech Recognition Engine
BSD 3-Clause "New" or "Revised" License

/julius/ENVR-v5.4.Dnn.Bin$ ../julius/julius -C julius.jconf Segmentation fault (core dumped) #132

Open marcoippolito opened 4 years ago

marcoippolito commented 4 years ago

I compiled and installed Julius on Ubuntu 18.04.4 Desktop on a laptop, and then modified /ENVR-v5.4.Dnn.Bin/dnn.jconf as follows:

feature_type MFCC_E_D_A_Z
feature_options -htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic
num_threads 1
feature_len 48
context_len 11
input_nodes 528
output_nodes 7461
hidden_nodes 1536
hidden_layers 5
W1 ENVR-v5.3.layer2_weight.npy
W2 ENVR-v5.3.layer3_weight.npy
W3 ENVR-v5.3.layer4_weight.npy
W4 ENVR-v5.3.layer5_weight.npy
W5 ENVR-v5.3.layer6_weight.npy
B1 ENVR-v5.3.layer2_bias.npy
B2 ENVR-v5.3.layer3_bias.npy
B3 ENVR-v5.3.layer4_bias.npy
B4 ENVR-v5.3.layer5_bias.npy
B5 ENVR-v5.3.layer6_bias.npy
output_W ENVR-v5.3.layerout_weight.npy
output_B ENVR-v5.3.layerout_bias.npy
state_prior_factor 1.0
state_prior ENVR-v5.3.prior
state_prior_log10nize false
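One thing worth ruling out before chasing the crash itself is a geometry mismatch between dnn.jconf and the model files. The sketch below is hypothetical (not from the Julius docs): it checks that the declared dimensions are internally consistent and lists the weight-matrix shapes the .npy files would be expected to hold. The row/column orientation is my assumption; Julius may store the transposes.

```python
# Hypothetical sanity check: is the dnn.jconf geometry self-consistent,
# and what shapes should the .npy weight files have?
# Orientation (rows, cols) below is an assumption, not confirmed by Julius docs.

INPUT_NODES, HIDDEN_NODES, HIDDEN_LAYERS, OUTPUT_NODES = 528, 1536, 5, 7461
FEATURE_LEN, CONTEXT_LEN = 48, 11

# dnn.jconf requires input_nodes == feature_len * context_len
assert INPUT_NODES == FEATURE_LEN * CONTEXT_LEN

def expected_weight_shapes(inp, hid, layers, out):
    """Expected shapes for W1..W5 and output_W, assuming each matrix
    maps the previous layer's dimension to the next layer's."""
    shapes = [(inp, hid)]                  # W1: input -> first hidden
    shapes += [(hid, hid)] * (layers - 1)  # W2..W5: hidden -> hidden
    shapes.append((hid, out))              # output_W: last hidden -> output
    return shapes

print(expected_weight_shapes(INPUT_NODES, HIDDEN_NODES, HIDDEN_LAYERS, OUTPUT_NODES))

# To compare against the actual files (e.g. with NumPy):
#   import numpy as np
#   w1 = np.load("ENVR-v5.3.layer2_weight.npy")
#   assert w1.shape in ((528, 1536), (1536, 528))
```

In this run the DNN loads and the first utterance decodes fine, so a shape mismatch is unlikely, but it is a cheap check.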

Running the example leads to a segmentation fault (core dumped):

(base) marco@marco-U36SG:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ../julius/julius -C julius.jconf -dnnconf dnn.jconf
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: parsing option string: "-htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic"
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
WARNING: m_chkparam: "-cmnstatic" was automatically enabled because you have specified     
"-cmnload" at buffered input.  To avoid confusion in the future release, please explicitly set 
"-cmnstatic" for static CMN.
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15619
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 64681 nodes (32340 branch + 32341 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 7835 nodes (3917 branch + 3918 data)
Stat: init_phmm: logical names: 32341 in HMMList
Stat: init_phmm: base phones:    46 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: m_fusion: force multipath HMM handling by user request
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 0 pseudo phones are added to logical HMM list
Stat: dnn_init: use 1 threads for DNN computation (max 4 cores)
Stat: dnn_init: input: vec 48 * context 11 = 528 dim
Stat: dnn_init: input layer: 528 dim
Stat: dnn_init: 5 hidden layer(s): 1536 dim
Stat: dnn_init: output layer: 7461 dim
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_bias.npy
Stat: dnn_init: state prior loaded: ENVR-v5.3.prior 
Stat: calc_dnn: FMA instructions built-in
Stat: calc_dnn: AVX instructions built-in
Stat: calc_dnn: SSE instructions built-in
Stat: clac_dnn: use AVX SIMD instruction (256bit)
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
Stat: init_voca: read 319354 words
Stat: init_ngram: reading in binary n-gram from ENVR-v5.3.lm
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram 
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 3290483 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT:  1-gram factoring values has been pre-computed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
Stat: wav2mfcc-pipe: reading initial cepstral mean/variance from file "ENVR-v5.3.norm"
Stat: wav2mfcc-pipe: reading HTK-format cepstral vectors
Stat: wav2mfcc-pipe: finished reading CMN/CVN parameter
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : WordsInt
 -  Compiled by  : gcc -O6 -fomit-frame-pointer -fPIC
Library configuration: version 4.5
 - Audio input
    primary A/D-in driver   : alsa (Advanced Linux Sound Architecture)
    available drivers       : alsa oss pulseaudio
    wavefile formats        : RAW and WAV only
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : integer (4 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    AVX is available maximum on this cpu, use it

------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=ENVR-v5.3.am
    hmmmapfilename=ENVR-v5.3.phn

 Language Model:
 - LM00 "_default"
    vocabulary filename=ENVR-v5.3.dct
    n-gram  filename=ENVR-v5.3.lm (binary format)

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
           parameter = MFCC_E_D_A_Z (48 dim. from 15 cepstrum + energy with CMN)
         sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 26
       cepst. lifter = 22
          raw energy = True
    energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
    delta window = 2 frames (20.0 ms) around
          acc window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = ON
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, with per-utterance self mean
 cep. var. normalization = yes, with a static variance
static variance from file = ENVR-v5.3.norm

     base setup from = HTK Config (and HTK defaults)
      frame splicing = 11

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    15619 models, 7461 states, 7461 mpdfs, 119424 Gaussians are defined
          model type = context dependency handling ON
      training parameter = MFCC_E_D_A_Z
       vector length = 48
    number of stream = 1
         stream info = [0-47]
    cov. matrix type = DIAGC
       duration type = NULLD
    max mixture size = 32 Gaussians
     max length of model = 5 states
     logical base phones = 46
       model skip trans. = exist, require multi-path handling
      skippable models = sp (1 model(s))

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
       short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use max. prob. of same LC)
   sp transition penalty = -1.0

 DNN parameters:
          DNN input dim. = 528 (48 x 11)
         DNN output dim. = 7461
      # of hidden layers = 5
       hidden layer dim. = 1536
      state prior factor = 1.000000
   state prior log10nize = off
              batch size = 1
       number of threads = 1

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=n-gram

 N-gram info:
                spec = 3-gram, backward (right-to-left)
            OOV word = <unk>(id=0)
        wordset size = 262145
      1-gram entries =     262145  (  2.0 MB)
      2-gram entries =   16380163  (213.2 MB) (63% are valid contexts)
      3-gram entries =   51815890  (474.1 MB)
    LR 2-gram entries=   16380163  ( 63.5 MB)
               pass1 = given additional forward 2-gram

 Vocabulary Info:
        vocabulary size  = 319354 words, 2161689 models
        average word len = 6.8 models, 20.3 states
       maximum state num = 90 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
    (-silhead)head sil word = 1: "<s> @0.000000 [<s>] sil(sil)"
    (-siltail)tail sil word = 0: "</s> @0.000000 [</s>] sil(sil)"

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
     total node num = 3290483
      root node num =   1437
    (149 hi-freq. words are separated from tree lexicon)
      leaf node num = 319354
     fact. node num = 319354

 Inter-word N-gram cache: 
    root node to be cached = 263 / 1437 (isolated only)
    word ends to be cached = 262145 (all)
      max. allocation size = 275MB
    (-lmp)  pass1 LM weight = 12.0  ins. penalty = -6.0
    (-lmp2) pass2 LM weight = 12.0  ins. penalty = -6.0
    (-transp)trans. penalty = +0.0 per word
    (-cmalpha)CM alpha coef = 0.050000

     inter-word short pause = on (append "sp" for each word tail)
      sp transition penalty = -1.0
  Search parameters: 
        multi-path handling = yes, multi-path mode enabled
    (-b) trellis beam width = 4000
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 40
    (-s)  search stack size = 2000
    (-m)    search overflow = after 8000 hypothesis poped
            2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 360
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 40 candidates found
    (-output)    and output = 1 candidates out of above
     factoring score: 1-gram prob. (statically assigned beforehand)
     output word alignments
    short pause segmentation = on
          sp duration length = 10 frames
    fall back on search fail = on, adopt 1st pass result as final

------------------------------------------------------------
Decoding algorithm:

    1st pass input processing = (forced) buffered, batch
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = waveform file
              input filelist = test.dbl
              sampling freq. = 16000 Hz required
             threaded A/D-in = supported, off
       zero frames stripping = on
             silence cutting = on
                 level thres = 2000 / 32767
             zerocross thres = 60 / sec.
                 head margin = 300 msec.
                 tail margin = 400 msec.
                  chunk size = 1000 samples
               FVAD switch value = -1 (disabled)
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean and variance norm. for batch decoding:      *
    * constant mean and variance was loaded from file.          *
    * they will be applied constantly for all input.            *
    *************************************************************

------
### read waveform input
Stat: adin_file: input speechfile: mozilla.wav
 Warning: strip: sample 212-232 has zero value, stripped
Warning: strip: sample 312-327 has zero value, stripped
Warning: strip: sample 391-406 has zero value, stripped
Warning: strip: sample 914-930 has zero value, stripped
Warning: strip: sample 51221-51244 has zero value, stripped
Warning: strip: sample 112765-112783 has zero value, stripped
Warning: strip: sample 113264-113279 has zero value, stripped
Warning: strip: sample 113394-113409 has zero value, stripped
Warning: strip: sample 113701-113719 has zero value, stripped
Warning: strip: sample 114939-114959 has zero value, stripped
Warning: strip: sample 115667-115682 has zero value, stripped
Warning: strip: sample 115932-115948 has zero value, stripped
Warning: strip: sample 116475-116490 has zero value, stripped
Warning: strip: sample 116605-116623 has zero value, stripped
Warning: strip: sample 117040-117055 has zero value, stripped
Warning: strip: sample 117490-117507 has zero value, stripped
Warning: strip: sample 868-884 has zero value, stripped
STAT: 50800 samples (3.17 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)

pass1_best: <s> without the data said the article was useless </s>
pass1_best_wordseq: <s> without the data said the article was useless </s>
pass1_best_phonemeseq: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w 
ax z | y uw s l ah s | sil
pass1_best_score: 282.374634
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 42566 generated, 7034 pushed, 454 nodes popped in 306
ALIGN: === word alignment begin ===
sentence1: <s> without the data said the article was useless </s>
wseq1: <s> without the data said the article was useless </s>
phseq1: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s |
 sil
cmscore1: 0.785 0.892 0.318 0.284 0.669 0.701 0.818 0.103 0.528 1.000
score1: 261.947296
=== begin forced alignment ===
-- word alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   17]  0.684684  <s>  [<s>]
[  18   51]  2.635746  without  [without]
[  52   62]  1.305503  the  [the]
[  63   91]  2.258721  data [data]
[  92  129]  2.324038  said [said]
[ 130  138]  2.702694  the  [the]
[ 139  173]  2.214879  article  [article]
[ 174  194]  1.476131  was  [was]
[ 195  264]  2.325302  useless  [useless]
[ 265  305]  0.749879  </s> [</s>]
re-computed AM score: 593.854919
=== end forced alignment ===

STAT: 24800 samples (1.55 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)
pass1_best: <s> i've got go to him
pass1_best_wordseq: <s> i've got go to him
pass1_best_phonemeseq: sil | ah ih b | g aa t | g ow | t ah | hh ih m
pass1_best_score: 120.275818
### Recognition: 2nd pass (RL heuristic best-first)
Segmentation fault (core dumped)

gcc version 9.3.0 (Ubuntu 9.3.0-11ubuntu0~18.04.1)
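In case a stack trace helps pin down where it crashes, here is a sketch of how I could capture one (hypothetical, assuming gdb is installed and Julius was built with debug symbols; paths are the ones from the run above):

```shell
#!/bin/sh
# Hypothetical debugging sketch: re-run the failing command under gdb
# to capture a backtrace at the segfault. Paths are from the report above.
BIN=../julius/julius
ARGS="-C julius.jconf -dnnconf dnn.jconf"

if [ -x "$BIN" ]; then
    # -batch exits after the listed commands; "run" reproduces the
    # segfault and "bt" prints the stack of the faulting thread.
    gdb -batch -ex run -ex bt --args "$BIN" $ARGS
else
    echo "julius binary not found at $BIN; adjust BIN first"
fi
```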

I also installed Julius on a PC with Ubuntu 18.04.4 and gcc 9.3.0 (Ubuntu 9.3.0-11ubuntu0~18.04.1), but got the same problem: Segmentation fault (core dumped).

(base) marco@pc01:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ../julius/julius -C julius.jconf -dnnconf dnn.jconf
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: parsing option string: "-htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic"
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
WARNING: m_chkparam: "-cmnstatic" was automatically enabled because you have specified   
"-cmnload" at buffered input.  To avoid confusion in the future release, please explicitly set  
"-cmnstatic" for static CMN.
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15619
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 64681 nodes (32340 branch + 32341 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 7835 nodes (3917 branch + 3918 data)
Stat: init_phmm: logical names: 32341 in HMMList
Stat: init_phmm: base phones:    46 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: m_fusion: force multipath HMM handling by user request
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 0 pseudo phones are added to logical HMM list
Stat: dnn_init: use 1 threads for DNN computation (max 8 cores)
Stat: dnn_init: input: vec 48 * context 11 = 528 dim
Stat: dnn_init: input layer: 528 dim
Stat: dnn_init: 5 hidden layer(s): 1536 dim
Stat: dnn_init: output layer: 7461 dim
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_bias.npy
Stat: dnn_init: state prior loaded: ENVR-v5.3.prior
Stat: calc_dnn: FMA instructions built-in
Stat: calc_dnn: AVX instructions built-in
Stat: calc_dnn: SSE instructions built-in
Stat: clac_dnn: use FMA SIMD instruction (256bit)
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default

Stat: init_voca: read 319354 words
Stat: init_ngram: reading in binary n-gram from ENVR-v5.3.lm
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 3290483 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT: 1-gram factoring values has been pre-computed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
Stat: wav2mfcc-pipe: reading initial cepstral mean/variance from file "ENVR-v5.3.norm"
Stat: wav2mfcc-pipe: reading HTK-format cepstral vectors
Stat: wav2mfcc-pipe: finished reading CMN/CVN parameter
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : WordsInt
 -  Compiled by  : gcc -O6 -fomit-frame-pointer -fPIC
Library configuration: version 4.5
 - Audio input
    primary A/D-in driver   : alsa (Advanced Linux Sound Architecture)
    available drivers       : alsa oss pulseaudio
    wavefile formats        : RAW and WAV only
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : integer (4 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    FMA is available maximum on this cpu, use it

------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=ENVR-v5.3.am
    hmmmapfilename=ENVR-v5.3.phn

 Language Model:
 - LM00 "_default"
    vocabulary filename=ENVR-v5.3.dct
    n-gram  filename=ENVR-v5.3.lm (binary format)

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
           parameter = MFCC_E_D_A_Z (48 dim. from 15 cepstrum + energy with CMN)
    sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 26
       cepst. lifter = 22
          raw energy = True
    energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
        delta window = 2 frames (20.0 ms) around
          acc window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = ON
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, with per-utterance self mean
 cep. var. normalization = yes, with a static variance
 static variance from file = ENVR-v5.3.norm

     base setup from = HTK Config (and HTK defaults)
      frame splicing = 11

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    15619 models, 7461 states, 7461 mpdfs, 119424 Gaussians are defined
          model type = context dependency handling ON
      training parameter = MFCC_E_D_A_Z
       vector length = 48
    number of stream = 1
          stream info = [0-47]
    cov. matrix type = DIAGC
       duration type = NULLD
    max mixture size = 32 Gaussians
      max length of model = 5 states
     logical base phones = 46
       model skip trans. = exist, require multi-path handling
      skippable models = sp (1 model(s))

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use max. prob. of same LC)
   sp transition penalty = -1.0

 DNN parameters:
          DNN input dim. = 528 (48 x 11)
         DNN output dim. = 7461
      # of hidden layers = 5
       hidden layer dim. = 1536
      state prior factor = 1.000000
   state prior log10nize = off
              batch size = 1
       number of threads = 1

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=n-gram

 N-gram info:
                spec = 3-gram, backward (right-to-left)
            OOV word = <unk>(id=0)
        wordset size = 262145
      1-gram entries =     262145  (  2.0 MB)
      2-gram entries =   16380163  (213.2 MB) (63% are valid contexts)
      3-gram entries =   51815890  (474.1 MB)
    LR 2-gram entries=   16380163  ( 63.5 MB)
               pass1 = given additional forward 2-gram

 Vocabulary Info:
        vocabulary size  = 319354 words, 2161689 models
        average word len = 6.8 models, 20.3 states
       maximum state num = 90 nodes per word
       transparent words = not exist
       words under class = not exist

  Parameters:
    (-silhead)head sil word = 1: "<s> @0.000000 [<s>] sil(sil)"
    (-siltail)tail sil word = 0: "</s> @0.000000 [</s>] sil(sil)"

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
     total node num = 3290483
      root node num =   1437
    (149 hi-freq. words are separated from tree lexicon)
      leaf node num = 319354
     fact. node num = 319354

 Inter-word N-gram cache: 
    root node to be cached = 263 / 1437 (isolated only)
    word ends to be cached = 262145 (all)
       max. allocation size = 275MB
    (-lmp)  pass1 LM weight = 12.0  ins. penalty = -6.0
    (-lmp2) pass2 LM weight = 12.0  ins. penalty = -6.0
    (-transp)trans. penalty = +0.0 per word
    (-cmalpha)CM alpha coef = 0.050000

     inter-word short pause = on (append "sp" for each word tail)
      sp transition penalty = -1.0
 Search parameters: 
        multi-path handling = yes, multi-path mode enabled
    (-b) trellis beam width = 4000
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 40
    (-s)  search stack size = 2000
    (-m)    search overflow = after 8000 hypothesis poped
            2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 360
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 40 candidates found
     (-output)    and output = 1 candidates out of above
     factoring score: 1-gram prob. (statically assigned beforehand)
     output word alignments
    short pause segmentation = on
          sp duration length = 10 frames
    fall back on search fail = on, adopt 1st pass result as final

------------------------------------------------------------
Decoding algorithm:

    1st pass input processing = (forced) buffered, batch
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = waveform file
              input filelist = test.dbl
              sampling freq. = 16000 Hz required
             threaded A/D-in = supported, off
       zero frames stripping = on
             silence cutting = on
                 level thres = 2000 / 32767
              zerocross thres = 60 / sec.
                  head margin = 300 msec.
                 tail margin = 400 msec.
                  chunk size = 1000 samples
           FVAD switch value = -1 (disabled)
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean and variance norm. for batch decoding:      *
    * constant mean and variance was loaded from file.          *
    * they will be applied constantly for all input.            *
    *************************************************************

------
### read waveform input
Stat: adin_file: input speechfile: mozilla.wav
Warning: strip: sample 212-232 has zero value, stripped
Warning: strip: sample 312-327 has zero value, stripped
Warning: strip: sample 391-406 has zero value, stripped
Warning: strip: sample 914-930 has zero value, stripped
Warning: strip: sample 51221-51244 has zero value, stripped
Warning: strip: sample 112765-112783 has zero value, stripped
Warning: strip: sample 113264-113279 has zero value, stripped
Warning: strip: sample 113394-113409 has zero value, stripped
Warning: strip: sample 113701-113719 has zero value, stripped
Warning: strip: sample 114939-114959 has zero value, stripped
Warning: strip: sample 115667-115682 has zero value, stripped
Warning: strip: sample 115932-115948 has zero value, stripped
Warning: strip: sample 116475-116490 has zero value, stripped
Warning: strip: sample 116605-116623 has zero value, stripped
Warning: strip: sample 117040-117055 has zero value, stripped
Warning: strip: sample 117490-117507 has zero value, stripped
Warning: strip: sample 868-884 has zero value, stripped
STAT: 50800 samples (3.17 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)

pass1_best: <s> without the data said the article was useless </s>
pass1_best_wordseq: <s> without the data said the article was useless </s>
pass1_best_phonemeseq: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s | sil
pass1_best_score: 282.374390
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 42566 generated, 7034 pushed, 454 nodes popped in 306
ALIGN: === word alignment begin ===
sentence1: <s> without the data said the article was useless </s>
wseq1: <s> without the data said the article was useless </s>
phseq1: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s | sil
cmscore1: 0.785 0.892 0.318 0.284 0.669 0.701 0.818 0.103 0.528 1.000
score1: 261.947144
=== begin forced alignment ===
-- word alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   17]  0.684685  <s>  [<s>]
[  18   51]  2.635743  without  [without]
[  52   62]  1.305501  the  [the]
[  63   91]  2.258720  data [data]
[  92  129]  2.324036  said [said]
[ 130  138]  2.702694  the  [the]
[ 139  173]  2.214880  article  [article]
[ 174  194]  1.476129  was  [was]
[ 195  264]  2.325301  useless  [useless]
[ 265  305]  0.749876  </s> [</s>]
re-computed AM score: 593.854553
=== end forced alignment ===

STAT: 24800 samples (1.55 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)

pass1_best: <s> i've got go to him
pass1_best_wordseq: <s> i've got go to him
pass1_best_phonemeseq: sil | ah ih b | g aa t | g ow | t ah | hh ih m
pass1_best_score: 120.275803
### Recognition: 2nd pass (RL heuristic best-first)
Segmentation fault (core dumped)

How can I solve this problem? Looking forward to your kind help. Marco

zdomjus60 commented 4 years ago

@marcoippolito Ciao Marco. I have the same dnn.jconf configuration and it works fine, so that is not the problem. Your transcription process also seems to work fine until it crashes. I can suggest forcing WAV conversion to 16000 Hz with ffmpeg (I use Linux Mint, which is Ubuntu 18.04 based):

ffmpeg -i your_file.wav -acodec pcm_s16le -ac 1 -ar 16000 your_16000_file.wav

just to make sure of the encoding style.
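Before re-running Julius, the converted file's header can also be sanity-checked with Python's standard-library wave module. This is a minimal sketch; the helper name check_wav is hypothetical, and mozilla_16000.wav is just the file produced by the ffmpeg command above:

```python
import wave

def check_wav(path, rate=16000, channels=1, sampwidth=2):
    """Return True if `path` is a PCM WAV with the given sample rate,
    channel count and sample width (2 bytes = 16-bit, which is what
    Julius expects for 16 kHz mono input)."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getnchannels() == channels
                and w.getsampwidth() == sampwidth)

# Example (hypothetical): check_wav("mozilla_16000.wav")
```

If this returns False, the file is not 16 kHz / mono / 16-bit PCM and the front end may not behave as expected.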

marcoippolito commented 4 years ago

@zdomjus60 Ciao!!

Thanks for your kind suggestion. Unfortunately there must be something else to fix:

(base) marco@pc01:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ffmpeg -i mozilla.wav -acodec pcm_s16le -ac 1 -ar 16000 mozilla_16000.wav
ffmpeg version 4.2.2-1ubuntu1~18.04.york0 Copyright (c) 2000-2019 the FFmpeg developers
  built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04)
  configuration: --prefix=/usr --extra-version='1ubuntu1~18.04.york0' --toolchain=hardened
    --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl
    --disable-stripping --enable-avresample --disable-filter=resample --enable-avisynth --enable-gnutls
    --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca
    --enable-libcdio --enable-libcodec2 --enable-libflite --enable-libfontconfig --enable-libfreetype
    --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame
    --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse
    --enable-librsvg --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr
    --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab
    --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265
    --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-lv2 --enable-omx
    --enable-openal --enable-opencl --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm
    --enable-libiec61883 --enable-nvenc --enable-chromaprint --enable-frei0r --enable-libx264
    --enable-shared
  libavutil      56. 31.100 / 56. 31.100
  libavcodec     58. 54.100 / 58. 54.100
  libavformat    58. 29.100 / 58. 29.100
  libavdevice    58.  8.100 / 58.  8.100
  libavfilter     7. 57.100 /  7. 57.100
  libavresample   4.  0.  0 /  4.  0.  0
  libswscale      5.  5.100 /  5.  5.100
  libswresample   3.  5.100 /  3.  5.100
  libpostproc    55.  5.100 / 55.  5.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'mozilla.wav':
  Duration: 00:01:00.00, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
Stream mapping:
  Stream #0:0 -> #0:0 (pcm_s16le (native) -> pcm_s16le (native))
Press [q] to stop, [?] for help
Output #0, wav, to 'mozilla_16000.wav':
  Metadata:
    ISFT            : Lavf58.29.100
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, mono, s16, 256 kb/s
    Metadata:
      encoder         : Lavc58.54.100 pcm_s16le
size=    1875kB time=00:01:00.00 bitrate= 256.0kbits/s speed=6.98e+03x    
video:0kB audio:1875kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.004062%

(base) marco@pc01:~/cpp/speak/julius/ENVR-v5.4.Dnn.Bin$ ../julius/julius -C julius.jconf -dnnconf dnn.jconf
STAT: include config: julius.jconf
Stat: para: parsing HTK Config file: wav_config
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
STAT: parsing option string: "-htkconf wav_config -cvn -cmnload ENVR-v5.3.norm -cvnstatic"
Stat: para: parsing HTK Config file: wav_config  
Warning: para: "SOURCEFORMAT" ignored (not supported, or irrelevant)
Warning: para: TARGETKIND skipped (will be determined by AM header)
Stat: para: TARGETRATE=100000.0
Warning: para: "SAVECOMPRESSED" ignored (not supported, or irrelevant)
Warning: para: "SAVEWITHCRC" ignored (not supported, or irrelevant)
Stat: para: WINDOWSIZE=250000.0
Stat: para: USEHAMMING=T
Stat: para: PREEMCOEF=0.97
Stat: para: NUMCHANS=26
Stat: para: CEPLIFTER=22
Warning: para: NUMCEPS skipped (will be determined by AM header)
Warning: no SOURCERATE found
Warning: assume source waveform sample rate to 625 (16kHz)
WARNING: m_chkparam: "-cmnstatic" was automatically enabled because you have specified "-cmnload" at buffered input.  To avoid confusion in the future release, please explicitly set "-cmnstatic" for static CMN.
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: check_hmm_restriction: an HMM with several arcs from initial state found: "sp"
Stat: read_binhmm: this HMM requires multipath handling at decoding
Stat: init_phmm: defined HMMs: 15619
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 64681 nodes (32340 branch + 32341 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 7835 nodes (3917 branch + 3918 data)
Stat: init_phmm: logical names: 32341 in HMMList
Stat: init_phmm: base phones:    46 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: m_fusion: force multipath HMM handling by user request 
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 0 pseudo phones are added to logical HMM list
Stat: dnn_init: use 1 threads for DNN computation (max 8 cores)
Stat: dnn_init: input: vec 48 * context 11 = 528 dim
Stat: dnn_init: input layer: 528 dim
Stat: dnn_init: 5 hidden layer(s): 1536 dim
Stat: dnn_init: output layer: 7461 dim
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer2_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer3_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer4_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer5_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layer6_bias.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_weight.npy
Stat: dnn_layer_load: loaded ENVR-v5.3.layerout_bias.npy
Stat: dnn_init: state prior loaded: ENVR-v5.3.prior
Stat: calc_dnn: FMA instructions built-in
Stat: calc_dnn: AVX instructions built-in
Stat: calc_dnn: SSE instructions built-in
Stat: clac_dnn: use FMA SIMD instruction (256bit)
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
Stat: init_voca: read 319354 words
Stat: init_ngram: reading in binary n-gram from ENVR-v5.3.lm
Stat: ngram_read_bin: file version: 5
Stat: ngram_read_bin_v5: this is backward 3-gram file
stat: ngram_read_bin_v5: reading 1-gram
stat: ngram_read_bin_v5: reading 2-gram
stat: ngram_read_bin_v5: reading 3-gram
Stat: ngram_read_bin_v5: reading additional LR 2-gram
Stat: ngram_read_bin: making entry name index
Stat: init_ngram: found unknown word entry "<unk>"
Stat: init_ngram: finished reading n-gram
Stat: init_ngram: mapping dictonary words to n-gram entries
Stat: init_ngram: finished word-to-ngram mapping
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree    
STAT: lexicon size: 3290483 nodes
STAT: coordination check passed
STAT: make successor lists for unigram factoring
STAT: done
STAT:  1-gram factoring values has been pre-computed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
Stat: wav2mfcc-pipe: reading initial cepstral mean/variance from file "ENVR-v5.3.norm"
Stat: wav2mfcc-pipe: reading HTK-format cepstral vectors
Stat: wav2mfcc-pipe: finished reading CMN/CVN parameter
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.5 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : WordsInt
 -  Compiled by  : gcc -O6 -fomit-frame-pointer -fPIC
Library configuration: version 4.5
 - Audio input
    primary A/D-in driver   : alsa (Advanced Linux Sound Architecture)
    available drivers       : alsa oss pulseaudio
    wavefile formats        : RAW and WAV only
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : integer (4 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN
    SSE AVX FMA
    FMA is available maximum on this cpu, use it

------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=ENVR-v5.3.am
    hmmmapfilename=ENVR-v5.3.phn

 Language Model:
 - LM00 "_default"
    vocabulary filename=ENVR-v5.3.dct
    n-gram  filename=ENVR-v5.3.lm (binary format)

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
        parameter = MFCC_E_D_A_Z (48 dim. from 15 cepstrum + energy with CMN)
        sample frequency = 16000 Hz
        sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
         pre-emphasis = 0.97
        # filterbank = 26
       cepst. lifter = 22
          raw energy = True
    energy normalize = True (scale = 0.1, silence floor = 50.0 dB)
        delta window = 2 frames (20.0 ms) around
          acc window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = ON
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, with per-utterance self mean
 cep. var. normalization = yes, with a static variance
static variance from file = ENVR-v5.3.norm

     base setup from = HTK Config (and HTK defaults)
      frame splicing = 11

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    15619 models, 7461 states, 7461 mpdfs, 119424 Gaussians are defined
          model type = context dependency handling ON
      training parameter = MFCC_E_D_A_Z
       vector length = 48
    number of stream = 1
         stream info = [0-47]
    cov. matrix type = DIAGC
       duration type = NULLD
    max mixture size = 32 Gaussians
         max length of model = 5 states
     logical base phones = 46
       model skip trans. = exist, require multi-path handling
      skippable models = sp (1 model(s))

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use max. prob. of same LC)
    sp transition penalty = -1.0

 DNN parameters:
          DNN input dim. = 528 (48 x 11)
         DNN output dim. = 7461
      # of hidden layers = 5
       hidden layer dim. = 1536
      state prior factor = 1.000000
   state prior log10nize = off
              batch size = 1
       number of threads = 1

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=n-gram

 N-gram info:
                spec = 3-gram, backward (right-to-left)
            OOV word = <unk>(id=0)
        wordset size = 262145
      1-gram entries =     262145  (  2.0 MB)
      2-gram entries =   16380163  (213.2 MB) (63% are valid contexts)
      3-gram entries =   51815890  (474.1 MB)
    LR 2-gram entries=   16380163  ( 63.5 MB)
               pass1 = given additional forward 2-gram

 Vocabulary Info:
        vocabulary size  = 319354 words, 2161689 models
        average word len = 6.8 models, 20.3 states
       maximum state num = 90 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
    (-silhead)head sil word = 1: "<s> @0.000000 [<s>] sil(sil)"
    (-siltail)tail sil word = 0: "</s> @0.000000 [</s>] sil(sil)"

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
      total node num = 3290483
      root node num =   1437
    (149 hi-freq. words are separated from tree lexicon)
      leaf node num = 319354
     fact. node num = 319354

  Inter-word N-gram cache: 
    root node to be cached = 263 / 1437 (isolated only)
    word ends to be cached = 262145 (all)
      max. allocation size = 275MB
    (-lmp)  pass1 LM weight = 12.0  ins. penalty = -6.0
    (-lmp2) pass2 LM weight = 12.0  ins. penalty = -6.0
    (-transp)trans. penalty = +0.0 per word
    (-cmalpha)CM alpha coef = 0.050000

     inter-word short pause = on (append "sp" for each word tail)
      sp transition penalty = -1.0
 Search parameters: 
         multi-path handling = yes, multi-path mode enabled
    (-b) trellis beam width = 4000
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 40
    (-s)  search stack size = 2000
    (-m)    search overflow = after 8000 hypothesis poped
             2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 360
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 40 candidates found
    (-output)    and output = 1 candidates out of above
     factoring score: 1-gram prob. (statically assigned beforehand)
     output word alignments
    short pause segmentation = on
          sp duration length = 10 frames
    fall back on search fail = on, adopt 1st pass result as final

------------------------------------------------------------
 Decoding algorithm:

    1st pass input processing = (forced) buffered, batch
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = waveform file
              input filelist = test.dbl
              sampling freq. = 16000 Hz required
             threaded A/D-in = supported, off
       zero frames stripping = on
             silence cutting = on
                 level thres = 2000 / 32767
             zerocross thres = 60 / sec.
                 head margin = 300 msec.
                 tail margin = 400 msec.
                   chunk size = 1000 samples
           FVAD switch value = -1 (disabled)
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean and variance norm. for batch decoding:      *
    * constant mean and variance was loaded from file.          *
    * they will be applied constantly for all input.            *
    *************************************************************

------
### read waveform input
Stat: adin_file: input speechfile: mozilla.wav
Warning: strip: sample 212-232 has zero value, stripped
Warning: strip: sample 312-327 has zero value, stripped
Warning: strip: sample 391-406 has zero value, stripped
Warning: strip: sample 914-930 has zero value, stripped
Warning: strip: sample 51221-51244 has zero value, stripped
Warning: strip: sample 112765-112783 has zero value, stripped
Warning: strip: sample 113264-113279 has zero value, stripped
Warning: strip: sample 113394-113409 has zero value, stripped
Warning: strip: sample 113701-113719 has zero value, stripped
Warning: strip: sample 114939-114959 has zero value, stripped
Warning: strip: sample 115667-115682 has zero value, stripped
Warning: strip: sample 115932-115948 has zero value, stripped
Warning: strip: sample 116475-116490 has zero value, stripped
Warning: strip: sample 116605-116623 has zero value, stripped
Warning: strip: sample 117040-117055 has zero value, stripped
Warning: strip: sample 117490-117507 has zero value, stripped
Warning: strip: sample 868-884 has zero value, stripped
STAT: 50800 samples (3.17 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)

pass1_best: <s> without the data said the article was useless </s>
pass1_best_wordseq: <s> without the data said the article was useless </s>
pass1_best_phonemeseq: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s | sil
pass1_best_score: 282.374390
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 42566 generated, 7034 pushed, 454 nodes popped in 306
ALIGN: === word alignment begin ===
sentence1: <s> without the data said the article was useless </s>
wseq1: <s> without the data said the article was useless </s>
phseq1: sil | w ih dh aw t | dh ax | d ae t ah | s eh d | dh iy | aa r t ah k ah l | w ax z | y uw s l ah s | sil
cmscore1: 0.785 0.892 0.318 0.284 0.669 0.701 0.818 0.103 0.528 1.000
score1: 261.947144
=== begin forced alignment ===
-- word alignment --
 id: from  to    n_score    unit
 ----------------------------------------
[   0   17]  0.684685  <s>  [<s>]
[  18   51]  2.635743  without  [without]
[  52   62]  1.305501  the  [the]
[  63   91]  2.258720  data [data]
[  92  129]  2.324036  said [said]
[ 130  138]  2.702694  the  [the]
[ 139  173]  2.214880  article  [article]
[ 174  194]  1.476129  was  [was]
[ 195  264]  2.325301  useless  [useless]
[ 265  305]  0.749876  </s> [</s>]
re-computed AM score: 593.854553
=== end forced alignment ===

STAT: 24800 samples (1.55 sec.)
STAT: ### speech analysis (waveform -> MFCC)
### Recognition: 1st pass (LR beam)

pass1_best: <s> i've got go to him
pass1_best_wordseq: <s> i've got go to him
pass1_best_phonemeseq: sil | ah ih b | g aa t | g ow | t ah | hh ih m
pass1_best_score: 120.275803
### Recognition: 2nd pass (RL heuristic best-first)
Segmentation fault (core dumped)
zdomjus60 commented 4 years ago

@marcoippolito Uhm, OK, we must look for another solution. I attach my full dump for you: it's Boris Johnson's speech as Prime Minister. I'm not a fan, but it serves as a sample :-). Think of it as a successful transcription (yes, many words are wrong ("god bless the green"), but Julius has its limits). I'm curious about the 2nd pass. speech.txt

dlmiles commented 4 years ago

https://www.youtube.com/watch?v=YypKBfFtovU (this looks like the audio content)

CaptainBloodz commented 1 year ago

Unsure if related, but here is

### Recognition: 2nd pass (RL heuristic best-first) Segmentation fault

when building with -flto (-O2, gcc 13.2.1, -march=skylake; more build details upon request).

Smooth as silk otherwise when testing WAV files as described in README.md.

CaptainBloodz commented 1 year ago

Just tested: -flto with -O1 works fine here. So enabling the extra optimizations that -O2 adds over -O1 one by one could help narrow down where the problem originates.