sanphiee / LPLDA

Local Pairwise Linear Discriminant Analysis
Apache License 2.0

Process is getting killed #1

Open rameshkunasi opened 4 years ago

rameshkunasi commented 4 years ago

Hi,

I am using 200K utterances to train LDA. While training, the CPU RAM fills up and the process gets killed. My machine has 8 GB of RAM and 2 GB of swap. How can I train LDA with a large amount of data?

sanphiee commented 4 years ago

Sorry for the late reply 😀

  1. What is the dimension of your vector? Since we obtain the eigenvectors by solving a generalized eigenvalue problem, the memory used is closely related to this dimension.
  2. We find that the different eigen solvers available in Python have different memory requirements. The SVD method consumed the least memory in our experiments.
  3. Actually, the memory used has little correlation with the number of training samples. You can accumulate the within-class and between-class covariance matrices in batches (see the sketch after this list).
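
For what it's worth, here is a minimal sketch of that batch idea (illustrative code, not from this repository; the function and variable names are assumptions): the within-class and between-class scatter matrices can be built from per-class sums, per-class counts, and a global second-order sum, all of which can be accumulated one chunk of vectors at a time, so peak memory stays around dim x dim instead of n_samples x dim.

    import numpy as np

    def accumulate_scatter(batches, dim):
        """Accumulate within/between-class scatter from (X, y) chunks.

        batches : iterable of (X, y) pairs, X of shape (n_chunk, dim).
        Returns (Sw, Sb), the biased within- and between-class scatter.
        """
        class_sums, class_counts = {}, {}
        sqsum = np.zeros((dim, dim))
        total_sum, total_n = np.zeros(dim), 0
        for X, y in batches:
            sqsum += X.T @ X                      # global second-order statistics
            total_sum += X.sum(axis=0)
            total_n += X.shape[0]
            for label in np.unique(y):
                Xg = X[y == label]
                class_sums[label] = class_sums.get(label, np.zeros(dim)) + Xg.sum(axis=0)
                class_counts[label] = class_counts.get(label, 0) + Xg.shape[0]
        mean = total_sum / total_n
        # between-class scatter from the class means,
        Sb = np.zeros((dim, dim))
        for label, s in class_sums.items():
            diff = s / class_counts[label] - mean
            Sb += class_counts[label] * np.outer(diff, diff)
        Sb /= total_n
        # total scatter minus between-class scatter gives the within-class scatter,
        St = sqsum / total_n - np.outer(mean, mean)
        Sw = St - Sb
        return Sw, Sb

With 512-dimensional vectors the accumulated matrices are only 512 x 512, so the 200K utterances can be streamed from disk in chunks rather than held in memory at once; the generalized eigenvalue problem is then solved on these small matrices exactly as in LDA.py.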

    Regards, Yours

From my iPhone


rameshkunasi commented 4 years ago

Thank you for your reply 👍 Below are my answers to your questions.

  1. The dimension of each vector is 512.
  2. I got the error while training with LDA.py. You are using only eigenvectors to train the LDA matrix; I have not seen an SVD implementation in LDA.py.
  3. I am using 200K training examples to train LDA.

I have trained LDA using LDA.py on the 200K training samples and saved self.scalings_ in Kaldi format as transform.mat. Can you suggest how I can use transform.mat to train PLDA in Kaldi with ivector-compute-plda? Or how can I train PLDA in Python? If you have a script, please share it.

Thanks K.Ramesh

sanphiee commented 4 years ago

While training using LDA.py I got the error. You are using only eigenvectors to train the LDA matrix. I have not seen SVD implementation in LDA.py

  1. If you want to use LDA alone, you can try scikit-learn's implementation (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.discriminant_analysis). My code is a revision of it, and the original offers an SVD solver option (see the sketch after this list).

  2. If you are familiar with Kaldi, you can try the attachments.
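
A minimal sketch of that SVD option (illustrative, not part of this repository; the toy data shapes are assumptions), using the scikit-learn estimator linked above:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # toy stand-ins for 512-dim x-vectors with 50 speakers (shapes are illustrative),
    X = np.random.randn(1000, 512).astype(np.float32)
    y = np.repeat(np.arange(50), 20)

    # 'svd' is the default solver; it avoids forming the covariance matrix,
    # which is why it tends to need less memory than the eigen solver.
    lda = LinearDiscriminantAnalysis(solver='svd', n_components=49)
    X_lda = lda.fit_transform(X, y)
    print(X_lda.shape)  # (1000, 49)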

Yours sincerely,

He Liang,

Rohm Building 8101,

Department of Electronic Engineering, Tsinghua University,

Beijing, 10084, China


#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Copyright 2014-2019 Brno University of Technology (author: Karel Vesely)
# Licensed under the Apache License, Version 2.0 (the "License")

from __future__ import print_function
from __future__ import division

import numpy as np
import sys, os, re, gzip, struct

#################################################

# Adding kaldi tools to shell path,

# Select kaldi,
if not 'KALDI_ROOT' in os.environ:
    # Default! To change run python with 'export KALDI_ROOT=/some_dir python'
    os.environ['KALDI_ROOT']='/mnt/matylda5/iveselyk/Tools/kaldi-trunk'

# Add kaldi tools to path,
path = os.popen('echo $KALDI_ROOT/src/bin:$KALDI_ROOT/tools/openfst/bin:$KALDI_ROOT/src/fstbin/:$KALDI_ROOT/src/gmmbin/:$KALDI_ROOT/src/featbin/:$KALDI_ROOT/src/lm/:$KALDI_ROOT/src/sgmmbin/:$KALDI_ROOT/src/sgmm2bin/:$KALDI_ROOT/src/fgmmbin/:$KALDI_ROOT/src/latbin/:$KALDI_ROOT/src/nnetbin:$KALDI_ROOT/src/nnet2bin:$KALDI_ROOT/src/nnet3bin:$KALDI_ROOT/src/online2bin/:$KALDI_ROOT/src/ivectorbin/:$KALDI_ROOT/src/lmbin/')
os.environ['PATH'] = path.readline().strip() + ':' + os.environ['PATH']
path.close()

#################################################

# Define all custom exceptions,
class UnsupportedDataType(Exception): pass
class UnknownVectorHeader(Exception): pass
class UnknownMatrixHeader(Exception): pass

class BadSampleSize(Exception): pass
class BadInputFormat(Exception): pass

class SubprocessFailed(Exception): pass

#################################################

# Data-type independent helper functions,

def open_or_fd(file, mode='rb'):
    """ fd = open_or_fd(file)
    Open file, gzipped file, pipe, or forward the file-descriptor.
    Eventually seeks in the 'file' argument contains ':offset' suffix.
    """
    offset = None
    try:
        # strip 'ark:' prefix from r{x,w}filename (optional),
        if re.search('^(ark|scp)(,scp|,b|,t|,n?f|,n?p|,b?o|,n?s|,n?cs)*:', file):
            (prefix,file) = file.split(':',1)
        # separate offset from filename (optional),
        if re.search(':[0-9]+$', file):
            (file,offset) = file.rsplit(':',1)
        # input pipe?
        if file[-1] == '|':
            fd = popen(file[:-1], 'rb') # custom,
        # output pipe?
        elif file[0] == '|':
            fd = popen(file[1:], 'wb') # custom,
        # is it gzipped?
        elif file.split('.')[-1] == 'gz':
            fd = gzip.open(file, mode)
        # a normal file...
        else:
            fd = open(file, mode)
    except TypeError:
        # 'file' is opened file descriptor,
        fd = file
    # Eventually seek to offset,
    if offset != None: fd.seek(int(offset))
    return fd

# based on '/usr/local/lib/python3.6/os.py'
def popen(cmd, mode="rb"):
    if not isinstance(cmd, str):
        raise TypeError("invalid cmd type (%s, expected string)" % type(cmd))

    import subprocess, io, threading

# cleanup function for subprocesses,
def cleanup(proc, cmd):
    ret = proc.wait()
    if ret > 0:
        raise SubprocessFailed('cmd %s returned %d !' % (cmd,ret))
    return

# text-mode,
if mode == "r":
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=sys.stderr)
    threading.Thread(target=cleanup,args=(proc,cmd)).start() # clean-up thread,
    return io.TextIOWrapper(proc.stdout)
elif mode == "w":
    proc = subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE, stderr=sys.stderr)
    threading.Thread(target=cleanup,args=(proc,cmd)).start() # clean-up thread,
    return io.TextIOWrapper(proc.stdin)
# binary,
elif mode == "rb":
    proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=sys.stderr)
    threading.Thread(target=cleanup,args=(proc,cmd)).start() # clean-up thread,
    return proc.stdout
elif mode == "wb":
    proc = subprocess.Popen(cmd, shell=True, stdin=subprocess.PIPE, stderr=sys.stderr)
    threading.Thread(target=cleanup,args=(proc,cmd)).start() # clean-up thread,
    return proc.stdin
# sanity,
else:
    raise ValueError("invalid mode %s" % mode)

def read_key(fd):
    """ [key] = read_key(fd)
    Read the utterance-key from the opened ark/stream descriptor 'fd'.
    """
    assert('b' in fd.mode), "Error: 'fd' was opened in text mode (in python3 use sys.stdin.buffer)"

key = ''
while 1:
    char = fd.read(1).decode("latin1")
    if char == '' : break
    if char == ' ' : break
    key += char
key = key.strip()
if key == '': return None # end of file,
assert(re.match('^\S+$',key) != None) # check format (no whitespace!)
return key

#################################################

# Integer vectors (alignments, ...),

def read_ali_ark(file_or_fd):
    """ Alias to 'read_vec_int_ark()' """
    return read_vec_int_ark(file_or_fd)

def read_vec_int_ark(file_or_fd):
    """ generator(key,vec) = read_vec_int_ark(file_or_fd)
    Create generator of (key,vector) tuples, which reads from the ark file/stream.
    file_or_fd : ark, gzipped ark, pipe or opened file descriptor.

 Read ark to a 'dictionary':
 d = { u:d for u,d in kaldi_io.read_vec_int_ark(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    key = read_key(fd)
    while key:
        ali = read_vec_int(fd)
        yield key, ali
        key = read_key(fd)
finally:
    if fd is not file_or_fd: fd.close()

def read_vec_int(file_or_fd): """ [int-vec] = read_vec_int(file_or_fd) Read kaldi integer vector, ascii or binary input, """ fd = open_or_fd(file_or_fd) binary = fd.read(2).decode() if binary == '\0B': # binary flag assert(fd.read(1).decode() == '\4'); # int-size vec_size = np.frombuffer(fd.read(4), dtype='int32', count=1)[0] # vector dim if vec_size == 0: return np.array([], dtype='int32')

Elements from int32 vector are sored in tuples: (sizeof(int32), value),

    vec = np.frombuffer(fd.read(vec_size*5), dtype=[('size','int8'),('value','int32')], count=vec_size)
    assert(vec[0]['size'] == 4) # int32 size,
    ans = vec[:]['value'] # values are in 2nd column,
else: # ascii,
    arr = (binary + fd.readline().decode()).strip().split()
    try:
        arr.remove('['); arr.remove(']') # optionally
    except ValueError:
        pass
    ans = np.array(arr, dtype=int)
if fd is not file_or_fd : fd.close() # cleanup
return ans

Writing,

def write_vec_int(file_or_fd, v, key=''): """ write_vec_int(f, v, key='') Write a binary kaldi integer vector to filename or stream. Arguments: file_or_fd : filename or opened file descriptor for writing, v : the vector to be stored, key (optional) : used for writing ark-file, the utterance-id gets written before the vector.

 Example of writing single vector:
 kaldi_io.write_vec_int(filename, vec)

 Example of writing arkfile:
 with open(ark_file,'w') as f:
     for key,vec in dict.iteritems():
         kaldi_io.write_vec_flt(f, vec, key=key)
"""
assert(isinstance(v, np.ndarray))
assert(v.dtype == np.int32)
fd = open_or_fd(file_or_fd, mode='wb')
if sys.version_info[0] == 3: assert(fd.mode == 'wb')
try:
    if key != '' : fd.write((key+' ').encode("latin1")) # ark-files have keys (utterance-id),
    fd.write('\0B'.encode()) # we write binary!
    # dim,
    fd.write('\4'.encode()) # int32 type,
    fd.write(struct.pack(np.dtype('int32').char, v.shape[0]))
    # data,
    for i in range(len(v)):
        fd.write('\4'.encode()) # int32 type,
        fd.write(struct.pack(np.dtype('int32').char, v[i])) # binary,
finally:
    if fd is not file_or_fd : fd.close()

#################################################

Float vectors (confidences, ivectors, ...),

Reading,

def read_vec_flt_scp(file_or_fd): """ generator(key,mat) = read_vec_flt_scp(file_or_fd) Returns generator of (key,vector) tuples, read according to kaldi scp. file_or_fd : scp, gzipped scp, pipe or opened file descriptor.

 Iterate the scp:
 for key,vec in kaldi_io.read_vec_flt_scp(file):
     ...

 Read scp to a 'dictionary':
 d = { key:mat for key,mat in kaldi_io.read_mat_scp(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    for line in fd:
        (key,rxfile) = line.decode().split(' ')
        vec = read_vec_flt(rxfile)
        yield key, vec
finally:
    if fd is not file_or_fd : fd.close()

def read_vec_flt_ark(file_or_fd): """ generator(key,vec) = read_vec_flt_ark(file_or_fd) Create generator of (key,vector) tuples, reading from an ark file/stream. file_or_fd : ark, gzipped ark, pipe or opened file descriptor.

 Read ark to a 'dictionary':
 d = { u:d for u,d in kaldi_io.read_vec_flt_ark(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    key = read_key(fd)
    while key:
        ali = read_vec_flt(fd)
        yield key, ali
        key = read_key(fd)
finally:
    if fd is not file_or_fd : fd.close()

def read_vec_flt(file_or_fd):
    """ [flt-vec] = read_vec_flt(file_or_fd)
    Read kaldi float vector, ascii or binary input,
    """
    fd = open_or_fd(file_or_fd)
    binary = fd.read(2).decode()
    if binary == '\0B': # binary flag
        ans = _read_vec_flt_binary(fd)
    else: # ascii,
        arr = (binary + fd.readline().decode()).strip().split()
        try:
            arr.remove('['); arr.remove(']') # optionally
        except ValueError:
            pass
        ans = np.array(arr, dtype=float)
    if fd is not file_or_fd : fd.close() # cleanup
    return ans

def _read_vec_flt_binary(fd):
    header = fd.read(3).decode()
    if header == 'FV ' : sample_size = 4 # floats
    elif header == 'DV ' : sample_size = 8 # doubles
    else : raise UnknownVectorHeader("The header contained '%s'" % header)
    assert (sample_size > 0)
    # Dimension,

assert (fd.read(1).decode() == '\4'); # int-size
vec_size = np.frombuffer(fd.read(4), dtype='int32', count=1)[0] # vector dim
if vec_size == 0:
    return np.array([], dtype='float32')
# Read whole vector,
buf = fd.read(vec_size * sample_size)
if sample_size == 4 : ans = np.frombuffer(buf, dtype='float32')
elif sample_size == 8 : ans = np.frombuffer(buf, dtype='float64')
else : raise BadSampleSize
return ans

Writing,

def write_vec_flt(file_or_fd, v, key=''): """ write_vec_flt(f, v, key='') Write a binary kaldi vector to filename or stream. Supports 32bit and 64bit floats. Arguments: file_or_fd : filename or opened file descriptor for writing, v : the vector to be stored, key (optional) : used for writing ark-file, the utterance-id gets written before the vector.

 Example of writing single vector:
 kaldi_io.write_vec_flt(filename, vec)

 Example of writing arkfile:
 with open(ark_file,'w') as f:
     for key,vec in dict.iteritems():
         kaldi_io.write_vec_flt(f, vec, key=key)
"""
assert(isinstance(v, np.ndarray))
fd = open_or_fd(file_or_fd, mode='wb')
if sys.version_info[0] == 3: assert(fd.mode == 'wb')
try:
    if key != '' : fd.write((key+' ').encode("latin1")) # ark-files have keys (utterance-id),
    fd.write('\0B'.encode()) # we write binary!
    # Data-type,
    if v.dtype == 'float32': fd.write('FV '.encode())
    elif v.dtype == 'float64': fd.write('DV '.encode())
    else: raise UnsupportedDataType("'%s', please use 'float32' or 'float64'" % v.dtype)
    # Dim,
    fd.write('\04'.encode())
    fd.write(struct.pack(np.dtype('uint32').char, v.shape[0])) # dim
    # Data,
    fd.write(v.tobytes())
finally:
    if fd is not file_or_fd : fd.close()

#################################################

Float matrices (features, transformations, ...),

Reading,

def read_mat_scp(file_or_fd): """ generator(key,mat) = read_mat_scp(file_or_fd) Returns generator of (key,matrix) tuples, read according to kaldi scp. file_or_fd : scp, gzipped scp, pipe or opened file descriptor.

 Iterate the scp:
 for key,mat in kaldi_io.read_mat_scp(file):
     ...

 Read scp to a 'dictionary':
 d = { key:mat for key,mat in kaldi_io.read_mat_scp(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    for line in fd:
        (key,rxfile) = line.decode().split(' ')
        mat = read_mat(rxfile)
        yield key, mat
finally:
    if fd is not file_or_fd : fd.close()

def read_mat_ark(file_or_fd): """ generator(key,mat) = read_mat_ark(file_or_fd) Returns generator of (key,matrix) tuples, read from ark file/stream. file_or_fd : scp, gzipped scp, pipe or opened file descriptor.

 Iterate the ark:
 for key,mat in kaldi_io.read_mat_ark(file):
     ...

 Read ark to a 'dictionary':
 d = { key:mat for key,mat in kaldi_io.read_mat_ark(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    key = read_key(fd)
    while key:
        mat = read_mat(fd)
        yield key, mat
        key = read_key(fd)
finally:
    if fd is not file_or_fd : fd.close()

def read_mat(file_or_fd):
    """ [mat] = read_mat(file_or_fd)
    Reads single kaldi matrix, supports ascii and binary.
    file_or_fd : file, gzipped file, pipe or opened file descriptor.
    """
    fd = open_or_fd(file_or_fd)
    try:
        binary = fd.read(2).decode()
        if binary == '\0B' :
            mat = _read_mat_binary(fd)
        else:
            assert(binary == ' [')
            mat = _read_mat_ascii(fd)
    finally:
        if fd is not file_or_fd: fd.close()
    return mat

def _read_mat_binary(fd):
    # Data type
    header = fd.read(3).decode()
# 'CM', 'CM2', 'CM3' are possible values,
if header.startswith('CM'): return _read_compressed_mat(fd, header)
elif header == 'FM ': sample_size = 4 # floats
elif header == 'DM ': sample_size = 8 # doubles
else: raise UnknownMatrixHeader("The header contained '%s'" % header)
assert(sample_size > 0)
# Dimensions
s1, rows, s2, cols = np.frombuffer(fd.read(10), dtype='int8,int32,int8,int32', count=1)[0]
# Read whole matrix
buf = fd.read(rows * cols * sample_size)
if sample_size == 4 : vec = np.frombuffer(buf, dtype='float32')
elif sample_size == 8 : vec = np.frombuffer(buf, dtype='float64')
else : raise BadSampleSize
mat = np.reshape(vec,(rows,cols))
return mat

def _read_mat_ascii(fd):
    rows = []
    while 1:
        line = fd.readline().decode()
        if (len(line) == 0) : raise BadInputFormat # eof, should not happen!
        if len(line.strip()) == 0 : continue # skip empty line
        arr = line.strip().split()
        if arr[-1] != ']':
            rows.append(np.array(arr,dtype='float32')) # not last line
        else:
            rows.append(np.array(arr[:-1],dtype='float32')) # last line
            mat = np.vstack(rows)
            return mat

def _read_compressed_mat(fd, format):
    """ Read a compressed matrix,
    see: https://github.com/kaldi-asr/kaldi/blob/master/src/matrix/compressed-matrix.h
    methods: CompressedMatrix::Read(...), CompressedMatrix::CopyToMat(...),
    """
    assert(format == 'CM ') # The formats CM2, CM3 are not supported...

# Format of header 'struct',
global_header = np.dtype([('minvalue','float32'),('range','float32'),('num_rows','int32'),('num_cols','int32')]) # member '.format' is not written,
per_col_header = np.dtype([('percentile_0','uint16'),('percentile_25','uint16'),('percentile_75','uint16'),('percentile_100','uint16')])

# Read global header,
globmin, globrange, rows, cols = np.frombuffer(fd.read(16), dtype=global_header, count=1)[0]

# The data is structed as [Colheader, ... , Colheader, Data, Data , .... ]
#                                                 {                     cols                     }{         size                 }
col_headers = np.frombuffer(fd.read(cols*8), dtype=per_col_header, count=cols)
col_headers = np.array([np.array([x for x in y]) * globrange * 1.52590218966964e-05 + globmin for y in col_headers], dtype=np.float32)
data = np.reshape(np.frombuffer(fd.read(cols*rows), dtype='uint8', count=cols*rows), newshape=(cols,rows)) # stored as col-major,

mat = np.zeros((cols,rows), dtype='float32')
p0 = col_headers[:, 0].reshape(-1, 1)
p25 = col_headers[:, 1].reshape(-1, 1)
p75 = col_headers[:, 2].reshape(-1, 1)
p100 = col_headers[:, 3].reshape(-1, 1)
mask_0_64 = (data <= 64)
mask_193_255 = (data > 192)
mask_65_192 = (~(mask_0_64 | mask_193_255))

mat += (p0    + (p25 - p0) / 64. * data) * mask_0_64.astype(np.float32)
mat += (p25 + (p75 - p25) / 128. * (data - 64)) * mask_65_192.astype(np.float32)
mat += (p75 + (p100 - p75) / 63. * (data - 192)) * mask_193_255.astype(np.float32)

return mat.T # transpose! col-major -> row-major,

Writing,

def write_mat(file_or_fd, m, key=''): """ write_mat(f, m, key='') Write a binary kaldi matrix to filename or stream. Supports 32bit and 64bit floats. Arguments: file_or_fd : filename of opened file descriptor for writing, m : the matrix to be stored, key (optional) : used for writing ark-file, the utterance-id gets written before the matrix.

 Example of writing single matrix:
 kaldi_io.write_mat(filename, mat)

 Example of writing arkfile:
 with open(ark_file,'w') as f:
     for key,mat in dict.iteritems():
         kaldi_io.write_mat(f, mat, key=key)
"""
assert(isinstance(m, np.ndarray))
assert(len(m.shape) == 2), "'m' has to be 2d matrix!"
fd = open_or_fd(file_or_fd, mode='wb')
if sys.version_info[0] == 3: assert(fd.mode == 'wb')
try:
    if key != '' : fd.write((key+' ').encode("latin1")) # ark-files have keys (utterance-id),
    fd.write('\0B'.encode()) # we write binary!
    # Data-type,
    if m.dtype == 'float32': fd.write('FM '.encode())
    elif m.dtype == 'float64': fd.write('DM '.encode())
    else: raise UnsupportedDataType("'%s', please use 'float32' or 'float64'" % m.dtype)
    # Dims,
    fd.write('\04'.encode())
    fd.write(struct.pack(np.dtype('uint32').char, m.shape[0])) # rows
    fd.write('\04'.encode())
    fd.write(struct.pack(np.dtype('uint32').char, m.shape[1])) # cols
    # Data,
    fd.write(m.tobytes())
finally:
    if fd is not file_or_fd : fd.close()

#################################################

# 'Posterior' kaldi type (posteriors, confusion network, nnet1 training targets, ...)
# Corresponds to: vector<vector<tuple<int,float> > >
# - outer vector: time axis
# - inner vector: records at the time
# - tuple: int = index, float = value
#

def read_cnet_ark(file_or_fd):
    """ Alias of function 'read_post_ark()', 'cnet' = confusion network """
    return read_post_ark(file_or_fd)

def read_post_rxspec(file_):
    """ adaptor to read both 'ark:...' and 'scp:...' inputs of posteriors, """
    if file_.startswith("ark:"):
        return read_post_ark(file_)
    elif file_.startswith("scp:"):
        return read_post_scp(file_)
    else:
        print("unsupported input type: %s" % file_)
        print("it should begin with 'ark:' or 'scp:'")
        sys.exit(1)

def read_post_scp(file_or_fd): """ generator(key,post) = read_post_scp(file_or_fd) Returns generator of (key,post) tuples, read according to kaldi scp. file_or_fd : scp, gzipped scp, pipe or opened file descriptor.

 Iterate the scp:
 for key,post in kaldi_io.read_post_scp(file):
     ...

 Read scp to a 'dictionary':
 d = { key:post for key,post in kaldi_io.read_post_scp(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    for line in fd:
        (key,rxfile) = line.decode().split(' ')
        post = read_post(rxfile)
        yield key, post
finally:
    if fd is not file_or_fd : fd.close()

def read_post_ark(file_or_fd): """ generator(key,vec<vec<int,float>>) = read_post_ark(file) Returns generator of (key,posterior) tuples, read from ark file. file_or_fd : ark, gzipped ark, pipe or opened file descriptor.

 Iterate the ark:
 for key,post in kaldi_io.read_post_ark(file):
     ...

 Read ark to a 'dictionary':
 d = { key:post for key,post in kaldi_io.read_post_ark(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    key = read_key(fd)
    while key:
        post = read_post(fd)
        yield key, post
        key = read_key(fd)
finally:
    if fd is not file_or_fd: fd.close()

def read_post(file_or_fd): """ [post] = read_post(file_or_fd) Reads single kaldi 'Posterior' in binary format.

 The 'Posterior' is C++ type 'vector<vector<tuple<int,float> > >',
 the outer-vector is usually time axis, inner-vector are the records
 at given time,    and the tuple is composed of an 'index' (integer)
 and a 'float-value'. The 'float-value' can represent a probability
 or any other numeric value.

 Returns vector of vectors of tuples.
"""
fd = open_or_fd(file_or_fd)
ans=[]
binary = fd.read(2).decode(); assert(binary == '\0B'); # binary flag
assert(fd.read(1).decode() == '\4'); # int-size
outer_vec_size = np.frombuffer(fd.read(4), dtype='int32', count=1)[0] # number of frames (or bins)

# Loop over 'outer-vector',
for i in range(outer_vec_size):
    assert(fd.read(1).decode() == '\4'); # int-size
    inner_vec_size = np.frombuffer(fd.read(4), dtype='int32', count=1)[0] # number of records for frame (or bin)
    data = np.frombuffer(fd.read(inner_vec_size*10), dtype=[('size_idx','int8'),('idx','int32'),('size_post','int8'),('post','float32')], count=inner_vec_size)
    assert(data[0]['size_idx'] == 4)
    assert(data[0]['size_post'] == 4)
    ans.append(data[['idx','post']].tolist())

if fd is not file_or_fd: fd.close()
return ans

#################################################

Kaldi Confusion Network bin begin/end times,

(kaldi stores CNs time info separately from the Posterior).

#

def read_cntime_ark(file_or_fd): """ generator(key,vec<tuple<float,float>>) = read_cntime_ark(file_or_fd) Returns generator of (key,cntime) tuples, read from ark file. file_or_fd : file, gzipped file, pipe or opened file descriptor.

 Iterate the ark:
 for key,time in kaldi_io.read_cntime_ark(file):
     ...

 Read ark to a 'dictionary':
 d = { key:time for key,time in kaldi_io.read_post_ark(file) }
"""
fd = open_or_fd(file_or_fd)
try:
    key = read_key(fd)
    while key:
        cntime = read_cntime(fd)
        yield key, cntime
        key = read_key(fd)
finally:
    if fd is not file_or_fd : fd.close()

def read_cntime(file_or_fd): """ [cntime] = read_cntime(file_or_fd) Reads single kaldi 'Confusion Network time info', in binary format: C++ type: vector<tuple<float,float> >. (begin/end times of bins at the confusion network).

 Binary layout is '<num-bins> <beg1> <end1> <beg2> <end2> ...'

 file_or_fd : file, gzipped file, pipe or opened file descriptor.

 Returns vector of tuples.
"""
fd = open_or_fd(file_or_fd)
binary = fd.read(2).decode(); assert(binary == '\0B'); # assuming it's binary

assert(fd.read(1).decode() == '\4'); # int-size
vec_size = np.frombuffer(fd.read(4), dtype='int32', count=1)[0] # number of frames (or bins)

data = np.frombuffer(fd.read(vec_size*10), dtype=[('size_beg','int8'),('t_beg','float32'),('size_end','int8'),('t_end','float32')], count=vec_size)
assert(data[0]['size_beg'] == 4)
assert(data[0]['size_end'] == 4)
ans = data[['t_beg','t_end']].tolist() # Return vector of tuples (t_beg,t_end),

if fd is not file_or_fd : fd.close()
return ans

#################################################

# Segments related,
#
# Segments as 'Bool vectors' can be handy,
# - for 'superposing' the segmentations,
# - for frame-selection in Speaker-ID experiments,

def read_segments_as_bool_vec(segments_file):
    """ [ bool_vec ] = read_segments_as_bool_vec(segments_file)
    using kaldi 'segments' file for 1 wav, format : ' '

# -*- coding: utf-8 -*-

from __future__ import print_function
import numpy as np
from scipy import linalg
from sklearn.utils.multiclass import unique_labels
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted

"""
==========================================================================
 reviser : Liang He

 description : linear discriminant analysis
               revised from sklearn

 created : 20180613
 revised :

 Liang He, +86-13426228839, heliang@mail.tsinghua.edu.cn
 Aurora Lab, Department of Electronic Engineering, Tsinghua University
==========================================================================
"""

__all__ = ['LinearDiscriminantAnalysis']

def _cov(X):
    """Estimate covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
Returns
-------
s : array, shape (n_features, n_features)
    Estimated covariance matrix.
"""
s = np.cov(X, rowvar=0, bias = 1)
return s

def _class_means(X, y):
    """Compute class means.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
Returns
-------
means : array-like, shape (n_features,)
    Class means.
"""
means = []
classes = np.unique(y)
for group in classes:
    Xg = X[y == group, :]
    means.append(Xg.mean(0))
return np.asarray(means)

def _class_cov(X, y):
    """Compute class covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
shrinkage : string or float, optional
    Shrinkage parameter, possible values:
      - None: no shrinkage (default).
      - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.
      - float between 0 and 1: fixed shrinkage parameter.
Returns
-------
cov : array-like, shape (n_features, n_features)
    Class covariance matrix.
"""
classes = np.unique(y)
covs = []
for group in classes:
    Xg = X[y == group, :]
    covs.append(np.atleast_2d(_cov(Xg)))
return np.average(covs, axis=0)

class LinearDiscriminantAnalysis:

def __init__(self, n_components=None, within_between_ratio=10.0, 
             nearest_neighbor_ratio=1.2):
    self.n_components = n_components
    self.within_between_ratio = within_between_ratio
    self.nearest_neighbor_ratio = nearest_neighbor_ratio

def _solve_eigen(self, X, y):
    """Eigenvalue solver.
    The eigenvalue solver computes the optimal solution of the Rayleigh
    coefficient (basically the ratio of between class scatter to within
    class scatter). This solver supports both classification and
    dimensionality reduction (with optional shrinkage).
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array-like, shape (n_samples,) or (n_samples, n_targets)
        Target values.
    Notes
    -----
    This solver is based on [1]_, section 3.8.3, pp. 121-124.
    References
    ----------
    .. [1] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification
       (Second Edition). John Wiley & Sons, Inc., New York, 2001. ISBN
       0-471-05669-3.
    """
    self.means_ = _class_means(X, y)        
    self.covariance_ = _class_cov(X, y)

    Sw = self.covariance_  # within scatter
    St = _cov(X)  # total scatter
    Sb = St - Sw  # between scatter

    evals, evecs = linalg.eigh(Sb, Sw)        
    evecs = evecs[:, np.argsort(evals)[::-1]]  # sort eigenvectors
    self.scalings_ = np.asarray(evecs)

def fit(self, X, y):
    """Fit Local Pairwise Trained Linear Discriminant Analysis 
       model according to the given training data and parameters.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array, shape (n_samples,)
        Target values.
    """

    X, y = check_X_y(np.asarray(X), np.asarray(y.reshape(-1)), ensure_min_samples=2)
    self.classes_ = unique_labels(y)

    # Get the maximum number of components
    if self.n_components is None:
        self.n_components = len(self.classes_) - 1
    else:
        self.n_components = min(len(self.classes_) - 1, self.n_components)

    self._solve_eigen(np.asarray(X), np.asarray(y))
    return self

def transform(self, X):
    """Project data to maximize class separation.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Input data.
    Returns
    -------
    X_new : array, shape (n_samples, n_components)
        Transformed data.
    """
    check_is_fitted(self, ['scalings_'], all_or_any=any)
    X = check_array(X)
    X_new = np.dot(X, self.scalings_)
    return X_new[:, :self.n_components]

if __name__ == '__main__':

    samples = 20
    dim = 6
    lda_dim = 3

    data = np.random.random((samples, dim))
    label = np.random.random_integers(0, 2, size=(samples, 1))

    lda = LinearDiscriminantAnalysis(lda_dim)
    lda.fit(data, label)
    lda_data = lda.transform(data)

    print (lda_data)

# -*- coding: utf-8 -*-

from __future__ import print_function
import numpy as np
from scipy import linalg
from sklearn.utils.multiclass import unique_labels
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted
import LDA
import sys
import kaldi_io

"""
==========================================================================
 author : Liang He

 description : local pairwise linear discriminant analysis
               revised from sklearn

 created : 20180613
 revised :

 Liang He, +86-13426228839, heliang@mail.tsinghua.edu.cn
 Aurora Lab, Department of Electronic Engineering, Tsinghua University
==========================================================================
"""

__all__ = ['LocalPairwiseTrainedLinearDiscriminantAnalysis']

def _cov(X):
    """Estimate covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
Returns
-------
s : array, shape (n_features, n_features)
    Estimated covariance matrix.
"""
s = np.cov(X, rowvar=0, bias = 1)    
return s

def _similarity_function(mean_vec, vecs):

dot_kernel = np.array([np.dot(mean_vec, vecs) for i in range(0,len(vecs))])

return dot_kernel

mean_vec_norm = mean_vec / np.sqrt(np.sum(mean_vec ** 2))
vecs_norm = vecs / np.sqrt(np.sum(vecs ** 2, axis=1))[:, np.newaxis]
cosine_kernel = np.array([np.dot(mean_vec_norm, vecs_norm[i]) for i in range(len(vecs_norm))])
return cosine_kernel

def _class_means_and_neighbor_means(X, y, k1, k2):
    """Compute class means and neighbor means.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
k1: within_between_ratio
k2: nearest_neighbor_ratio
Returns
-------
means : array-like, shape (n_features,)
    Class means and neighbor means
"""
means = []
neighbor_means = []

classes = np.unique(y)
samples = np.size(y)

for group in classes:
    Xg = X[y == group, :]
    Xg_count = Xg.shape[0]
    Xg_mean = Xg.mean(0)
    Xn = X[y != group, :]
    Xg_similarity = _similarity_function(Xg_mean, Xg)
    Xg_similarity_min = min(Xg_similarity)
    Xn_similarity = _similarity_function(Xg_mean, Xn)
    Xn_neighbor_count = len(Xn_similarity[Xn_similarity > Xg_similarity_min])
    Xn_neighbor_count = int(max(k1 * Xg_count, k2 * Xn_neighbor_count))
    Xn_neighbor_count = min(Xn_neighbor_count, samples - Xg_count)
    Xn_label = np.argsort(Xn_similarity)
    Xn_label = Xn_label[::-1]
    Xg_neighbor = np.array([Xn[Xn_label[i]] for i in range(Xn_neighbor_count)])
    Xg_neighbor_mean = Xg_neighbor.mean(0)

    means.append(Xg_mean)
    neighbor_means.append(Xg_neighbor_mean)

return np.array(means), np.array(neighbor_means)

def _class_cov(X, y):
    """Compute class covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
shrinkage : string or float, optional
    Shrinkage parameter, possible values:
      - None: no shrinkage (default).
      - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.
      - float between 0 and 1: fixed shrinkage parameter.
Returns
-------
cov : array-like, shape (n_features, n_features)
    Class covariance matrix.
"""
classes = np.unique(y)
covs = []
for group in classes:
    Xg = X[y == group, :]
    covs.append(np.atleast_2d(_cov(Xg)))
return np.average(covs, axis=0)

def _local_pairwise_cov(class_mean, neighbor_mean):
    """Estimate local pairwise matrix.
    Parameters
    ----------

class_mean : array-like, shape (n_samples, n_features)
             each class mean
neighbor_mean: array-like, shape (n_samples, n_features)
             each class neighbor mean
Returns
-------
s : array, shape (n_features, n_features)
    Estimated covariance matrix.
"""
covs = []
for i in range(0, len(class_mean)):
    local_pair = np.vstack((class_mean[i], neighbor_mean[i]))
    covs.append(np.atleast_2d(_cov(local_pair)))
return np.average(covs, axis=0)

class LocalPairwiseLinearDiscriminantAnalysis:

def __init__(self, n_components=None, within_between_ratio=10.0, 
             nearest_neighbor_ratio=1.2):
    self.n_components = n_components
    self.within_between_ratio = within_between_ratio
    self.nearest_neighbor_ratio = nearest_neighbor_ratio

def _solve_eigen(self, X, y):
    """Eigenvalue solver.
    The eigenvalue solver computes the optimal solution of the Rayleigh
    coefficient (basically the ratio of between class scatter to within
    class scatter). This solver supports both classification and
    dimensionality reduction (with optional shrinkage).
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array-like, shape (n_samples,) or (n_samples, n_targets)
        Target values.
    Notes
    -----
    This solver is based on [1]_, section 3.8.3, pp. 121-124.
    References
    ----------
    .. [1] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification
       (Second Edition). John Wiley & Sons, Inc., New York, 2001. ISBN
       0-471-05669-3.
    """
    self.means_, self.neighbor_means_ = _class_means_and_neighbor_means(
            X, y, self.within_between_ratio, self.nearest_neighbor_ratio)

    Sw = _class_cov(X, y) # within class cov
    Sb = _local_pairwise_cov(self.means_, self.neighbor_means_)

    evals, evecs = linalg.eigh(Sb, Sw)
    evecs = evecs[:, np.argsort(evals)[::-1]]  # sort eigenvectors
    self.scalings_ = np.asarray(evecs)

def fit(self, X, y):
    """Fit Local Pairwise Trained Linear Discriminant Analysis 
       model according to the given training data and parameters.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array, shape (n_samples,)
        Target values.
    """

    X, y = check_X_y(np.asarray(X), np.asarray(y.reshape(-1)), ensure_min_samples=2)
    self.classes_ = unique_labels(y)

    # Get the maximum number of components
    if self.n_components is None:
        self.n_components = len(self.classes_) - 1
    else:
        self.n_components = min(len(self.classes_) - 1, self.n_components)

    self._solve_eigen(X, y)
    return self

def transform(self, X):
    """Project data to maximize class separation.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Input data.
    Returns
    -------
    X_new : array, shape (n_samples, n_components)
        Transformed data.
    """
    check_is_fitted(self, ['scalings_'], all_or_any=any)
    X = check_array(X)
    X_new = np.dot(X, self.scalings_)
    return X_new[:, :self.n_components]

def read_kaldi_scp_flt(kaldi_scp):
    fvec = { k:v for k,v in kaldi_io.read_vec_flt_scp(kaldi_scp) } # binary
    return fvec

def load_spk2utt(filename):
    spk2utt = {}
    with open(filename, "r") as fp:
        for line in fp.readlines():
            line_split = line.strip().split(" ")
            spkid = line_split[0]
            if spkid in spk2utt.keys():
                print ("load spk2utt failed, spkid is not uniq, %s\n", spkid)
                exit(-1)
            spk2utt[spkid] = []
            for i in range(1, len(line_split)):
                uttid = line_split[i]
                spk2utt[spkid].append(uttid)
    return spk2utt

def get_lambda_ids_and_vecs(lambda_xvec, min_utts = 6):
    ids = []
    vecs = []
    for spkid in lambda_xvec.keys():
        if len(lambda_xvec[spkid]) >= min_utts:
            for vec in lambda_xvec[spkid]:
                ids.append(spkid)
                vecs.append(vec)
    return ids, vecs

def label_str_to_int(label_str):
    label_dict = {}
    label_int = []
    for item in label_str:
        if item not in label_dict.keys():
            label_dict[item] = len(label_dict) + 1
        label_int.append(label_dict[item])
    return np.array(label_int)

def train_lda(ids, vecs, lda_dim):

## compute and sub mean
m = np.mean(vecs, axis=0)
vecs = vecs - m

## lplda
lda = LDA.LinearDiscriminantAnalysis(n_components=lda_dim)
lda.fit(np.asarray(vecs), np.asarray(ids))

## compute mean
dim = len(m)
m_trans = lda.transform(np.reshape(m, (1, dim)))

## compute lda trans
vecs_trans = lda.transform(vecs)

## transform matrix
lda_trans = lda.scalings_.T[:lda_dim, :]

return ids, vecs_trans, m_trans, lda_trans

def train_lplda(ids, vecs, lplda_dim):

## compute and sub mean
m = np.mean(vecs, axis=0)
vecs = vecs - m

## lplda
lda = LocalPairwiseLinearDiscriminantAnalysis(n_components=lplda_dim)
lda.fit(np.asarray(vecs), np.asarray(ids))

## compute mean
dim = len(m)
m_trans = lda.transform(np.reshape(m, (1, dim)))

## compute lda trans
vecs_trans = lda.transform(vecs)

## transform matrix
lda_trans = lda.scalings_.T[:lplda_dim, :]

return ids, vecs_trans, m_trans, lda_trans

def lda_lplda_kaldi_wrapper(lda_dim, lplda_dim, kaldi_scp, kaldi_utt2spk, lda_transform):

data = read_kaldi_scp_flt(kaldi_scp)
spk2utt = load_spk2utt(kaldi_utt2spk)

train_vecs = {}
for spkid in spk2utt.keys():
    train_vecs[spkid] = []
    uttid_uniq = []
    for uttid in spk2utt[spkid]:
        uttid_uniq.append(uttid)
    uttid_uniq = sorted(set(uttid_uniq))
    for uttid in uttid_uniq:
        if uttid in data.keys():
            train_vecs[spkid].append(data[uttid])

## get ids, vecs
ids, vecs = get_lambda_ids_and_vecs(train_vecs)
int_ids = label_str_to_int(ids)
dim = len(vecs[0])
print ("lda lplda, ", len(vecs), len(vecs[0]))

## train lda,lplda
int_ids, lda_trans_vecs, lda_trans_m, lda_trans_mat = train_lda(int_ids, vecs, lda_dim)
int_ids, lplda_trans_vecs, lplda_trans_m, lplda_trans_mat = train_lplda(int_ids, lda_trans_vecs, lplda_dim)
del lplda_trans_vecs, lplda_trans_m

 # copy to kaldi format
transform = np.zeros([lplda_dim, dim + 1], float)
lda_lplda_trans = np.dot(lplda_trans_mat, lda_trans_mat)
lda_lplda_m = np.dot(lplda_trans_mat, np.reshape(lda_trans_m, (lda_dim, 1)))

# m_trans = np.dot(lda_trans, m)
for r in range(lplda_dim):
    for c in range(dim):
        transform[r][c] = lda_lplda_trans[r][c]
    transform[r][dim] = -1.0 * lda_lplda_m[r]

## save lda transform
kaldi_io.write_mat(lda_transform, transform)

return

def lplda_kaldi_wrapper(lda_dim, kaldi_scp, kaldi_utt2spk, lda_transform):

data = read_kaldi_scp_flt(kaldi_scp)
spk2utt = load_spk2utt(kaldi_utt2spk)

train_vecs = {}
for spkid in spk2utt.keys():
    train_vecs[spkid] = []
    uttid_uniq = []
    for uttid in spk2utt[spkid]:
        uttid_uniq.append(uttid)
    uttid_uniq = sorted(set(uttid_uniq))
    for uttid in uttid_uniq:
        if uttid in data.keys():
            train_vecs[spkid].append(data[uttid])

## get ids, vecs
ids, vecs = get_lambda_ids_and_vecs(train_vecs)
int_ids = label_str_to_int(ids)
print ("lplda, ", len(vecs), len(vecs[0]))

## compute and sub mean
m = np.mean(vecs, axis=0)
vecs = vecs - m

## lplda
lda = LocalPairwiseLinearDiscriminantAnalysis(n_components=lda_dim)
lda.fit(np.asarray(vecs), np.asarray(int_ids))

## compute mean
dim = len(m)
transform_m = lda.transform(np.reshape(m, (1, dim)))

# copy to kaldi format
transform = np.zeros([lda_dim, dim + 1], float)
lda_trans = lda.scalings_.T[:lda_dim, :]
# m_trans = np.dot(lda_trans, m)
for r in range(lda_dim):
    for c in range(dim):
        transform[r][c] = lda_trans[r][c]
    transform[r][dim] = -1.0 * transform_m[0][r]

## save lda transform
kaldi_io.write_mat(lda_transform, transform)

return

if __name__ == '__main__':

if len(sys.argv) != 6:
    print ("%s lda_dim lplda_dim kaldi_scp kaldi_utt2spk kaldi_lda_transform\n" % sys.argv[0])
    sys.exit(1)

lda_dim = int(sys.argv[1])
lplda_dim = int(sys.argv[2])
kaldi_scp = sys.argv[3]
kaldi_utt2spk = sys.argv[4]
lda_transform = sys.argv[5]

# lda_dim = 150
# lplda_dim = 100
# kaldi_scp = "./xvector_sre16_sre18_combined.scp"
# # kaldi_scp = "./xvectors_sre16_sre18_combined.scp"
# kaldi_utt2spk = "spk2utt"
# lda_transform = "python_kaldi_lplda_transform.mat"

# lplda_kaldi_wrapper(lda_dim, kaldi_scp, kaldi_utt2spk, lda_transform)
lda_lplda_kaldi_wrapper(lda_dim, lplda_dim, kaldi_scp, kaldi_utt2spk, lda_transform)

# ivector-compute-lda --total-covariance-factor=0.0 --dim=$lda_dim \
#   "ark:ivector-subtract-global-mean scp:$nnet_dir/xvectors_$name/xvector.scp ark:- |" \
#   ark:$data/$name/utt2spk $nnet_dir/xvectors_$name/transform.mat

# samples = 20
# dim = 6
# lda_dim = 3

# data = np.random.random((samples, dim))  
# label = np.random.random_integers(0, 2, size=(samples, 1))

# lda = LocalPairwiseLinearDiscriminantAnalysis(lda_dim)
# lda.fit(data, label)
# lda_data = lda.transform(data)

# print (lda_data)

# -*- coding: utf-8 -*-

from __future__ import print_function
import numpy as np
from scipy import linalg
from sklearn.utils.multiclass import unique_labels
from sklearn.utils import check_array, check_X_y
from sklearn.utils.validation import check_is_fitted
import sys
import kaldi_io

"""
==========================================================================
 author : Liang He

 description : local pairwise linear discriminant analysis
               revised from sklearn

 created : 20180613
 revised :

 Liang He, +86-13426228839, heliang@mail.tsinghua.edu.cn
 Aurora Lab, Department of Electronic Engineering, Tsinghua University
==========================================================================
"""

__all__ = ['LocalPairwiseTrainedLinearDiscriminantAnalysis']

def _cov(X):
    """Estimate covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
Returns
-------
s : array, shape (n_features, n_features)
    Estimated covariance matrix.
"""
s = np.cov(X, rowvar=0, bias = 1)    
return s

def _similarity_function(mean_vec, vecs):

dot_kernel = np.array([np.dot(mean_vec, vecs) for i in range(0,len(vecs))])

return dot_kernel

mean_vec_norm = mean_vec / np.sqrt(np.sum(mean_vec ** 2))
vecs_norm = vecs / np.sqrt(np.sum(vecs ** 2, axis=1))[:, np.newaxis]
cosine_kernel = np.array([np.dot(mean_vec_norm, vecs_norm[i]) for i in range(len(vecs_norm))])
return cosine_kernel

def _class_means_and_neighbor_means(X, y, k1, k2):
    """Compute class means and neighbor means.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
k1: within_between_ratio
k2: nearest_neighbor_ratio
Returns
-------
means : array-like, shape (n_features,)
    Class means and neighbor means
"""
means = []
neighbor_means = []

classes = np.unique(y)
samples = np.size(y)

for group in classes:
    Xg = X[y == group, :]
    Xg_count = Xg.shape[0]
    Xg_mean = Xg.mean(0)
    Xn = X[y != group, :]
    Xg_similarity = _similarity_function(Xg_mean, Xg)
    Xg_similarity_min = min(Xg_similarity)
    Xn_similarity = _similarity_function(Xg_mean, Xn)
    Xn_neighbor_count = len(Xn_similarity[Xn_similarity > Xg_similarity_min])
    Xn_neighbor_count = int(max(k1 * Xg_count, k2 * Xn_neighbor_count))
    Xn_neighbor_count = min(Xn_neighbor_count, samples - Xg_count)
    Xn_label = np.argsort(Xn_similarity)
    Xn_label = Xn_label[::-1]
    Xg_neighbor = np.array([Xn[Xn_label[i]] for i in range(Xn_neighbor_count)])
    Xg_neighbor_mean = Xg_neighbor.mean(0)

    means.append(Xg_mean)
    neighbor_means.append(Xg_neighbor_mean)

return np.array(means), np.array(neighbor_means)

def _class_cov(X, y):
    """Compute class covariance matrix.
    Parameters
    ----------

X : array-like, shape (n_samples, n_features)
    Input data.
y : array-like, shape (n_samples,) or (n_samples, n_targets)
    Target values.
shrinkage : string or float, optional
    Shrinkage parameter, possible values:
      - None: no shrinkage (default).
      - 'auto': automatic shrinkage using the Ledoit-Wolf lemma.
      - float between 0 and 1: fixed shrinkage parameter.
Returns
-------
cov : array-like, shape (n_features, n_features)
    Class covariance matrix.
"""
classes = np.unique(y)
covs = []
for group in classes:
    Xg = X[y == group, :]
    covs.append(np.atleast_2d(_cov(Xg)))
return np.average(covs, axis=0)

def _local_pairwise_cov(class_mean, neighbor_mean):
    """Estimate local pairwise matrix.
    Parameters
    ----------

class_mean : array-like, shape (n_samples, n_features)
             each class mean
neighbor_mean: array-like, shape (n_samples, n_features)
             each class neighbor mean
Returns
-------
s : array, shape (n_features, n_features)
    Estimated covariance matrix.
"""
covs = []
for i in range(0, len(class_mean)):
    local_pair = np.vstack((class_mean[i], neighbor_mean[i]))
    covs.append(np.atleast_2d(_cov(local_pair)))
return np.average(covs, axis=0)

class LocalPairwiseLinearDiscriminantAnalysis:

def __init__(self, n_components=None, within_between_ratio=10.0, 
             nearest_neighbor_ratio=1.2):
    self.n_components = n_components
    self.within_between_ratio = within_between_ratio
    self.nearest_neighbor_ratio = nearest_neighbor_ratio

def _solve_eigen(self, X, y):
    """Eigenvalue solver.
    The eigenvalue solver computes the optimal solution of the Rayleigh
    coefficient (basically the ratio of between class scatter to within
    class scatter). This solver supports both classification and
    dimensionality reduction (with optional shrinkage).
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array-like, shape (n_samples,) or (n_samples, n_targets)
        Target values.
    Notes
    -----
    This solver is based on [1]_, section 3.8.3, pp. 121-124.
    References
    ----------
    .. [1] R. O. Duda, P. E. Hart, D. G. Stork. Pattern Classification
       (Second Edition). John Wiley & Sons, Inc., New York, 2001. ISBN
       0-471-05669-3.
    """
    self.means_, self.neighbor_means_ = _class_means_and_neighbor_means(
            X, y, self.within_between_ratio, self.nearest_neighbor_ratio)

    Sw = _class_cov(X, y) # within class cov
    Sb = _local_pairwise_cov(self.means_, self.neighbor_means_)

    evals, evecs = linalg.eigh(Sb, Sw)
    evecs = evecs[:, np.argsort(evals)[::-1]]  # sort eigenvectors
    self.scalings_ = np.asarray(evecs)

def fit(self, X, y):
    """Fit Local Pairwise Trained Linear Discriminant Analysis 
       model according to the given training data and parameters.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Training data.
    y : array, shape (n_samples,)
        Target values.
    """

    X, y = check_X_y(np.asarray(X), np.asarray(y.reshape(-1)), ensure_min_samples=2)
    self.classes_ = unique_labels(y)

    # Get the maximum number of components
    if self.n_components is None:
        self.n_components = len(self.classes_) - 1
    else:
        self.n_components = min(len(self.classes_) - 1, self.n_components)

    self._solve_eigen(X, y)
    return self

def transform(self, X):
    """Project data to maximize class separation.
    Parameters
    ----------
    X : array-like, shape (n_samples, n_features)
        Input data.
    Returns
    -------
    X_new : array, shape (n_samples, n_components)
        Transformed data.
    """
    check_is_fitted(self, ['scalings_'], all_or_any=any)
    X = check_array(X)
    X_new = np.dot(X, self.scalings_)
    return X_new[:, :self.n_components]

def read_kaldi_scp_flt(kaldi_scp):
    fvec = { k:v for k,v in kaldi_io.read_vec_flt_scp(kaldi_scp) } # binary
    return fvec

def load_spk2utt(filename):
    spk2utt = {}
    with open(filename, "r") as fp:
        for line in fp.readlines():
            line_split = line.strip().split(" ")
            spkid = line_split[0]
            if spkid in spk2utt.keys():
                print ("load spk2utt failed, spkid is not uniq, %s\n", spkid)
                exit(-1)
            spk2utt[spkid] = []
            for i in range(1, len(line_split)):
                uttid = line_split[i]
                spk2utt[spkid].append(uttid)
    return spk2utt

def get_lambda_ids_and_vecs(lambda_xvec, min_utts = 6):
    ids = []
    vecs = []
    for spkid in lambda_xvec.keys():
        if len(lambda_xvec[spkid]) >= min_utts:
            for vec in lambda_xvec[spkid]:
                ids.append(spkid)
                vecs.append(vec)
    return ids, vecs

def label_str_to_int(label_str):
    label_dict = {}
    label_int = []
    for item in label_str:
        if item not in label_dict.keys():
            label_dict[item] = len(label_dict) + 1
        label_int.append(label_dict[item])
    return np.array(label_int)

def lplda_kaldi_wrapper(lda_dim, kaldi_scp, kaldi_utt2spk, lda_transform):

data = read_kaldi_scp_flt(kaldi_scp)
spk2utt = load_spk2utt(kaldi_utt2spk)

# train_vecs = {}
# for spkid in spk2utt.keys():
#     train_vecs[spkid] = []  
#     for uttid in spk2utt[spkid]:
#         map_uttid = spkid[6:] + "_" + uttid + "_A"            
#         if map_uttid in data.keys():
#             train_vecs[spkid].append(data[map_uttid])

train_vecs = {}
for spkid in spk2utt.keys():
    train_vecs[spkid] = []
    uttid_uniq = []
    for uttid in spk2utt[spkid]:
        uttid_uniq.append(uttid)
    uttid_uniq = sorted(set(uttid_uniq))
    for uttid in uttid_uniq:
        if uttid in data.keys():
            train_vecs[spkid].append(data[uttid])

## get ids, vecs
ids, vecs = get_lambda_ids_and_vecs(train_vecs)
int_ids = label_str_to_int(ids)
print ("lplda, ", len(vecs), len(vecs[0]))

## compute and sub mean
m = np.mean(vecs, axis=0)
vecs = vecs - m

## lplda
lda = LocalPairwiseLinearDiscriminantAnalysis(n_components=lda_dim)
lda.fit(np.asarray(vecs), np.asarray(int_ids))

## compute mean
dim = len(m)
transform_m = lda.transform(np.reshape(m, (1, dim)))

# copy to kaldi format
transform = np.zeros([lda_dim, dim + 1], float)
lda_trans = lda.scalings_.T[:lda_dim, :]
# m_trans = np.dot(lda_trans, m)
for r in range(lda_dim):
    for c in range(dim):
        transform[r][c] = lda_trans[r][c]
    transform[r][dim] = -1.0 * transform_m[0][r]

## save lda transform
kaldi_io.write_mat(lda_transform, transform)

return

if __name__ == '__main__':

if len(sys.argv) != 5:
    print ("%s lda_dim kaldi_scp kaldi_utt2spk kaldi_lda_transform\n" % sys.argv[0])
    sys.exit(1)

lda_dim = int(sys.argv[1])
kaldi_scp = sys.argv[2]
kaldi_utt2spk = sys.argv[3]
lda_transform = sys.argv[4]

# lda_dim = 100
# kaldi_scp = "./xvector_sre16_sre18_combined.scp"
# kaldi_utt2spk = "spk2utt"
# lda_transform = "python_kaldi_lplda_transform.mat"

lplda_kaldi_wrapper(lda_dim, kaldi_scp, kaldi_utt2spk, lda_transform)

# ivector-compute-lda --total-covariance-factor=0.0 --dim=$lda_dim \
#   "ark:ivector-subtract-global-mean scp:$nnet_dir/xvectors_$name/xvector.scp ark:- |" \
#   ark:$data/$name/utt2spk $nnet_dir/xvectors_$name/transform.mat

# samples = 20
# dim = 6
# lda_dim = 3

# data = np.random.random((samples, dim))  
# label = np.random.random_integers(0, 2, size=(samples, 1))

# lda = LocalPairwiseLinearDiscriminantAnalysis(lda_dim)
# lda.fit(data, label)
# lda_data = lda.transform(data)

# print (lda_data)
rameshkunasi commented 4 years ago

@sanphiee, I have trained LPLDA+PLDA and Kaldi LDA+PLDA with 150K utterances. I see no improvement in EER with LPLDA+PLDA compared to Kaldi LDA+PLDA.

Is there any method to improve speaker verification?

sanphiee commented 4 years ago

Some suggestions:

1) You can print the eigenvalues of LDA and LPLDA to select a proper dimension; the proper dimensions of LDA and LPLDA may be different (see the first sketch after these suggestions).

2) It also depends on the data; see the attached figure (from the ID R&D NIST SRE19 system description).

You can see that the distributions of x-vectors on NIST SRE04-08, NIST SRE18, and SRE19 are different.

LPLDA requires selecting neighbor points, which is easily influenced by the overall distribution.

3) The whole system configuration is another consideration.

For example, if you use AS-norm as your score post-processing method, the performance of LPLDA may degrade,

because AS-norm also involves selecting neighbor scores (the top-N scores); a sketch of AS-norm follows below.
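
Regarding point 1, a minimal sketch of the eigenvalue check (illustrative, not from the repository): after building the scatter matrices the way LDA.py/LPLDA.py does, the sorted generalized eigenvalues show how much discriminative energy each extra dimension adds, and a knee in their cumulative sum suggests a dimension to keep.

    import numpy as np
    from scipy import linalg

    def eigenvalue_profile(Sb, Sw):
        """Sorted generalized eigenvalues of (Sb, Sw), largest first."""
        evals, _ = linalg.eigh(Sb, Sw)
        return np.sort(evals)[::-1]

    # Example usage, assuming Sb and Sw were accumulated from the training vectors:
    # evals = eigenvalue_profile(Sb, Sw)
    # print(np.cumsum(evals) / np.sum(evals))  # pick the dim where this saturates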
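
Regarding point 3, a minimal sketch of adaptive score normalization (AS-norm) for reference; this is not code from the repository, and the cohort inputs and top_n value are illustrative. Each trial score is normalized by the mean and standard deviation of the top-N cohort scores on the enrollment and test sides, which is the neighbor-score selection mentioned above.

    import numpy as np

    def as_norm(score, enroll_cohort_scores, test_cohort_scores, top_n=200):
        """Adaptive s-norm of a single trial score.

        enroll_cohort_scores : scores of the enrollment model against a cohort,
        test_cohort_scores   : scores of the test segment against the same cohort.
        """
        e_top = np.sort(enroll_cohort_scores)[::-1][:top_n]
        t_top = np.sort(test_cohort_scores)[::-1][:top_n]
        z = (score - e_top.mean()) / e_top.std()   # normalized w.r.t. enrollment side
        t = (score - t_top.mean()) / t_top.std()   # normalized w.r.t. test side
        return 0.5 * (z + t)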


Yours sincerely,

He Liang,

Rohm Building 8101,

Department of Electronic Engineering, Tsinghua University,

Beijing, 10084, China


rameshkunasi commented 4 years ago

@sanphiee

  1. I have tried 4 different combinations of LDA and LPLDA dimensions. In all 4 cases there was no improvement w.r.t. Kaldi LDA+PLDA. Training LPLDA with 150K utterances takes a long time, around 4 to 5 hours, so because of this limitation I did not try more combinations.

  2. If I want to train LPLDA with more than 150K utterances, the process gets killed. My CPU RAM is 16 GB. Is there any way to train LPLDA with more utterances?

sanphiee commented 4 years ago
  1. What is your evaluation data, SRE16? SRE18? I suggest you do some visualization analysis, e.g. t-SNE.
  2. For large data, I guess you can compute the within-class covariance and the local between-class covariance by batch operation, as in the sketch earlier in this thread.

From my iPhone


rameshkunasi commented 4 years ago
  1. My evaluation data is the SITW dataset. I visualized the x-vector data using t-SNE.

If I want to use the x-vector model and PLDA model in real-time scenarios, how do I select the threshold in this case? If I use the same threshold from the SITW evaluation in real-time scenarios, the performance is not up to the mark.