nttcslab-sp / kaldiio

A pure python module for reading and writing kaldi ark files

read to download sample wav.scp file(include pipe sox) #27

Open shanguanma opened 5 years ago

shanguanma commented 5 years ago

Hi all, I want to use the kaldiio library to read wav.scp and segments files, but my wav.scp file contains pipe commands like the following:

ui23faz_0101 /usr/bin/sox /path/ui23faz_0102/ui23faz_0102.wav -r 16000 -c 1 -b 16 -t wav - downsample |

The kaldiio reader does not work with it. Does kaldiio not support this kind of wav.scp?
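For readers unfamiliar with the format: a wav.scp value that ends in | is a command whose standard output is the wav data, rather than a path to a file. The following is only a minimal sketch of that idea, not kaldiio's actual implementation. The helper names parse_scp_line and read_entry_bytes are hypothetical, passing the command to the shell is an assumption, and a Python one-liner stands in for sox so the sketch runs anywhere.

```python
import shlex
import subprocess
import sys


def parse_scp_line(line):
    """Split one wav.scp line into (utterance_id, value, is_pipe)."""
    utt_id, value = line.strip().split(None, 1)
    # A trailing "|" marks the value as a command, not a file path.
    is_pipe = value.endswith("|")
    if is_pipe:
        value = value[:-1].strip()
    return utt_id, value, is_pipe


def read_entry_bytes(value, is_pipe):
    """Return the raw bytes of an entry, running the command if it is a pipe."""
    if is_pipe:
        # Assumption: the command is handed to the shell as a single string.
        # check=True raises CalledProcessError on a non-zero exit status.
        result = subprocess.run(
            value, shell=True, stdout=subprocess.PIPE, check=True
        )
        return result.stdout
    with open(value, "rb") as f:
        return f.read()


# Stand-in command (hypothetical) so the sketch does not need sox or a wav file.
command = f'{shlex.quote(sys.executable)} -c "print(123)"'
utt_id, value, is_pipe = parse_scp_line(f"utt1 {command} |")
data = read_entry_bytes(value, is_pipe)
print(utt_id, is_pipe, data)
```

In a real pipeline the returned bytes would then be decoded by a wav reader, which is the step that fails later in this thread.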

nttcslab-sp-admin commented 5 years ago

Thank you for using our tool. Could you show me the error log?

shanguanma commented 5 years ago

Thank you for your reply. This is my error log:

Colocations handled automatically by placer.
2019-05-08 19:51:53.221278: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2019-05-08 19:51:53.225237: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3000125000 Hz
2019-05-08 19:51:53.225357: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5555599691e0 executing computations on platform Host. Devices:
2019-05-08 19:51:53.225375: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2019-05-08 19:51:53.225462: I tensorflow/core/common_runtime/process_util.cc:71] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
/usr/bin/sox WARN wav: Premature EOF on .wav input file
/usr/bin/sox WARN rate: rate clipped 17 samples; decrease volume?
/usr/bin/sox WARN dither: dither clipped 12 samples; decrease volume?
/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/utils.py:320: UserWarning: An error happens at loading "/usr/bin/sox /home4/md510/w2018/original_seame/wavdata/interview/ui23faz_0102/ui23faz_0102.wav -r 16000 -c 1 -b 16 -t wav - downsample |"
  'An error happens at loading "{}"'.format(ark_name))
Traceback (most recent call last):
  File "local/compute-fbank-feats.py", line 93, in <module>
    main()
  File "local/compute-fbank-feats.py", line 81, in main
    for utt_id, (rate, array) in reader:
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/highlevel.py", line 128, in __iter__
    k, v = next(self.generator)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/matio.py", line 115, in load_scp_sequential
    segments=segments).generator():
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/matio.py", line 162, in generator
    cached[recodeid] = self.wav_loader[recodeid]
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/utils.py", line 317, in __getitem__
    return self._loader(ark_name)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/matio.py", line 205, in load_mat
    use_scipy_wav=offset is None)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/matio.py", line 265, in _load_mat
    array = read_kaldi(fd, endian, use_scipy_wav=use_scipy_wav)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/matio.py", line 334, in read_kaldi
    array, size = read_wav_scipy(fd, return_size=True)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/kaldiio/wavio.py", line 44, in read_wav_scipy
    rate, array = wavfile.read(fd)
  File "/home3/md510/anaconda3/lib/python3.7/site-packages/scipy/io/wavfile.py", line 246, in read
    raise ValueError("Unexpected end of file.")
ValueError: Unexpected end of file.

nttcslab-sp-admin commented 5 years ago

Maybe your wav file has a problem. kaldiio just uses scipy to load wav files, so you can check it as follows:

/usr/bin/sox /path/ui23faz_0102/ui23faz_0102.wav -r 16000 -c 1 -b 16 -t wav - downsample > out.wav
python
>>> import scipy.io.wavfile
>>> scipy.io.wavfile.read('out.wav')

shanguanma commented 5 years ago

Thanks for your reply. I tested with your method, and my wav file has no problem. The test results are as follows:

/usr/bin/sox /home4/md510/w2018/original_seame/wavdata/interview/ui23faz_0101/ui23faz_0101.wav -r 16000 -c 1 -b 16 -t wav - downsample > out.wav
/usr/bin/sox WARN rate: rate clipped 17 samples; decrease volume?
/usr/bin/sox WARN dither: dither clipped 17 samples; decrease volume?
python3
>>> import scipy.io.wavfile
>>> scipy.io.wavfile.read('out.wav')
(16000, array([ -1, 1, -1, ..., -17, -5, 4], dtype=int16))

nttcslab-sp-admin commented 5 years ago

Your wave file has incorrect file size information in its header, and scipy.io.wavfile does not support such wave files:

 /usr/bin/sox WARN wav: Premature EOF on .wav input file

The new kaldiio release uses Python's wave module instead. Try pip install -U kaldiio.
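One way to see what "incorrect file size information in the header" means is to compare the size declared in the RIFF header with the number of bytes actually present. The sketch below is a diagnostic illustration only, not what scipy or kaldiio do internally. The helper name declared_vs_actual is hypothetical, and the truncated file is simulated in memory rather than taken from the corpus in this thread.

```python
import io
import struct
import wave


def declared_vs_actual(wav_bytes):
    """Return (size declared in the RIFF header, bytes actually present after it)."""
    if wav_bytes[:4] != b"RIFF" or wav_bytes[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # Bytes 4-7 hold a little-endian uint32: the number of bytes that
    # should follow this field.
    (declared,) = struct.unpack("<I", wav_bytes[4:8])
    actual = len(wav_bytes) - 8
    return declared, actual


# Build a small valid wav in memory: 1 second of 16 kHz, 16-bit mono silence.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)
good = buf.getvalue()

# Simulate a truncated file: the header is untouched, but samples are missing.
truncated = good[: len(good) - 16000]

print(declared_vs_actual(good))
print(declared_vs_actual(truncated))
```

When the declared size is larger than the data that follows, a reader that trusts the header will run out of input before it expects to.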

shanguanma commented 5 years ago

Thank you, I upgraded the kaldiio library as you suggested. In addition, I generated mel-fbank features for a small 6-hour data set and wrote them in Kaldi's ark and scp file format. With 10 processes, this took one hour and four minutes. But when I switched to a larger data set (96 hours) with 32 processes, the program had still not finished after 30 hours. Does kaldiio's reading and writing get slower over time?

nttcslab-sp-admin commented 5 years ago

Maybe. Simple reading without a segments file should not be much slower than Kaldi, because it just uses subprocess to invoke the commands and scipy or Python's wave module to read the data, but I haven't optimized it for segments.

Could you tell me more about your case: how long is each wave file, and how long are the segments in the wave files? If you can, attaching the scp file and the segments file would help me.
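As background for this question: a Kaldi segments file lists one utterance per line as an utterance id, a recording id, and start and end times in seconds. With long recordings and short segments, it matters that each recording is decoded once and then sliced many times; the traceback earlier in this thread shows kaldiio caching the loaded recording by id. The sketch below only illustrates that idea and is not kaldiio's code. The helper names parse_segments and cut_segments are hypothetical, and a plain list stands in for decoded audio.

```python
from collections import defaultdict


def parse_segments(lines):
    """Group '<utt-id> <rec-id> <start-sec> <end-sec>' lines by recording id."""
    by_recording = defaultdict(list)
    for line in lines:
        utt_id, rec_id, start, end = line.split()
        by_recording[rec_id].append((utt_id, float(start), float(end)))
    return by_recording


def cut_segments(samples, rate, segments):
    """Yield (utt_id, slice of samples) for each (utt_id, start, end) triple."""
    for utt_id, start, end in segments:
        # Convert seconds to sample indices and slice the decoded recording.
        yield utt_id, samples[int(start * rate):int(end * rate)]


# Made-up example data, not taken from the corpus discussed in this thread.
segment_lines = [
    "utt1 rec1 0.0 2.0",
    "utt2 rec1 2.5 4.0",
]
rate = 16000
recording = list(range(5 * rate))  # stand-in for 5 seconds of decoded audio

by_recording = parse_segments(segment_lines)
# The recording is "loaded" once above and sliced once per segment here.
cuts = dict(cut_segments(recording, rate, by_recording["rec1"]))
print({utt_id: len(chunk) for utt_id, chunk in cuts.items()})
```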

shanguanma commented 5 years ago

Thank you for your reply. This 96-hour data set works well in Kaldi, but when I used kaldiio's matrix read/write interface, it ran for three days without finishing feature extraction. As you requested, here are the details of my data set: each wave file is about 1-2 hours long, and each segment is about 2-7 seconds long.

nttcslab-sp-admin commented 5 years ago

I created a test set that closely matches your corpus, but in my environment it does not take such a long time. It ran at the same speed as Kaldi itself.

I was curious that your log included TensorFlow messages. Are you trying to extract features from the wav files inside a training script?

In general, invoking a subprocess takes much longer if a large amount of memory is allocated.

For example,

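import subprocess
import time

import numpy

# Baseline: time a trivial subprocess call before allocating anything large.
t = time.time()
subprocess.run('echo hello', shell=True)
print(f'{time.time() - t} [s]')

# Allocate a large array (10^8 float64 values, about 800 MB).
x = numpy.ones((100000000,))
t = time.time()
# Takes much more time
subprocess.run('echo hello', shell=True)
print(f'{time.time() - t} [s]')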

This is not the fault of Python's subprocess module; it is how the fork() system call behaves. Thus, if you invoke sox via wav.scp, take care to allocate as little extra memory as possible.

shanguanma commented 5 years ago

Thanks for your reply, I'm going to check code somewhere else.