
about the saving format in ASR #111

boji123 opened this issue 5 years ago (Open)

boji123 commented 5 years ago

In ASR, the saving format of create_asr_features is sparse: the data is saved as [index, value] pairs, which is redundant and not efficient at all. The TFRecord file is much larger than the corresponding Kaldi archive, so it occupies much more I/O. I hope someone can look into this.

boji123 commented 5 years ago

The issue may be caused by flat_frames = frames.flatten() at https://github.com/tensorflow/lingvo/blob/179aa83e73c71e157f567d12e7eea5dff5fd7a1f/lingvo/tools/create_asr_features.py#L68

drpngx commented 5 years ago

@boji123 we're using the TensorFlow Example proto inside the TFRecord. It only supports flat vectors, so we flatten the frames so that the vector contains the concatenated feature vectors for all timesteps.
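
For illustration, a minimal sketch of that layout (the feature name and dimensions are made up, not the exact ones in create_asr_features.py):

import numpy as np
import tensorflow as tf

# Hypothetical [num_frames, feature_dim] filterbank features.
frames = np.random.rand(48, 80).astype(np.float32)

# Example protos only hold flat lists, so the 2-D array is flattened;
# the float_list then holds the concatenated vectors for all timesteps.
flat_frames = frames.flatten()
example = tf.train.Example(features=tf.train.Features(feature={
    'frames': tf.train.Feature(
        float_list=tf.train.FloatList(value=flat_frames)),
}))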

The proto is not really that efficient. You can enable gzip compression for the format:

import tensorflow as tf

# Write gzip-compressed TFRecords (TF 1.x API).
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
writer = tf.python_io.TFRecordWriter(outFilePath, options=options)

But it's going to use more CPU during training because of decompression. Kaldi does smarter things, and you might not gain a lot. If it works for you, try proposing LZ4HC support in TensorFlow.
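
On the read side the same compression has to be declared, for example (again TF 1.x APIs):

import tensorflow as tf

# Iterate over a gzip-compressed TFRecord file.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.GZIP)
for record in tf.python_io.tf_record_iterator(outFilePath, options=options):
    example = tf.train.Example.FromString(record)

# Or, in a tf.data input pipeline:
dataset = tf.data.TFRecordDataset(outFilePath, compression_type='GZIP')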

boji123 commented 5 years ago

When I print the tensor information at https://github.com/tensorflow/lingvo/blob/91a85acbd19db8feebb0690fe843e411ca893056/lingvo/tools/create_asr_features.py#L73, it returns a sparse vector, like:

        value: 10.3430614471
        value: 10.035987854
        value: 10.0856647491
        value: 10.8315954208
        value: 10.6160020828
        ... (40 more values) ...
        value: 7.49394321442
        value: 7.43842458725
        value: 6.67486095428

Maybe you can check this; it looks abnormal, because when parsing the examples it returns a sparse frame vector, and converting and reshaping the sparse vector into [batch, dim] may be a time-consuming operation. I think that when saving, numpy.tostring could help generate a contiguous feature buffer, and when reading, numpy.fromstring could convert the byte string back to a numpy array, which can then be converted into a tensor.
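
A minimal sketch of that tostring/fromstring idea, assuming TF 1.x APIs (the feature name and dimension are made up; the sparse result above most likely comes from parsing the variable-length float_list with tf.VarLenFeature, which a fixed-length bytes feature avoids):

import numpy as np
import tensorflow as tf

FEATURE_DIM = 80  # assumed feature dimension

# Writing: serialize the [num_frames, dim] float32 array as one byte string.
frames = np.random.rand(48, FEATURE_DIM).astype(np.float32)
example = tf.train.Example(features=tf.train.Features(feature={
    'frames_raw': tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[frames.tostring()])),
}))

# Reading: parse the bytes feature as a dense scalar string and decode it
# back to a dense float tensor, so no sparse-to-dense conversion is needed.
parsed = tf.parse_single_example(example.SerializeToString(), features={
    'frames_raw': tf.FixedLenFeature([], tf.string),
})
dense_frames = tf.reshape(
    tf.decode_raw(parsed['frames_raw'], tf.float32), [-1, FEATURE_DIM])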