microsoft / CNTK

Microsoft Cognitive Toolkit (CNTK), an open source deep-learning toolkit
https://docs.microsoft.com/cognitive-toolkit/

Any advice on implementing a custom data reader? #814

Closed svml closed 7 years ago

svml commented 8 years ago

I would like to implement a custom data reader that gets its data from efficient binary files instead of text files. Could you please give me some advice on that? For example, which class should my custom data reader inherit from, and which functions do I have to override? Is there anything else I have to do or keep in mind?

Thank you very much for your help!

eldakms commented 8 years ago

Hi svml,

You need to implement your own deserializer. You can have a look at how HTKDataDeserializer/MLFDataDeserializer/ImageDataDeserializer are implemented.

Would it be possible for you to use HTKDataDeserializer? That works only if your binary data is dense, because it does not support sparse data. Thanks!

svml commented 8 years ago

Hi eldakms,

Thank you very much for your reply!

My data is dense, so it sounds like I will be able to use HTKDataDeserializer. I am looking into the possibility of storing and accessing data in an efficient binary format, and I want to create an interface to one such format.

If my data is dense, could you please let me know what steps I need to take to implement a custom data reader?

eldakms commented 8 years ago

Hi svml,

I would encourage you to convert your data to HTK and use HTKDataDeserializer. It is a well-known, efficient format, and there are tools that will help you work with it.
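For the conversion route, the following is a minimal sketch of writing dense float features as an HTK feature file. It assumes the standard HTK layout (a 12-byte big-endian header holding the frame count, the sample period in 100 ns units, the bytes per frame and the parameter kind, followed by big-endian float32 frames) and uses parameter kind 9 (USER). The names `encodeHtk` and `writeHtkFile` are hypothetical helpers, not part of CNTK, and whether HTKDataDeserializer accepts every file produced this way has not been verified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

static_assert(sizeof(float) == 4, "this sketch assumes 32-bit IEEE floats");

// Append a 32-bit value in big-endian byte order (the byte order HTK files use).
inline void putBE32(std::vector<char>& out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<char>((v >> shift) & 0xFF));
}

// Append a 16-bit value in big-endian byte order.
inline void putBE16(std::vector<char>& out, std::uint16_t v) {
    out.push_back(static_cast<char>((v >> 8) & 0xFF));
    out.push_back(static_cast<char>(v & 0xFF));
}

// Hypothetical helper (not a CNTK function): serialise equally sized frames
// into the bytes of one HTK feature file. samplePeriod is in 100 ns units,
// so the default of 100000 means 10 ms per frame (an arbitrary choice).
std::vector<char> encodeHtk(const std::vector<std::vector<float>>& frames,
                            std::uint32_t samplePeriod = 100000) {
    if (frames.empty()) throw std::invalid_argument("no frames");
    const std::size_t dim = frames[0].size();
    // The header stores bytes per frame in a signed 16-bit field.
    if (dim == 0 || dim * sizeof(float) > 0x7FFF)
        throw std::invalid_argument("unsupported frame dimension");

    std::vector<char> out;
    putBE32(out, static_cast<std::uint32_t>(frames.size()));        // nSamples
    putBE32(out, samplePeriod);                                      // sampPeriod
    putBE16(out, static_cast<std::uint16_t>(dim * sizeof(float)));  // sampSize
    putBE16(out, 9);                                                 // parmKind: USER
    for (const auto& frame : frames) {
        if (frame.size() != dim) throw std::invalid_argument("ragged frames");
        for (float x : frame) {
            std::uint32_t bits;
            std::memcpy(&bits, &x, sizeof bits);  // reinterpret the float's bits
            putBE32(out, bits);
        }
    }
    assert(out.size() == 12 + frames.size() * dim * sizeof(float));
    return out;
}

// Hypothetical helper: write the encoded bytes to `path`.
void writeHtkFile(const std::string& path,
                  const std::vector<std::vector<float>>& frames) {
    const std::vector<char> bytes = encodeHtk(frames);
    std::ofstream file(path, std::ios::binary);
    if (!file) throw std::runtime_error("cannot open " + path);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

Calling `writeHtkFile("utt1.mfc", frames)` with one inner vector per sample would produce one feature file per sequence; the file name here is only an illustration.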

If you want to have a custom one, please have a look at how HTKDataDeserializer is implemented. Here are the high-level things you have to do:

These sequences are packed and handed over to the corresponding input node. Please have a look at ReaderLib/DataDeserializer and the several implementations of this interface.

We intend to include a simple sample data reader in the next release of CNTK.

Thank you!

svml commented 8 years ago

Hi eldakms,

Thank you very much for the detailed reply.

It does sound like there are a lot of technical details that need to be understood to even get started. Does CNTK perhaps have a tool to convert CNTK text files into HTK files? Preferably something that takes a CNTK text file as input one line at a time, so that I could convert my data into such lines one sample at a time and feed them one by one into the converter:

    // Wished-for interface (hypothetical)
    HTKFileWriter w("output.htk");
    for (int i = 0; i < nSamples; i++)
        w.Add(features, labels); // string features, string labels
    w.Finish();

A tool that converts a whole text file into an HTK file would, of course, also be very useful.

Also, could you please let me know whether HTK files allow putting different features/labels into different files, so that one could easily create new sets of features/labels without having to recreate all the input data files?

That is, something like:

    // * each file can contain some number of features and/or labels
    // * number of samples in each file is the same
    inputFiles = file1, file2, file3...  

    // for example: file1 contains features f1, f2 and label l1
    //              file2 contains features f3, f4, f5
    //              file3 contains labels l2, l3

Thank you very much for your help!

frankseide commented 8 years ago

The HTK feature format does not support sparse/one-hot data. It seems you need that. You would have to encode every word as a dense one-hot vector, which is likely inefficient.
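To put a rough number on that inefficiency, here is a small sketch comparing the payload size of dense one-hot float32 frames with storing one 32-bit index per token. The function names and the example vocabulary size are hypothetical and chosen only for illustration; file headers are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dense one-hot encoding of one token: all zeros except a 1 at `index`.
std::vector<float> oneHot(std::size_t index, std::size_t vocabSize) {
    assert(index < vocabSize && "token index must be inside the vocabulary");
    std::vector<float> v(vocabSize, 0.0f);
    v[index] = 1.0f;
    return v;
}

// Payload bytes for `numTokens` tokens stored as dense float32 one-hot frames.
constexpr std::uint64_t denseOneHotBytes(std::uint64_t vocabSize,
                                         std::uint64_t numTokens) {
    return vocabSize * numTokens * sizeof(float);
}

// Payload bytes for the same tokens stored as one 32-bit index each.
constexpr std::uint64_t indexBytes(std::uint64_t numTokens) {
    return numTokens * sizeof(std::uint32_t);
}
```

For an assumed vocabulary of 10,000 entries and 1,000 tokens, `denseOneHotBytes(10000, 1000)` is 40,000,000 bytes, while `indexBytes(1000)` is 4,000 bytes, so the dense form is larger by a factor equal to the vocabulary size.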

On the other hand, yes, features and labels are in different files. HTK uses the "master label file" (MLF) format, which is very speech-focussed. So, depending on your task, you could try to read dense float features from HTK-formatted feature files and labels with the CNTKTextFormatReader.

@eldakms, is this supported?

eldakms commented 8 years ago

This has not been tested with speech, but we have this setup with images, so it should work. If there are any issues, please ping me. Just make sure that the scp files contain sequences with the same (numeric) keys as the .ctf data, because CNTKTextFormat currently supports only numeric sequence ids. For example:

in the scp file:

    1.mfc=[...]
    2.mfc=[...]
    3.mfc=[...]
    ....

and in the ctf file:

    1 |labels ....
    2 |labels...
    3 |labels...
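A quick way to catch mismatched keys before training is to compare the two files offline. The sketch below is a hypothetical checker, not a CNTK tool. It assumes that an scp entry's key is the text before `=` with its extension dropped (so `1.mfc=...` has key `1`) and that a ctf line's sequence id is the text before the first space, tab or `|`; both assumptions should be checked against the reader documentation.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Key of one scp entry "key.ext=...": text before '=' with the extension dropped.
// ASSUMPTION: this mirrors how the sequence key is derived; not verified here.
std::string scpKey(const std::string& line) {
    assert(line.find('=') != std::string::npos && "scp entries look like key=path");
    const std::string logical = line.substr(0, line.find('='));
    const std::size_t dot = logical.rfind('.');
    return dot == std::string::npos ? logical : logical.substr(0, dot);
}

// Sequence id of one ctf line: text before the first space, tab or '|'.
std::string ctfKey(const std::string& line) {
    return line.substr(0, line.find_first_of(" \t|"));
}

// True if the key consists of decimal digits only.
bool isNumericId(const std::string& key) {
    if (key.empty()) return false;
    for (unsigned char c : key)
        if (!std::isdigit(c)) return false;
    return true;
}

// Collect the distinct keys of all non-empty lines in a stream.
template <typename KeyFn>
std::set<std::string> collectKeys(std::istream& in, KeyFn keyOf) {
    std::set<std::string> keys;
    std::string line;
    while (std::getline(in, line))
        if (!line.empty()) keys.insert(keyOf(line));
    return keys;
}

// True if both files name the same set of keys and every key is numeric.
bool keysMatch(std::istream& scp, std::istream& ctf) {
    const std::set<std::string> scpKeys = collectKeys(scp, scpKey);
    const std::set<std::string> ctfKeys = collectKeys(ctf, ctfKey);
    for (const std::string& k : scpKeys)
        if (!isNumericId(k)) return false;
    return scpKeys == ctfKeys;
}
```

The streams can be `std::ifstream` objects opened on the real files; a ctf sequence that spans several lines contributes its id only once because the keys are gathered into a set.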

The config should use both deserializers:

    reader = [
        randomize = true
        deserializers = (
            [
                type = "HTKDeserializer"
                .....
            ] : [
                type = "CNTKTextFormatDeserializer"
                ....
            ]
        )
    ]

More info can be found here: https://github.com/Microsoft/CNTK/wiki/Understanding-and-Extending-Readers

Thank you!