Define a new file format and implement InputFormat and RecordReader for it

beomyeol commented 8 years ago

Different datasets store their data in different file formats. For example, the MNIST database uses its own file format, and the CIFAR-10 database is distributed as Python pickle files and in its own file format. Supporting all of these file formats is too burdensome, so I suggest defining a new file format that our DNN uses to load data from files. To support a variety of datasets such as MNIST and ImageNet, we can convert them to our file format and provide them to the DNN. After defining the new file format, we also need an InputFormat and a RecordReader for it in order to run our neural network on REEF.

dongjoon-hyun commented 8 years ago

I agree with your concerns. Dolphin is a framework, not an application. At this point, however, I am concerned about reinventing the wheel. Why don't you use ND4J serialization? If you agree to use the ND4J format, @swlsw and I will write a converter for it. Let's simply assume the following.

All data take matrix form, using the INDArray interface.
The training / cross-validation / test data sets will be loaded as an m x n matrix.
- m is the number of instances.
- n is the number of features.
The label data set will be loaded as an m x 1 vector.

By the way, in reality we use neither the MNIST vector format nor the CIFAR-10 pickle file. We mostly use only the original files, such as JPEG.

For MNIST, the original data are JPEG files (black and white). The folder structure is described in #63 (hdfs://data/image/mnist/jpg). @jsjason, could you download this from the SKT cluster into your cluster if you need it? You can also use the SKT cluster itself.
For ImageNet, the original data are also JPEG files (RGB color). You can download them from our cluster as well (hdfs://data/image/imagenet/tar). You can connect through VPN. Please get one from @jsjason.

beomyeol commented 8 years ago

Thank you for your suggestion, @dongjoon-hyun. I will consider ND4J serialization and discuss this with @jsjason. If it is okay to use it, I will let you know and start implementing it.

I can connect to the SKT cluster through VPN, thanks to @jsjason.

dongjoon-hyun commented 8 years ago

Thank you for considering it. By the way, I found the following code in DL4J and ND4J. The file is actually a plain text file delimited by spaces (three spaces: "   ").

DL4J

        ClassPathResource resource = new ClassPathResource("/mnist2500_X.txt");
        File f = resource.getFile();
        INDArray data = Nd4j.readNumpy(f.getAbsolutePath(),"   ").get(NDArrayIndex.interval(0,100),NDArrayIndex.interval(0,784));

ND4J

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @param split    the split separator
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath, String split) throws IOException {
        return readNumpy(new FileInputStream(filePath), split);
    }

    /**
     * Read line via input streams
     *
     * @param filePath the input stream ndarray
     * @return the read txt method
     */
    public static INDArray readNumpy(String filePath) throws IOException {
        return readNumpy(filePath, "\t");
    }

I think we already have a NumPy-compatible read function in ND4J.

beomyeol commented 8 years ago

Thank you for letting me know about readNumpy(). But I have a concern: is it okay to use a plain text file with a delimiter? A plain text file needs more space than a binary file.

In addition, I looked at the code of readNumpy() in the ND4J library. It supports NumPy-compatible plain text files, but it does not support NumPy binary files such as .npy or .npz.
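
To make the space concern concrete, here is a minimal sketch that compares the number of bytes needed to store the same float values as space-delimited text and as raw 4-byte floats. The class and method names are hypothetical and not part of Dolphin or ND4J, and the actual gap depends on how many digits each value prints with.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not Dolphin or ND4J code.
class FormatSizeSketch {
    // Bytes needed to store the values as text, separated by single spaces.
    static int textSize(float[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(values[i]); // uses Float.toString formatting
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    // Bytes needed to store the values as raw 4-byte floats.
    static int binarySize(float[] values) {
        ByteBuffer buffer = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            buffer.putFloat(v);
        }
        return buffer.position();
    }
}
```

Short values such as 0.0 cost about the same in both forms, so the difference mostly shows up for values that print with many digits.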

dongjoon-hyun commented 8 years ago

Yep, that's right. But I think we can rely on the ND4J layer for that part. If we design our architecture with an ND4J layer that handles readNumpy, converting from NumPy is a piece of cake. We can implement the converter as a small Python script that opens a .npy file and stores it as .txt. :)

dongjoon-hyun commented 8 years ago

By the way, for the sake of efficiency, we have to distinguish between the input file format and the internal storage format. My opinions so far are as follows.

For the input format, just call readNumpy().
For the internal storage format, just use ND4J serialization.

jsjason commented 8 years ago

@dongjoon-hyun When you say 'internal storage format', are you referring to the intermediate and final output data?

jsjason commented 8 years ago

One thing I am concerned about is our dependency on ND4J. I don't know much about scientific computing libraries, but is it okay to rely on ND4J this much? We could search for and use a library with a larger community.

beomyeol commented 8 years ago

@dongjoon-hyun, okay, we can use a plain text format as the input format. I have one more concern about it. REEF does not currently support multiple data sources, as we discussed in #63. We need to put images and labels into a single text file and design the file format accordingly. I am thinking of the following format.

(image) (delimiter) (label) (newline) (image) (delimiter) (label) (newline) ... (image) (delimiter) (label) (newline)

Using readNumpy(), both image and label data can be loaded, and we can set ',' as the delimiter, for example. Is this format fine to use? If so, I don't think we need a custom InputFormat and RecordReader. We can just use TextInputFormat.
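
As a rough illustration of the proposed record layout, the sketch below parses one text line into an image vector and a label. It assumes the image is flattened into delimiter-separated float values and that the label is the last field on the line. `LabeledRecord` is a hypothetical name, not an existing Dolphin, REEF, or ND4J class.

```java
import java.util.regex.Pattern;

// Hypothetical record type for illustration only.
class LabeledRecord {
    final float[] features;
    final float label;

    LabeledRecord(float[] features, float label) {
        this.features = features;
        this.label = label;
    }

    // Parses "(image) (delimiter) (label)", where the image is assumed to be
    // a flattened list of values that uses the same delimiter.
    static LabeledRecord parse(String line, String delimiter) {
        // Quote the delimiter so that it is matched literally, not as a regex.
        String[] fields = line.trim().split(Pattern.quote(delimiter));
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected at least one feature and a label");
        }
        float[] features = new float[fields.length - 1];
        for (int i = 0; i < features.length; i++) {
            features[i] = Float.parseFloat(fields[i].trim());
        }
        float label = Float.parseFloat(fields[fields.length - 1].trim());
        return new LabeledRecord(features, label);
    }
}
```

With TextInputFormat, each line read from the file could be passed to a parser like this one, which is why a custom RecordReader may not be needed.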

bgchun commented 8 years ago

@beomyeol It'd be nice to resolve #63 eventually. But if #63 takes time, we should address it later since there are other more important issues.

dongjoon-hyun commented 8 years ago

@jsjason, by 'internal storage format' I really meant Dolphin's own internal format, if one is needed. It is not the output format.

As for the dependency, I always welcome further research and proposals for a better BLAS library that supports CPU/GPU. :)

dongjoon-hyun commented 8 years ago

Er, @beomyeol, I meant a float matrix for Dolphin. Sorry for confusing you. All image/sound/text data will be transformed by @swlsw and me into NumPy matrices for Dolphin, so Dolphin does not need to care about that. What I described above is the real final application goal. For the Dolphin neural network algorithm, you can assume a float matrix as input and perform mathematical operations only.

dongjoon-hyun commented 8 years ago

For the train/test data and labels, you can read them in a similar way to what you described, i.e., as an m x (n + 1) matrix.

m : the number of data instances
n : the number of features
1 : the last column is the label column.
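
The m x (n + 1) layout above can be split back into an m x n feature matrix and an m-element label vector. The following is a minimal sketch using plain Java arrays rather than ND4J's INDArray. `LabeledMatrix` is a hypothetical helper name, and every row is assumed to have at least one feature column plus the label column.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; uses plain arrays instead of INDArray.
class LabeledMatrix {
    // Returns the first n columns of each row (the features).
    static float[][] features(float[][] data) {
        float[][] x = new float[data.length][];
        for (int i = 0; i < data.length; i++) {
            x[i] = Arrays.copyOfRange(data[i], 0, data[i].length - 1);
        }
        return x;
    }

    // Returns the last column of each row (the labels).
    static float[] labels(float[][] data) {
        float[] y = new float[data.length];
        for (int i = 0; i < data.length; i++) {
            y[i] = data[i][data[i].length - 1];
        }
        return y;
    }
}
```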

In addition, Dolphin should be able to load a pre-trained model. This is more important. Do you have any ideas for this?

jsjason commented 8 years ago

The pre-trained model equals the initial parameter set for the DNN case, right? Unlike the other algorithms, for DNNs we are trying to provide a ParameterInitializer that generates the initial values for edge weights and biases.

beomyeol commented 8 years ago

@dongjoon-hyun, I am still a little confused. What is the format of the file that Dolphin loads? Is it a NumPy-compatible plain text file format like 'mnist2500_X.txt' in DL4J?

In addition, I have not thought about the pre-trained model yet. We may need a feature for taking a snapshot of a neural network and a feature for reconstructing the neural network from the snapshot. I'd like to discuss this as a separate issue.
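
One possible shape for such a snapshot feature, offered only as a sketch under assumptions: the network's parameters are treated as a flat float array, written as a length prefix followed by raw floats, and read back the same way. `WeightSnapshot` and its byte layout are hypothetical and do not reflect any design decided in this thread.

```java
import java.nio.ByteBuffer;

// Hypothetical snapshot helper for illustration only.
class WeightSnapshot {
    // Layout (an assumption): 4-byte count, then one 4-byte float per weight.
    static byte[] save(float[] weights) {
        ByteBuffer buffer = ByteBuffer.allocate(Integer.BYTES + weights.length * Float.BYTES);
        buffer.putInt(weights.length);
        for (float w : weights) {
            buffer.putFloat(w);
        }
        return buffer.array();
    }

    // Reconstructs the weights from bytes produced by save().
    static float[] load(byte[] snapshot) {
        ByteBuffer buffer = ByteBuffer.wrap(snapshot);
        float[] weights = new float[buffer.getInt()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = buffer.getFloat();
        }
        return weights;
    }
}
```

A real snapshot would also need to record the network structure (layer sizes and types), which this sketch leaves out.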

dongjoon-hyun commented 8 years ago

@jsjason, that's right. ParameterInitializer sounds good!

dongjoon-hyun commented 8 years ago

@beomyeol, for the first question, yes. For the second question, @jsjason answered in the previous comment.

jsjason commented 8 years ago

Thanks, @beomyeol and @dongjoon-hyun. Let's keep this issue open, since we will probably have more discussions when PRs start to come in.

beomyeol commented 8 years ago

Thanks, @dongjoon-hyun, for your comment :)

snuspl / dolphin

Define a new file format and implement InputFormat and RecordReader for it #77