beomyeol opened this issue 8 years ago
I agree with your concerns. Dolphin is a framework, not an application. But, at this point, I have a concern about reinventing the wheel. Why don't you use ND4J serialization? If you agree with using the ND4J format, @swlsw and I will make a converter for it. Let's keep the assumption simple, as follows.
By the way, in reality, we use neither the MNIST vector format nor the CIFAR-10 pickle files. We mostly use only the original files, such as JPEG.
Thank you for your suggestion, @dongjoon-hyun. I will consider ND4J serialization and discuss this with @jsjason. If it is okay to use it, I will let you know and start implementing it.
I can connect to SKT cluster through VPN. Thanks to @jsjason.
Thank you for considering it. By the way, I found the following code in DL4J and ND4J. Actually, the file is a plain text file delimited by spaces: " "
ClassPathResource resource = new ClassPathResource("/mnist2500_X.txt");
File f = resource.getFile();
INDArray data = Nd4j.readNumpy(f.getAbsolutePath()," ").get(NDArrayIndex.interval(0,100),NDArrayIndex.interval(0,784));
/**
 * Read an ndarray from a delimited text file
 *
 * @param filePath the path of the text file to read
 * @param split the delimiter separating values
 * @return the ndarray read from the file
 */
public static INDArray readNumpy(String filePath, String split) throws IOException {
    return readNumpy(new FileInputStream(filePath), split);
}
/**
 * Read an ndarray from a tab-delimited text file
 *
 * @param filePath the path of the text file to read
 * @return the ndarray read from the file
 */
public static INDArray readNumpy(String filePath) throws IOException {
    return readNumpy(filePath, "\t");
}
I think we already have a NumPy-compatible read function in ND4J.
Thank you for letting me know about readNumpy(). But I have a concern: is it okay to use a plain text file with a delimiter? A plain text file needs more space than a binary file.
In addition, I looked at the code of readNumpy() in the ND4J library. It supports NumPy-compatible plain text files, but does not support NumPy-compatible binary files such as .npy or .npz.
Yep. That is right. But I think we can depend on that part of the ND4J layer.
If we design our architecture with an ND4J layer that handles readNumpy, the conversion job for NumPy is a piece of cake. We can implement a converter for NumPy as just a small Python script that opens .npy files and stores .txt files. :)
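Such a converter could be as small as this sketch (the file names are just illustrative, and the space delimiter is chosen to match the DL4J example above):

```python
import numpy as np

def npy_to_txt(npy_path, txt_path, delimiter=" "):
    """Dump a binary .npy array as the space-delimited plain text
    that ND4J's readNumpy() can parse (one matrix row per line)."""
    arr = np.load(npy_path)                 # binary NumPy input
    np.savetxt(txt_path, arr, delimiter=delimiter)  # plain-text output

# e.g. npy_to_txt("mnist2500_X.npy", "mnist2500_X.txt")
```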
By the way, for efficiency, we have to distinguish between the input file format and the internal storage format. My opinion so far is to use readNumpy().
@dongjoon-hyun When you say 'internal storage format', are you referring to the intermediate and final output data?
One thing I am concerned about is our dependency on ND4J. I don't know much about scientific computing libraries, but is it okay to rely on ND4J this much? We could search for and use a library with a larger community.
@dongjoon-hyun, Okay, we can decide to use a plain text format as the input format. I have one more concern about it. REEF does not support multiple data sources now, as we discussed in #63. We need to put images and labels into a single text file and decide on this file format. I am thinking about the following format.
(image) (delimiter) (label) (newline)
(image) (delimiter) (label) (newline)
...
(image) (delimiter) (label) (newline)
By using readNumpy(), image and label data can be loaded, and we can set ',' as the delimiter, for example. Is this format fine to use? If so, I don't think we need a custom InputFormat and RecordReader. We can just use TextInputFormat.
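Producing that combined image/label file could look like this minimal Python sketch (the function name and the ',' delimiter are just illustrative):

```python
import numpy as np

def write_dataset(images, labels, path, delimiter=","):
    """Write one example per line: n pixel values followed by the label,
    all separated by the same delimiter -- an m x (n + 1) text matrix
    that TextInputFormat can split line by line."""
    data = np.hstack([images, np.asarray(labels).reshape(-1, 1)])
    np.savetxt(path, data, delimiter=delimiter)
```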
@beomyeol It'd be nice to resolve #63 eventually. But if #63 takes time, we should address it later since there are other more important issues.
@jsjason, by 'internal storage format' I really meant dolphin's own internal format, if needed. It is not the output format.
As for the dependency, I always welcome further research and proposals for a better BLAS library supporting CPU/GPU. :)
Uh, @beomyeol, I meant a float matrix for dolphin. Sorry for confusing you. All image/sound/text data will be transformed by me and @swlsw into NumPy matrices for dolphin; dolphin has no need to care about that. What I described above is the real final application goal. For dolphin's neural network algorithm, you can assume a float matrix as input and perform mathematical operations only.
For the train/test data and labels, you can read them in a similar way to what you described, i.e., as an m x (n + 1) matrix.
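Reading that m x (n + 1) matrix back and splitting features from labels is then a single slicing step; a minimal NumPy sketch (the function name is hypothetical):

```python
import numpy as np

def load_dataset(path, delimiter=","):
    """Split an m x (n + 1) text matrix into an m x n feature
    matrix X and a length-m label vector y (last column)."""
    data = np.loadtxt(path, delimiter=delimiter)
    return data[:, :-1], data[:, -1]
```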
In addition, dolphin should be able to load a pre-trained model. This is more important. Do you have any ideas for this?
The pre-trained model equals the initial parameter set for the DNN case, right? Unlike the other algorithms, for DNNs we are trying to provide a ParameterInitializer that generates the initial values for edge weights and biases.
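As a sketch of what such an initializer might produce per layer (small zero-mean Gaussian weights and zero biases are just one common choice here, not necessarily what ParameterInitializer will actually do):

```python
import numpy as np

def init_layer(fan_in, fan_out, stddev=0.01, seed=0):
    """Generate initial parameters for one fully connected layer:
    a fan_in x fan_out weight matrix drawn from a small zero-mean
    Gaussian, and a zero bias vector of length fan_out."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(0.0, stddev, size=(fan_in, fan_out))
    biases = np.zeros(fan_out)
    return weights, biases
```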
@dongjoon-hyun. I am still a little bit confused. What is the format of the file which dolphin loads? Is it a NumPy-compatible plain text file format like 'mnist2500_X.txt' in DL4J?
In addition, I have not thought about the pre-trained model yet. We may need a snapshot feature for the neural network and a feature for reconstructing the neural network from the snapshot. I'd like to discuss this as a separate issue.
@jsjason, that's right. ParameterInitializer sounds good!
@beomyeol. For the first question, yes. For the second question, @jsjason answered it in the previous comment.
Thanks, @beomyeol and @dongjoon-hyun. Let's keep this issue open since we'll probably have more discussions when PRs start to come up.
Thanks @dongjoon-hyun for your comment :)
For various datasets, the data is stored in different file formats. For example, the MNIST database is saved in its own file format, while the CIFAR-10 database is stored in Python pickle files as well as its own file format. Supporting all of these file formats is too burdensome. So, I suggest defining a new file format which our DNN uses to load data from files. In order to support a variety of datasets such as MNIST and ImageNet, we can convert these datasets to our file format and provide them to the DNN. After defining the new file format, we also need an InputFormat and a RecordReader for it to run our neural network on REEF.