graykode / matorage

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow v2, Keras)
https://matorage.readthedocs.io

A few questions about usage #1

Open jinserk opened 4 years ago

jinserk commented 4 years ago

It's really fantastic! Thank you so much for sharing this project. I ran a quick test with a MinIO Docker container and confirmed it works really well, as expected. I'd like to ask a few questions about usage:

graykode commented 4 years ago

@jinserk Thank you for your interest in the project!

```python
traindata_saver({'image': image, 'target': target})
```

- If I add more data samples to an existing dataset (in the case where samples are added periodically, so the whole dataset has to be refreshed with the added samples), will it be okay to add to the dataset and save it?: If you simply add more data (append mode), it doesn't matter; just save using the existing config. However, refreshing the data is not currently implemented. If you want to refresh a dataset (which means removing its bucket), you should use the MinIO web console or the [mc command](https://github.com/minio/mc) from MinIO (`mc rb --force --dangerous local/<bucket_name>`); a scripted Python equivalent is sketched below. I will later implement a refresh method by adding a new option to the data saver, like this:
```python
traindata_saver({
    'image': image,
    'target': target
}, refresh=True)  # I will add this argument
```

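Until that `refresh` option lands, the bucket removal can also be scripted with the MinIO Python SDK instead of `mc`. A minimal sketch, assuming a local MinIO instance; the endpoint, credentials, and bucket name are placeholders:

```python
# Manual "refresh" workaround: empty the dataset's bucket and remove it,
# equivalent to `mc rb --force --dangerous local/<bucket_name>`.
from minio import Minio

client = Minio(
    "127.0.0.1:9000",        # placeholder endpoint
    access_key="minio",      # placeholder credentials
    secret_key="miniosecretkey",
    secure=False,
)

bucket = "my-dataset-bucket"  # hypothetical bucket backing the dataset
for obj in client.list_objects(bucket, recursive=True):
    client.remove_object(bucket, obj.object_name)
client.remove_bucket(bucket)  # bucket must be empty before removal
```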
Best regards, Tae Hwan

jinserk commented 4 years ago

@graykode Thank you very much for the detailed answers! It's really helpful, and very impressive.

My first question was actually about heterogeneous tensor shapes, i.e. the case where the image size in your MNIST example varies sample by sample. Practically, I'm working on a chemical problem -- molecule classification for chemistry or pharma companies -- and the input features are graphs whose sizes vary from molecule to molecule. I know this cannot simply be implemented using attributes, and that's why I asked about sparse matrix support. I hope this can be implemented and usable soon! :)

graykode commented 4 years ago

@jinserk Matrices with atypical shapes are difficult to store, regardless of sparsity. Sparse itself is not difficult to implement, since it is already supported through scipy (https://github.com/appier/h5sparse). However, it is very difficult to store a tensor with an undefined shape in HDF5.
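For context, this is roughly what the scipy-backed route looks like with h5sparse. A minimal sketch based on that project's README; the file and dataset names are illustrative:

```python
# Store a scipy CSR matrix in HDF5 via h5sparse (not part of matorage itself).
import numpy as np
import scipy.sparse as ss
import h5sparse

sparse_matrix = ss.csr_matrix(np.eye(1000, dtype="float32"))  # toy sparse data

with h5sparse.File("sparse.h5", "w") as h5f:
    h5f.create_dataset("matrix", data=sparse_matrix)

with h5sparse.File("sparse.h5", "r") as h5f:
    restored = h5f["matrix"][:]  # comes back as a scipy sparse matrix
```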

I have a question: to feed a PyTorch model, all input shapes must be the same, so I am curious how tensors of heterogeneous shape can be used as model input.

jinserk commented 4 years ago

Good question. Basically, I use a fixed shape for the model input. During training, I just pad the heterogeneously shaped inputs up to a fixed shape given by the maximum dimension values. I made a quick test storing my whole dataset as padded dense matrices, and the stored file was almost 100 times bigger, which is totally impractical.
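For the record, the padding step looks roughly like this. A minimal sketch; `MAX_NODES` and the square feature layout are placeholder assumptions about the graph features:

```python
# Pad a variable-size per-molecule feature matrix up to a fixed maximum shape
# so that every sample presents the same input shape to the model.
import torch
import torch.nn.functional as F

MAX_NODES = 128  # hypothetical maximum graph size across the dataset

def pad_to_fixed(features: torch.Tensor) -> torch.Tensor:
    """features: (n, n) matrix with n <= MAX_NODES."""
    n = features.size(0)
    # F.pad's tuple is (left, right, top, bottom) for the last two dimensions.
    return F.pad(features, (0, MAX_NODES - n, 0, MAX_NODES - n))

sample = torch.rand(37, 37)   # one molecule's features
fixed = pad_to_fixed(sample)  # shape (128, 128), zero-padded
```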

graykode commented 4 years ago

If so, how about storing in matorage the fixed tensor itself that goes into the model's input? This is the core idea of matorage. Also, using a high compression level (7~9) can help store sparse matrices more compactly.
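For illustration, the compression setting maps to the `compressor` option of `DataConfig`. A sketch following the usage pattern in the matorage README; the endpoint, credentials, dataset name, and attribute shapes are placeholders, and the exact parameter spellings should be checked against the docs:

```python
# Save the padded, fixed-shape tensors with a high zlib compression level,
# so the mostly-zero padded regions shrink substantially on disk.
from matorage import DataConfig, DataSaver

traindata_config = DataConfig(
    endpoint="127.0.0.1:9000",   # placeholder MinIO endpoint
    access_key="minio",          # placeholder credentials
    secret_key="miniosecretkey",
    dataset_name="molecules",    # hypothetical dataset name
    attributes=[
        ("image", "float32", (128, 128)),
        ("target", "int64", (1,)),
    ],
    compressor={"complevel": 9, "complib": "zlib"},  # high compression level
)
traindata_saver = DataSaver(config=traindata_config)
```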

jinserk commented 4 years ago

Thanks for the suggestion, @graykode! I will try that, since I don't know how well the compression will work. I once tried to store the fixed tensors, and the serialized file was almost 400 GB (using torch.save), while the file with sparse tensors was only 4 GB. I still hope storing sparse tensors will be supported in matorage soon. :)

graykode commented 4 years ago

@jinserk

I understand that HDF5 currently has no official support for sparse matrices. (This is not impossible to implement; there is actually an existing implementation, https://github.com/appier/h5sparse.) For that reason, the official PyTables documentation also recommends compression for sparse matrices.

In fact, many sources recommend using compression with the HDF5 format (https://stackoverflow.com/a/25678471/5350490). According to that answer, because the matrix is sparse, 512 MB of original data can be compressed down to about 4.5 KB. So, could you experiment with your 400 GB of data and tell me the final compressed size? Please try compression='gzip' with level=9 and let me know how small it gets!
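A quick way to run that experiment on a single sample with plain h5py. A minimal sketch; the array below just stands in for one mostly-zero padded tensor:

```python
# Measure how well gzip level 9 compresses a mostly-zero (padded) tensor.
import os
import numpy as np
import h5py

data = np.zeros((1024, 1024), dtype="float32")
data[:37, :37] = np.random.rand(37, 37)  # only a small dense block is nonzero

with h5py.File("compressed.h5", "w") as f:
    f.create_dataset("sample", data=data, compression="gzip", compression_opts=9)

print(data.nbytes, "bytes raw vs.", os.path.getsize("compressed.h5"), "bytes on disk")
```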

In addition, apart from this, we will add a mechanism for sparse matrices to our long-term plans!!