graykode / matorage

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow v2, Keras)
https://matorage.readthedocs.io

A few questions about usage #1

Open jinserk opened 4 years ago

jinserk commented 4 years ago

It's really fantastic! Thank you so much for sharing this project. I ran a quick test with a MinIO Docker container and confirmed it works really well, as expected. I'd like to ask a few questions about usage:

graykode commented 4 years ago

@jinserk Thank you for your interest in the project!

```python
traindata_saver({'image': image, 'target': target})
```

- If I add more data samples to an existing dataset (in the case where samples are added periodically, so the whole dataset has to be refreshed with the added samples), will it be okay to add to the dataset and save it?: If you simply add more data (append mode), it doesn't matter; just save using the existing config. However, refreshing the data is not currently implemented. If you want to refresh a dataset (which means removing its bucket), you should use the MinIO web console or the [mc command](https://github.com/minio/mc) from MinIO (`mc rb --force --dangerous local/<bucket_name>`); a scripted Python equivalent is sketched below. I will later implement a refresh method by adding a new option to the data saver, like this:
```python
traindata_saver({
    'image': image,
    'target': target
}, refresh=True)  # I will add this argument
```

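Until that `refresh` option lands, the bucket removal can also be scripted with the MinIO Python SDK instead of `mc`. A minimal sketch, assuming a local MinIO instance; the endpoint, credentials, and bucket name are placeholders:

```python
# Manual "refresh" workaround: empty the dataset's bucket and remove it,
# equivalent to `mc rb --force --dangerous local/<bucket_name>`.
from minio import Minio

client = Minio(
    "127.0.0.1:9000",        # placeholder endpoint
    access_key="minio",      # placeholder credentials
    secret_key="miniosecretkey",
    secure=False,
)

bucket = "my-dataset-bucket"  # hypothetical bucket backing the dataset
for obj in client.list_objects(bucket, recursive=True):
    client.remove_object(bucket, obj.object_name)
client.remove_bucket(bucket)  # bucket must be empty before removal
```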
Best regards, Tae Hwan

jinserk commented 4 years ago

@graykode Thank you very much for the detailed answers! It's really helpful, and very impressive.

My first question was actually about heterogeneous tensor shapes, i.e. the case where the image size in your MNIST example varies sample by sample. Practically, I'm working on a chemical problem -- molecule classification for chemistry or pharma companies -- and the input features are graphs whose sizes vary from molecule to molecule. I know this cannot simply be implemented using attributes, and that's why I asked about sparse matrix support. I hope this can be implemented and usable soon! :)

graykode commented 4 years ago

@jinserk Matrices with atypical shapes are difficult to store, regardless of sparsity. Sparse itself is not difficult to implement, since it is already supported through scipy (https://github.com/appier/h5sparse). However, it is very difficult to store a tensor with an undefined shape in HDF5.
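For context, this is roughly what the scipy-backed route looks like with h5sparse. A minimal sketch based on that project's README; the file and dataset names are illustrative:

```python
# Store a scipy CSR matrix in HDF5 via h5sparse (not part of matorage itself).
import numpy as np
import scipy.sparse as ss
import h5sparse

sparse_matrix = ss.csr_matrix(np.eye(1000, dtype="float32"))  # toy sparse data

with h5sparse.File("sparse.h5", "w") as h5f:
    h5f.create_dataset("matrix", data=sparse_matrix)

with h5sparse.File("sparse.h5", "r") as h5f:
    restored = h5f["matrix"][:]  # comes back as a scipy sparse matrix
```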

I have a question: to feed a PyTorch model, all input shapes must be the same, so I am curious how tensors of heterogeneous shape can be used as model input.

jinserk commented 4 years ago

Good question. Basically, I use a fixed shape for the model input. During training, I just pad the heterogeneously shaped inputs up to a fixed shape given by the maximum dimension values. I made a quick test storing my whole dataset as padded dense matrices, and the stored file was almost 100 times bigger, which is totally impractical.
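For the record, the padding step looks roughly like this. A minimal sketch; `MAX_NODES` and the square feature layout are placeholder assumptions about the graph features:

```python
# Pad a variable-size per-molecule feature matrix up to a fixed maximum shape
# so that every sample presents the same input shape to the model.
import torch
import torch.nn.functional as F

MAX_NODES = 128  # hypothetical maximum graph size across the dataset

def pad_to_fixed(features: torch.Tensor) -> torch.Tensor:
    """features: (n, n) matrix with n <= MAX_NODES."""
    n = features.size(0)
    # F.pad's tuple is (left, right, top, bottom) for the last two dimensions.
    return F.pad(features, (0, MAX_NODES - n, 0, MAX_NODES - n))

sample = torch.rand(37, 37)   # one molecule's features
fixed = pad_to_fixed(sample)  # shape (128, 128), zero-padded
```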

graykode commented 4 years ago

If so, how about storing in matorage the fixed tensor itself that goes into the model's input? This is the core idea of matorage. Also, using a high compression level (7~9) can help store sparse matrices more compactly.
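For illustration, the compression setting maps to the `compressor` option of `DataConfig`. A sketch following the usage pattern in the matorage README; the endpoint, credentials, dataset name, and attribute shapes are placeholders, and the exact parameter spellings should be checked against the docs:

```python
# Save the padded, fixed-shape tensors with a high zlib compression level,
# so the mostly-zero padded regions shrink substantially on disk.
from matorage import DataConfig, DataSaver

traindata_config = DataConfig(
    endpoint="127.0.0.1:9000",   # placeholder MinIO endpoint
    access_key="minio",          # placeholder credentials
    secret_key="miniosecretkey",
    dataset_name="molecules",    # hypothetical dataset name
    attributes=[
        ("image", "float32", (128, 128)),
        ("target", "int64", (1,)),
    ],
    compressor={"complevel": 9, "complib": "zlib"},  # high compression level
)
traindata_saver = DataSaver(config=traindata_config)
```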

jinserk commented 4 years ago

Thanks for the suggestion, @graykode! I will try that, since I don't know how well the compression will work. I once tried to store the fixed tensors, and the serialized file was almost 400 GB (using torch.save), while the file with sparse tensors was only 4 GB. I still hope storing sparse tensors will be supported in matorage soon. :)

graykode commented 4 years ago

@jinserk

I understand that HDF5 currently has no official support for sparse matrices. (This is not impossible to implement; there is actually an existing implementation, https://github.com/appier/h5sparse.) For that reason, the official PyTables documentation also recommends compression for sparse matrices.

In fact, many sources recommend using compression with the HDF5 format (https://stackoverflow.com/a/25678471/5350490). According to that answer, because the matrix is sparse, 512 MB of original data can be compressed down to about 4.5 KB. So, could you experiment with your 400 GB of data and tell me the final compressed size? Please try compression='gzip' with level=9 and let me know how small it gets!
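A quick way to run that experiment on a single sample with plain h5py. A minimal sketch; the array below just stands in for one mostly-zero padded tensor:

```python
# Measure how well gzip level 9 compresses a mostly-zero (padded) tensor.
import os
import numpy as np
import h5py

data = np.zeros((1024, 1024), dtype="float32")
data[:37, :37] = np.random.rand(37, 37)  # only a small dense block is nonzero

with h5py.File("compressed.h5", "w") as f:
    f.create_dataset("sample", data=data, compression="gzip", compression_opts=9)

print(data.nbytes, "bytes raw vs.", os.path.getsize("compressed.h5"), "bytes on disk")
```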

In addition, apart from this, we will add a mechanism for sparse matrices to our long-term plans!!