graykode / matorage

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow V2, Keras)
https://matorage.readthedocs.io

no metadata dir in a compressed bucket #18

Open jinserk opened 4 years ago

jinserk commented 4 years ago

Hi again,

Sorry for bothering you with several questions and bug reports, but this one looks critical. I made a compressed data bucket and the data appears to be stored correctly, but when I retrieve the dataset it has length 0, which leads to the following error:

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 88, in set_datasets
    print(dataset[0])
  File "/home/jinserk/kyu/kyumlm/tddft/ann/dataset.py", line 35, in __getitem__
    x = super().__getitem__(index)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 81, in __getitem__
    return self._get_item_with_download(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 89, in _get_item_with_download
    _objectname, _relative_index = self._find_object(idx)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 128, in _find_object
    _key = self.end_indices[_key_idx]
IndexError: list index out of range

I checked briefly and found that the bucket has no metadata directory, so the meta info of the dataset cannot be read. Can you fix this error? I have installed the latest code from the master branch.

jinserk commented 4 years ago

One more minor error I found: when I export the DataConfig to JSON, the itemsize info of a DataAttribute is not exported. Of course, I can add it manually.

graykode commented 4 years ago

@jinserk

Not at all, questions about this project don't bother me. Rather, I am happy that they help improve the project.

First question: If there is no metadata at all, it means that the save was accidentally interrupted partway through. It therefore seems necessary to add a metadata recovery function for this case. Or maybe you forgot to call datasaver.disconnect.

Second question: Yes, itemsize is missing. I'll add it as soon as possible.

To be fixed

So, for the first question, please double-check that your code is written correctly before I modify this part.
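
For the first question, one way to make sure the metadata is always written is to wrap the save loop in try/finally. The sketch below is only an illustration: FakeSaver is a hypothetical stand-in, not matorage's DataSaver, and it assumes (as discussed in this thread) that the metadata is flushed only when disconnect() is called.

```python
class FakeSaver:
    """Hypothetical stand-in: metadata is written only by disconnect()."""

    def __init__(self):
        self.objects = []     # data chunks "uploaded" so far
        self.metadata = None  # stays None until disconnect() runs

    def __call__(self, batch):
        self.objects.append(batch)

    def disconnect(self):
        # Write the index a reader would need to locate items.
        self.metadata = {"end_indices": [len(self.objects)]}


def save_all(saver, batches):
    try:
        for batch in batches:
            saver(batch)
    finally:
        # Runs even if a batch raises, so the metadata is never skipped.
        saver.disconnect()


saver = FakeSaver()
save_all(saver, [[1, 2], [3, 4]])
print(saver.metadata)
```

If disconnect() is skipped, metadata stays None here, which mirrors a bucket that has data objects but nothing describing them.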

jinserk commented 4 years ago

@graykode You're right! I forgot to call datasaver.disconnect. Thank you so much! By the way, couldn't disconnect be called automatically from DataSaver.__del__()?

graykode commented 4 years ago

@jinserk Thanks for the great suggestion.

As you suggested, adding it to DataSaver.__del__() doesn't seem to cause any problems in terms of concurrency (multiprocessing). I will incorporate this. Thanks!

graykode commented 4 years ago

@jinserk

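Python's destructor (__del__) is not triggered at the point where use of the object ends; it runs only when the object is garbage collected, and the timing of that is not guaranteed. Therefore, it seems better to manage this with Python's context manager protocol (__enter__, __exit__):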

with DataSaver(...) as datasaver:
    datasaver(...)
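
A minimal sketch of how such a context manager could behave, assuming only that the saver is callable and exposes disconnect(). Both ManagedSaver and StubSaver are hypothetical names used for illustration, not part of matorage:

```python
class StubSaver:
    """Hypothetical saver used only to make the sketch runnable."""

    def __init__(self):
        self.saved = []
        self.disconnected = False

    def __call__(self, batch):
        self.saved.append(batch)

    def disconnect(self):
        self.disconnected = True


class ManagedSaver:
    """Hypothetical wrapper that calls disconnect() when the block exits."""

    def __init__(self, inner):
        self.inner = inner

    def __enter__(self):
        # The object bound by "as" is the wrapped saver itself.
        return self.inner

    def __exit__(self, exc_type, exc_value, traceback):
        # Called on normal exit and when an exception leaves the block.
        self.inner.disconnect()
        return False  # do not suppress exceptions


stub = StubSaver()
with ManagedSaver(stub) as datasaver:
    datasaver({"x": [1, 2, 3]})

print(stub.disconnected)
```

Unlike __del__, __exit__ runs deterministically at the end of the with block, so the caller cannot forget the disconnect step.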

jinserk commented 4 years ago

Looks great! I thought that __del__ was called when the instance is destructed, but it isn't. Sorry for the confusion!