graykode / matorage

Matorage is a tensor (multidimensional matrix) object storage manager for deep learning frameworks (PyTorch, TensorFlow V2, Keras)
https://matorage.readthedocs.io

AssertionError(assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))) #21

Open jinserk opened 4 years ago

jinserk commented 4 years ago

Can I ask what this error means?

Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 65, in set_datasets
    dataset = MatorageAnnDataset(trainset_config, clear=True)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/torch/dataset.py", line 73, in __init__
    super(Dataset, self).__init__(config, **kwargs)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 80, in __init__
    self._init_download()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/data.py", line 189, in _init_download
    assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype))
AssertionError

graykode commented 4 years ago

Of course. Could you share all the metadata-related files (that is, the files inside the metadata directory)?

jinserk commented 4 years ago

Here is the only file in the metadata dir. The dataset name and host/port info have been redacted for security. Thank you! 6bd037556e8842d6.zip

jinserk commented 4 years ago

If I comment out the assertion, retrieving data from the minio server works anyway. However, I see lots of annoying log messages like:

2020/08/28 17:16:43 EDT [INFO] mlmanager.torch.workers (workers.py:56) set device cpu as rank 0                                                                                                                                                                                 
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:43 - INFO - matorage.utils - PID: 1074302 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074424 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:46 - INFO - matorage.utils - PID: 1074441 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074487 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:16:47 - INFO - matorage.utils - PID: 1074506 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:06 EDT [INFO] mlmanager.torch.workers (workers.py:316) train:  epoch 0001  lr 5.0000e-04  loss 0.191976                                                                                                                                                        
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078057 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078055 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078056 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch version 1.6.0 available.                                                                                                                                                                                  
08/28/2020 17:17:07 - INFO - matorage.utils - PID: 1078054 -  PyTorch Vision version 0.7.0 available.                                                                                                                                                                           
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:350) validate:  epoch 0001  loss 0.099826                                                                                                                                                                    
2020/08/28 17:17:08 EDT [INFO] mlmanager.torch.workers (workers.py:237) epoch 0001  ave_train_loss 0.191976  ave_val_loss 0.099826             

Can I turn them off? Sorry for lots of questions and bug reports.
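
One possible way to quiet these messages, as a minimal sketch: it assumes they go through Python's standard logging module under the matorage logger namespace (the name matorage.utils is taken from the log lines above) and that the library does not set its own level on that logger. The helper name quiet_logger is made up for illustration. Since the messages come from several PIDs, the same call would probably also have to run inside each worker process.

```python
import logging


def quiet_logger(name="matorage", level=logging.WARNING):
    """Raise the threshold of a logger so records below `level` are dropped."""
    logger = logging.getLogger(name)
    # Child loggers such as "matorage.utils" inherit this level
    # as long as they have no explicit level of their own.
    logger.setLevel(level)
    return logger


# Assumption: calling this before the dataset is created is early enough.
quiet_logger()
```

Warnings and errors from the same loggers would still be shown.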

graykode commented 4 years ago

@jinserk

Thank you for the detailed bug report!

While analyzing the bug you reported, I found a few more bugs related to the NAS backend. The first is that the sub-JSON files of the metadata were not read correctly, which I fixed by modifying the NAS list_objects function:

    def list_objects(self, bucket_name, prefix="", recursive=False):
        _foldername = os.path.join(self.path, bucket_name, prefix)
        if not recursive:
            objects = [
                os.path.join(prefix, f) for f in os.listdir(_foldername)
            ]
        else:
            objects = [
                os.path.join(dp, f) for dp, dn, fn in os.walk(_foldername) for f in fn
            ]
        return [Obj(o) for o in objects if o.startswith(prefix)]

The second one is related to assert len(self._object_file_mapper) == (len(self.merged_indexer) + len(self.merged_filetype)). This error has been confirmed to be caused by a mismatch between the metadata on the remote server and the cached metadata.

This 'caching' maps each minio key to the location of the downloaded file when the dataset is loaded. If you use the NAS setting, you don't actually need this caching.

Solution

One thing I'd like to ask: did you use an IPv4 address with the NAS settings?

jinserk commented 4 years ago

Yes, I used an IPv4 address. I'll check your solution ASAP. Thank you so much for the prompt solution!

graykode commented 4 years ago

@jinserk When using a NAS, you must use a local path rather than an IPv4 address.

For example:

from matorage import DataConfig

# NAS example
data_config = DataConfig(
    endpoint='/tmp/shared',
    dataset_name='mnist',
    additional={
        "framework" : "pytorch",
        "mode" : "training"
    },
    compressor={
        "complevel" : 0,
        "complib" : "zlib"
    },
    attributes=[
        ('image', 'float32', (28, 28)),
        ('target', 'int64', (1, ))
    ]
)

If you use an IPv4 address for the endpoint, the connection is established over HTTP. If you use a local path for the endpoint instead, it is much faster because it does not use HTTP (it just copies files from folder to folder). Also, if you use an HTTP endpoint in the dataloader, the data is downloaded to every node unconditionally. This code might be helpful: https://github.com/graykode/matorage/blob/master/matorage/data/data.py#L178
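
To make the distinction concrete, here is a small sketch that tells the two endpoint styles apart. It is only an illustration and not matorage's own logic: the helper name endpoint_kind and the rule "an absolute path means local access" are assumptions.

```python
import os


def endpoint_kind(endpoint: str) -> str:
    """Classify an endpoint string as a filesystem path or a network address."""
    # Assumption: an absolute filesystem path means NAS-style direct access;
    # anything else (for example "127.0.0.1:9000") is treated as host:port.
    return "local path" if os.path.isabs(endpoint) else "network address"


for endpoint in ("/tmp/shared", "127.0.0.1:9000"):
    print(endpoint, "->", endpoint_kind(endpoint))
```

A check like this could be run on a config before saving or loading, to confirm that the same addressing style is used throughout.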

jinserk commented 4 years ago

Hi @graykode, thanks for the suggestion. I didn't know such a 'local path' addressing method existed. I changed the addressing and it currently seems to work well with DataSaver. I'll check it with Dataset once all the data has been uploaded.

By the way, I have two questions related to this:

jinserk commented 4 years ago

It looks like I cannot use local_path addressing and IPv4 addressing at the same time:

Process TrainProcess-1:
Traceback (most recent call last):
  File "/home/jinserk/.pyenv/versions/3.8.5/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 202, in run
    self.setup()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 190, in setup
    self.set_dataloaders()
  File "/home/jinserk/kyu/kyumlm/mlmanager/torch/workers.py", line 134, in set_dataloaders
    trainset, valset = self.set_datasets()
  File "/home/jinserk/kyu/kyumlm/tddft/ann/workers.py", line 64, in set_datasets
    trainset_config = nas.DataConfig.from_json_file("train.json")
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 312, in from_json_file
    return cls(**config_dict)
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 131, in __init__
    self._check_all()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 140, in _check_all
    self._check_bucket()
  File "/home/jinserk/.pyenv/versions/kyumlm/lib/python3.8/site-packages/matorage/data/config.py", line 242, in _check_bucket
    raise ValueError(
ValueError: Already created endpoint(/mnt/hdd1/kyu/matorage) doesn't current endpoint str(127.0.0.1:9000) It may occurs permission denied error

graykode commented 4 years ago

@jinserk

Hi @graykode, thanks for the suggestion. I didn't know such a 'local path' addressing method existed. I changed the addressing and it currently seems to work well with DataSaver. I'll check it with Dataset once all the data has been uploaded.

By the way, I have two questions related to this:

  • When using local path addressing, does it go through the minio docker server or access the local path directly? I found that the old files and dirs in the path were owned by root, since the minio server runs with root permission. However, when I use local path addressing, newly created files and dirs are owned by my own user, which means it could be problematic when I share the newly uploaded dataset with other users on the same server. Am I correct?
  • If local path addressing accesses the files and dirs directly, can they also be explored or updated through a separate IPv4 connection? I mean, if I have a multi-node configuration for training a huge model, but I want to use a directory as the matorage storage on only one root node (namely the rank0 node here), can I set the rank0 node to local_path addressing while the other rank nodes use IPv4 addressing at the same time?

graykode commented 4 years ago

@jinserk

I found a solution for the first question: run the minio binary directly instead of using minio docker: https://github.com/minio/minio#gnulinux

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
# minio for background running
nohup ./minio gateway nas /home/nlkey2022/shared &

I don't know why we get a permission error with the minio docker NAS gateway. I will open an issue on the minio repository.

jinserk commented 4 years ago

@graykode I guess this is because minio runs as root under docker but with user permissions when run as a local binary. I guess if you run minio with root permission, the result will be the same:

sudo -H nohup ./minio gateway nas /home/nlkey2022/shared &

In my quick and humble opinion, we need to look at minio's set_bucket_policy to make the files or dirs public. Please check here, even though it's for minio-java rather than minio-py. Of course I could be wrong, and I don't want to mislead you.

graykode commented 4 years ago

@jinserk

I don't actually know minio's configuration in detail, so I will look into it. Thank you.

I'll post here when I find more options!! :)

graykode commented 3 years ago

A step-by-step look at why this error occurs:

  1. The dataset on minio was updated under the same dataset_name and dataset_additional.
  2. However, the JSON cached locally, that is, the files in the ~/.matorage folder, was not updated.
  3. For now, the files in the ~/.matorage folder must be deleted manually; logic to handle this automatically still needs to be implemented.
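
The manual cleanup in step 3 could be scripted roughly as follows. This is a minimal sketch, not part of matorage: the function name clear_cached_metadata and its dry_run flag are made up for illustration, and it assumes that everything inside ~/.matorage is cache that is safe to delete.

```python
import shutil
from pathlib import Path


def clear_cached_metadata(cache_dir=None, dry_run=True):
    """Delete everything inside the local cache folder; return the affected paths."""
    # Assumption: the cache lives in ~/.matorage unless another folder is given.
    cache_dir = Path(cache_dir) if cache_dir is not None else Path.home() / ".matorage"
    if not cache_dir.is_dir():
        return []
    removed = []
    for entry in sorted(cache_dir.iterdir()):
        removed.append(entry)
        if dry_run:
            continue  # only report what would be deleted
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
        else:
            entry.unlink()
    return removed
```

Calling clear_cached_metadata() with the default dry_run=True only lists what would be removed; passing dry_run=False actually deletes the entries, after which the metadata should be fetched again the next time the dataset is created.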