Closed: wenxuhaskell closed this issue 8 months ago.
@wenxuhaskell Hi, thanks for the issue! I think I've fixed this issue at this latest commit: https://github.com/takuseno/d3rlpy/commit/11adcee98811d9eb7e6b084009b1da9b20b91ac5 . If you pull the latest master, the issue should be resolved. Good catch!
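If you want to try it before the next release, installing straight from the repository should pick up that commit (assuming pip can build the package from source in your environment):

pip install git+https://github.com/takuseno/d3rlpy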
@takuseno The latest master raises a runtime error, but I am not sure whether it is a bug or an inconsistency in my environment (e.g., the PyTorch version).
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
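Following the hint in the message, I believe the offending operation can be located by enabling autograd anomaly detection before calling fit(); a minimal sketch (not part of my original script):

import torch

# Make the backward error report the forward operation that produced the
# tensor which was later modified in-place.
torch.autograd.set_detect_anomaly(True)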
The code for the experiment is still the same:
print(f"device: {device}")
my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[128,64,32], use_batch_norm=True)
# setup algorithm
cql = d3rlpy.algos.CQLConfig(
actor_learning_rate=1e-3,
critic_learning_rate=1e-3,
alpha_learning_rate=1e-3,
actor_encoder_factory=my_encoder_factory,
critic_encoder_factory=my_encoder_factory
).create(device=device)
# prepare dataset
The terminal output is pasted below:
root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 14:58.51 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 14:58.51 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Directory is created at d3rlpy_logs/CQL_20240118145851 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [debug ] Models have been built. distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [info ] Parameters distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) params={'observation_shape': [3], 'action_size': 1, 'config': {'type': 'cql', 'params': {'batch_size': 256, 'gamma': 0.99, 'observation_scaler': {'type': 'none', 'params': {}}, 'action_scaler': {'type': 'none', 'params': {}}, 'reward_scaler': {'type': 'none', 'params': {}}, 'actor_learning_rate': 0.001, 'critic_learning_rate': 0.001, 'temp_learning_rate': 0.0001, 'alpha_learning_rate': 0.001, 'actor_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'critic_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'temp_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'alpha_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'actor_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'critic_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'q_func_factory': {'type': 'mean', 'params': {'share_encoder': False}}, 'tau': 0.005, 'n_critics': 2, 'initial_temperature': 1.0, 'initial_alpha': 1.0, 'alpha_threshold': 10.0, 'conservative_weight': 5.0, 'n_action_samples': 10, 'soft_q_backup': False}}}
Epoch 1/10: 0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/code/xxx/distributed_offline_training.py", line 65, in
distributed_offline_training.py FAILED
Updating to 2.4.0 fixed the bug for me~
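For anyone else hitting this, the upgrade should just be the usual pip step, e.g.:

pip install -U d3rlpy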
@wqp89324 Thanks for the check! There is a chance that you could get an error, depending on datasets. Feel free to reopen this issue if there is any further discussion.
Describe the bug
When enabling "use_batch_norm" in VectorEncoderFactory(..., use_batch_norm=True, ...), an error occurs while the model is being built.
I found this error while trying to customize VectorEncoderFactory(). To make it easier to reproduce and investigate, I modified distributed_offline_learning.py to trigger the same error; the error itself has nothing to do with distributed training.
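Judging from the traceback below, compute_output_size() pushes a single dummy observation through the encoder while the module is still in training mode, and BatchNorm1d rejects batches of size 1 in that mode. A minimal PyTorch-only sketch of that behaviour (illustration only, not d3rlpy code):

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(128)

# Training mode: batch statistics cannot be computed from a single sample.
bn.train()
try:
    bn(torch.zeros(1, 128))
except ValueError as e:
    print(e)  # Expected more than 1 value per channel when training, got input size torch.Size([1, 128])

# Eval mode uses running statistics, so a single sample is fine.
bn.eval()
print(bn(torch.zeros(1, 128)).shape)  # torch.Size([1, 128])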
To Reproduce
In distributed_offline_learning.py, make the changes shown below.
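(This is essentially the same snippet as in my later comment; repeated here as a sketch so the reproduction is self-contained. The learning rates and hidden sizes are just the arbitrary values I happened to use.)

my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(
    hidden_units=[128, 64, 32],
    use_batch_norm=True,
)

cql = d3rlpy.algos.CQLConfig(
    actor_learning_rate=1e-3,
    critic_learning_rate=1e-3,
    alpha_learning_rate=1e-3,
    actor_encoder_factory=my_encoder_factory,
    critic_encoder_factory=my_encoder_factory,
).create(device=device)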
Then run the command below (using a single process only):
root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
Terminal output:
root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 07:01.56 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 07:01.56 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Directory is created at d3rlpy_logs/CQL_20240118070156 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
Traceback (most recent call last):
File "/home/code/xxx/distributed_offline_training.py", line 65, in <module>
main()
File "/home/code/xxx/distributed_offline_training.py", line 51, in main
cql.fit(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 400, in fit
results = list(
^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 491, in fitter
self.create_impl(observation_shape, action_size)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/base.py", line 311, in create_impl
self.inner_create_impl(observation_shape, action_size)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/cql.py", line 137, in inner_create_impl
policy = create_normal_policy(
^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/builders.py", line 170, in create_normal_policy
hidden_size = compute_output_size([observation_shape], encoder)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 288, in compute_output_size
y = encoder(inputs)
^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 28, in call
return super().call(x)
^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 223, in forward
return self._layers(x)
^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
input = module(input)
^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2448, in batch_norm
_verify_batch_size(input.size())
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2416, in _verify_batch_size
raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 162984) of binary: /root/.pyenv/versions/3.11.5/bin/python3.11
Traceback (most recent call last):
File "/root/.pyenv/versions/3.11.5/bin/torchrun", line 8, in
sys.exit(main())
^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
distributed_offline_training.py FAILED