takuseno / d3rlpy

An offline deep reinforcement learning library
https://takuseno.github.io/d3rlpy
MIT License

[BUG] Enabling "use_batch_norm" in VectorEncoderFactory(..., use_batch_norm=True, ...) leads to error #370

Closed wenxuhaskell closed 8 months ago

wenxuhaskell commented 10 months ago

Describe the bug

When enabling "use_batch_norm" in VectorEncoderFactory(..., use_batch_norm=True, ...), an error occurs while building the model.

I found this error while trying to customize VectorEncoderFactory(). To make it easier to reproduce and investigate, I modified distributed_offline_training.py to trigger the same error.

Note that the error itself has nothing to do with distributed training; see the single-process sketch below.
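
For reference, a minimal single-process sketch (no torchrun) that should hit the same error. It is hypothetical in the details: it assumes the bundled pendulum dataset from d3rlpy.datasets.get_pendulum(), which matches the observation shape (3,) and action size 1 seen in the logs below; any dataset should do.

    import d3rlpy
    from d3rlpy.datasets import get_pendulum

    # Assumed dataset for illustration; the original script loads its own data.
    dataset, env = get_pendulum()

    encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(
        hidden_units=[128, 64, 32],
        use_batch_norm=True,  # the flag that triggers the error
    )

    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
        actor_encoder_factory=encoder_factory,
        critic_encoder_factory=encoder_factory,
    ).create(device="cuda:0")

    # The error occurs while the models are being built at the start of fit().
    cql.fit(dataset, n_steps=1000, n_steps_per_epoch=100)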

To Reproduce

In distributed_offline_training.py, make the changes below:

    print(f"device: {device}")

    my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[128,64,32], use_batch_norm=True)
    # setup algorithm
    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
        actor_encoder_factory=my_encoder_factory,
        critic_encoder_factory=my_encoder_factory
    ).create(device=device)

    # prepare dataset

Then run the command below (using a single process):

root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py

Terminal output:

root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 07:01.56 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 07:01.56 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [info ] Directory is created at d3rlpy_logs/CQL_20240118070156 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 07:01.56 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
Traceback (most recent call last):
  File "/home/code/xxx/distributed_offline_training.py", line 65, in <module>
    main()
  File "/home/code/xxx/distributed_offline_training.py", line 51, in main
    cql.fit(
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 400, in fit
    results = list(
              ^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 491, in fitter
    self.create_impl(observation_shape, action_size)
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/base.py", line 311, in create_impl
    self.inner_create_impl(observation_shape, action_size)
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/cql.py", line 137, in inner_create_impl
    policy = create_normal_policy(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/builders.py", line 170, in create_normal_policy
    hidden_size = compute_output_size([observation_shape], encoder)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 288, in compute_output_size
    y = encoder(inputs)
        ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 28, in __call__
    return super().__call__(x)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/models/torch/encoders.py", line 223, in forward
    return self._layers(x)
           ^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
           ^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2448, in batch_norm
    _verify_batch_size(input.size())
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/nn/functional.py", line 2416, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 162984) of binary: /root/.pyenv/versions/3.11.5/bin/python3.11
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.5/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

distributed_offline_training.py FAILED
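
The ValueError comes from BatchNorm1d being run in training mode on a single dummy sample: as the traceback shows, compute_output_size passes a batch of size 1 through the encoder (torch.Size([1, 128])), and batch statistics cannot be computed from one sample. A minimal PyTorch sketch of just that failure mode, independent of d3rlpy:

    import torch
    from torch import nn

    # BatchNorm1d cannot compute batch statistics from a single sample while
    # in training mode, which is exactly what the dummy forward pass hits.
    bn = nn.BatchNorm1d(128)
    bn.train()

    x = torch.zeros(1, 128)  # batch size 1, 128 features
    bn(x)  # ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 128])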

takuseno commented 10 months ago

@wenxuhaskell Hi, thanks for the issue! I think I've fixed this issue in this latest commit: https://github.com/takuseno/d3rlpy/commit/11adcee98811d9eb7e6b084009b1da9b20b91ac5 . If you pull the latest master, the issue should be resolved. Good catch!
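
For readers hitting the same thing, one common way to make such a dummy forward pass safe with BatchNorm (not necessarily what the linked commit does) is to run it in eval mode, or to use a dummy batch larger than one. A hypothetical helper sketching the idea (compute_output_size_safely is not a d3rlpy function):

    import torch
    from torch import nn

    def compute_output_size_safely(encoder: nn.Module, observation_shape) -> int:
        # Hypothetical helper, not d3rlpy's implementation: temporarily switch
        # to eval mode so BatchNorm uses running statistics instead of batch
        # statistics, then restore the previous training flag.
        was_training = encoder.training
        encoder.eval()
        with torch.no_grad():
            dummy = torch.zeros((1, *observation_shape))
            output = encoder(dummy)
        encoder.train(was_training)
        return int(output.shape[-1])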

wenxuhaskell commented 10 months ago

@takuseno The latest master causes a runtime error, but I am not sure whether it is a bug or an inconsistency in my environment (e.g., the PyTorch version).

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

The code for the experiment is still the same:

print(f"device: {device}")

    my_encoder_factory = d3rlpy.models.encoders.VectorEncoderFactory(hidden_units=[128,64,32], use_batch_norm=True)
    # setup algorithm
    cql = d3rlpy.algos.CQLConfig(
        actor_learning_rate=1e-3,
        critic_learning_rate=1e-3,
        alpha_learning_rate=1e-3,
        actor_encoder_factory=my_encoder_factory,
        critic_encoder_factory=my_encoder_factory
    ).create(device=device)

    # prepare dataset

The terminal output is pasted below:

root@:/home/code/xxx# torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 distributed_offline_training.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Start running on rank=0.
device: cuda:0
2024-01-18 14:58.51 [info ] Signatures have been automatically determined. action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]) reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)])
2024-01-18 14:58.51 [info ] Action-space has been automatically determined. action_space=<ActionSpace.CONTINUOUS: 1> distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Action size has been automatically determined. action_size=1 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] dataset info dataset_info=DatasetInfo(observation_signature=Signature(dtype=[dtype('float32')], shape=[(3,)]), action_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), reward_signature=Signature(dtype=[dtype('float32')], shape=[(1,)]), action_space=<ActionSpace.CONTINUOUS: 1>, action_size=1) distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [info ] Directory is created at d3rlpy_logs/CQL_20240118145851 distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.51 [debug ] Building models... distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [debug ] Models have been built. distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1)
2024-01-18 14:58.52 [info ] Parameters distributed=DistributedWorkerInfo(rank=0, backend='nccl', world_size=1) params={'observation_shape': [3], 'action_size': 1, 'config': {'type': 'cql', 'params': {'batch_size': 256, 'gamma': 0.99, 'observation_scaler': {'type': 'none', 'params': {}}, 'action_scaler': {'type': 'none', 'params': {}}, 'reward_scaler': {'type': 'none', 'params': {}}, 'actor_learning_rate': 0.001, 'critic_learning_rate': 0.001, 'temp_learning_rate': 0.0001, 'alpha_learning_rate': 0.001, 'actor_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'critic_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'temp_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'alpha_optim_factory': {'type': 'adam', 'params': {'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0, 'amsgrad': False}}, 'actor_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'critic_encoder_factory': {'type': 'vector', 'params': {'hidden_units': [128, 64, 32], 'activation': 'relu', 'use_batch_norm': True, 'dropout_rate': None, 'exclude_last_activation': False, 'last_activation': None}}, 'q_func_factory': {'type': 'mean', 'params': {'share_encoder': False}}, 'tau': 0.005, 'n_critics': 2, 'initial_temperature': 1.0, 'initial_alpha': 1.0, 'alpha_threshold': 10.0, 'conservative_weight': 5.0, 'n_action_samples': 10, 'soft_q_backup': False}}}
Epoch 1/10:   0%| | 0/1000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/code/xxx/distributed_offline_training.py", line 65, in <module>
    main()
  File "/home/code/xxx/distributed_offline_training.py", line 51, in main
    cql.fit(
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 400, in fit
    results = list(
              ^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 527, in fitter
    loss = self.update(batch)
           ^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 828, in update
    loss = self._impl.update(torch_batch, self._grad_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/torch_utility.py", line 365, in wrapper
    return f(self, *args, **kwargs)  # type: ignore
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/base.py", line 66, in update
    return self.inner_update(batch, grad_step)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/torch/ddpg_impl.py", line 119, in inner_update
    metrics.update(self.update_actor(batch, action))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/d3rlpy/algos/qlearning/torch/ddpg_impl.py", line 109, in update_actor
    loss.actor_loss.backward()
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [32]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 267792) of binary: /root/.pyenv/versions/3.11.5/bin/python3.11
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.11.5/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/.pyenv/versions/3.11.5/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

distributed_offline_training.py FAILED
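
This "modified by an inplace operation" class of error appears whenever a tensor saved for backward (here a parameter of size 32, matching the last hidden layer) is updated in place between the forward pass that used it and the corresponding backward(). A toy PyTorch sketch of the same failure mode, not the exact update sequence inside d3rlpy:

    import torch
    from torch import nn

    layer = nn.Linear(4, 4)
    optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)

    x = torch.randn(8, 4, requires_grad=True)
    y = layer(x)

    # First loss: backward, then an optimizer step updates layer.weight in place.
    loss1 = y.pow(2).mean()
    loss1.backward(retain_graph=True)
    optimizer.step()

    # A second backward through the same graph needs the old layer.weight,
    # whose version counter has changed, so autograd raises:
    # RuntimeError: one of the variables needed for gradient computation has
    # been modified by an inplace operation: ...
    y.sum().backward()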

wqp89324 commented 9 months ago

Updating to 2.4.0 fixed the bug for me~
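
If it helps anyone else, a quick way to confirm which version is installed (d3rlpy exposes the standard __version__ attribute):

    import d3rlpy

    # Print the installed version; 2.4.0 is the release that worked for me.
    print(d3rlpy.__version__)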

takuseno commented 8 months ago

@wqp89324 Thanks for the check! There is a chance that you could get an error, depending on datasets. Feel free to reopen this issue if there is any further discussion.