[RLLib] New API stack has several bugs in steps trained/sampled reporting.

What happened + What you expected to happen

I ran the custom_env.py example and saw num_env_steps_trained = 0 in the output.
I also found this discus post on a similar issue: https://discuss.ray.io/t/num-env-agent-steps-trained-0-even-though-steps-sampled/11730/2
The reward seems to increase, so presumably it does actually train though?
Here is the full output from running the example:
$ RLLIB_NUM_GPUS=0 python ray_demo.py
2023-08-17 11:19:00,400 WARNING deprecation.py:50 -- DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
Running with following CLI options: Namespace(run='PPO', framework='torch', as_test=False, stop_iters=50, stop_timesteps=100000, stop_reward=0.1, no_tune=False, local_mode=False)
2023-08-17 11:19:02,010 INFO worker.py:1621 -- Started a local Ray instance.
2023-08-17 11:19:02,824 WARNING deprecation.py:50 -- DeprecationWarning: `build_tf_policy` has been deprecated. This will raise an error in the future!
2023-08-17 11:19:02,825 WARNING deprecation.py:50 -- DeprecationWarning: `build_policy_class` has been deprecated. This will raise an error in the future!
2023-08-17 11:19:02,844 WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
/home/alex/.local/share/virtualenvs/ray-YIb7jMq2/lib/python3.10/site-packages/gymnasium/spaces/box.py:127: UserWarning: WARN: Box bound precision lowered by casting to float32
  logger.warn(f"Box bound precision lowered by casting to {self.dtype}")
/home/alex/.local/share/virtualenvs/ray-YIb7jMq2/lib/python3.10/site-packages/gymnasium/utils/passive_env_checker.py:141: UserWarning: WARN: The obs returned by the `reset()` method was expecting numpy array dtype to be float32, actual type: float64
  logger.warn(
/home/alex/.local/share/virtualenvs/ray-YIb7jMq2/lib/python3.10/site-packages/gymnasium/utils/passive_env_checker.py:165: UserWarning: WARN: The obs returned by the `reset()` method is not within the observation space.
  logger.warn(f"{pre} is not within the observation space.")
2023-08-17 11:19:02,864 WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
2023-08-17 11:19:02,873 WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
Training automatically with Ray Tune
2023-08-17 11:19:02,875 INFO tune.py:666 -- [output] This will use the new output engine with verbosity 1. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. For more information, please see https://github.com/ray-project/ray/issues/36949
2023-08-17 11:19:02,875 WARNING syncer.py:260 -- You are using remote storage, but you don't have `fsspec` installed. This can lead to inefficient syncing behavior. To avoid this, install fsspec with `pip install fsspec`. Depending on your remote storage provider, consider installing the respective fsspec-package (see https://github.com/fsspec).
╭────────────────────────────────────────────────────────╮
│ Configuration for experiment     PPO                   │
├────────────────────────────────────────────────────────┤
│ Search algorithm                 BasicVariantGenerator │
│ Scheduler                        FIFOScheduler         │
│ Number of trials                 1                     │
╰────────────────────────────────────────────────────────╯

View detailed results here: /home/alex/ray_results/PPO
To visualize your results with TensorBoard, run: `tensorboard --logdir /home/alex/ray_results/PPO`

2023-08-17 11:19:02,889 WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
Trial status: 1 PENDING
Current time: 2023-08-17 11:19:03. Total running time: 0s
Logical resource usage: 2.0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:G)
╭───────────────────────────────────────────╮
│ Trial name                       status   │
├───────────────────────────────────────────┤
│ PPO_SimpleCorridor_65ec2_00000   PENDING  │
╰───────────────────────────────────────────╯

(pid=1954927) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
(PPO pid=1954927) 2023-08-17 11:19:05,083       WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`.
(PPO pid=1954927) 2023-08-17 11:19:05,083       WARNING algorithm_config.py:656 -- Cannot create PPOConfig from given `config_dict`! Property __stdout_file__ not supported.
(RolloutWorker pid=1954982) 2023-08-17 11:19:07,224     WARNING env.py:162 -- Your env doesn't have a .spec.max_episode_steps attribute. Your horizon will default to infinity, and your environment will not be reset.
(RolloutWorker pid=1954982) /home/alex/.local/share/virtualenvs/ray-YIb7jMq2/lib/python3.10/site-packages/gymnasium/spaces/box.py:227: UserWarning: WARN: Casting input x to numpy array.
(RolloutWorker pid=1954982)   logger.warn("Casting input x to numpy array.")
(RolloutWorker pid=1954982) 2023-08-17 11:19:07,231     WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=1954982) 2023-08-17 11:19:07,231     WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=1954982) 2023-08-17 11:19:07,231     WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
(RolloutWorker pid=1954982) 2023-08-17 11:19:07,231     WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
Training started with configuration:
╭───────────────────────────────────────────────────────────────────────────╮
│ Training config                                                           │
├───────────────────────────────────────────────────────────────────────────┤
│ _AlgorithmConfig__prior_exploration_config/type        StochasticSampling │
│ _disable_action_flattening                                          False │
│ _disable_execution_plan_api                                          True │
│ _disable_initialize_loss_from_dummy_batch                           False │
│ _disable_preprocessor_api                                           False │
│ _enable_learner_api                                                  True │
│ _enable_rl_module_api                                                True │
│ _fake_gpus                                                          False │
│ _is_atari                                                                 │
│ _learner_class                                                            │
│ _tf_policy_handles_more_than_one_loss                               False │
│ action_mask_key                                               action_mask │
│ action_space                                                              │
│ actions_in_input_normalized                                         False │
│ always_attach_evaluation_results                                    False │
│ auto_wrap_old_gym_envs                                               True │
│ batch_mode                                              truncate_episodes │
│ callbacks                                            ...efaultCallbacks'> │
│ checkpoint_trainable_policies_only                                  False │
│ clip_actions                                                        False │
│ clip_param                                                            0.3 │
│ clip_rewards                                                              │
│ compress_observations                                               False │
│ count_steps_by                                                  env_steps │
│ create_env_on_driver                                                False │
│ custom_eval_function                                                      │
│ delay_between_worker_restarts_s                                       60. │
│ disable_env_checking                                                False │
│ eager_max_retraces                                                     20 │
│ eager_tracing                                                        True │
│ enable_async_evaluation                                             False │
│ enable_connectors                                                    True │
│ enable_tf1_exec_eagerly                                             False │
│ entropy_coeff                                                          0. │
│ entropy_coeff_schedule                                                    │
│ env                                                  ....SimpleCorridor'> │
│ env_config/corridor_length                                              5 │
│ env_runner_cls                                                            │
│ env_task_fn                                                               │
│ evaluation_config                                                         │
│ evaluation_duration                                                    10 │
│ evaluation_duration_unit                                         episodes │
│ evaluation_interval                                                       │
│ evaluation_num_workers                                                  0 │
│ evaluation_parallel_to_training                                     False │
│ evaluation_sample_timeout_s                                          180. │
│ explore                                                              True │
│ export_native_model_files                                           False │
│ fake_sampler                                                        False │
│ framework                                                           torch │
│ gamma                                                                0.99 │
│ grad_clip                                                                 │
│ grad_clip_by                                                  global_norm │
│ ignore_worker_failures                                              False │
│ in_evaluation                                                       False │
│ input                                                             sampler │
│ keep_per_episode_custom_metrics                                     False │
│ kl_coeff                                                              0.2 │
│ kl_target                                                            0.01 │
│ lambda                                                                 1. │
│ local_gpu_idx                                                           0 │
│ local_tf_session_args/inter_op_parallelism_threads                      8 │
│ local_tf_session_args/intra_op_parallelism_threads                      8 │
│ log_level                                                            WARN │
│ log_sys_usage                                                        True │
│ logger_config                                                             │
│ logger_creator                                                            │
│ lr                                                                0.00005 │
│ lr_schedule                                                               │
│ max_num_worker_restarts                                              1000 │
│ max_requests_in_flight_per_sampler_worker                               2 │
│ metrics_episode_collection_timeout_s                                  60. │
│ metrics_num_episodes_for_smoothing                                    100 │
│ min_sample_timesteps_per_iteration                                      0 │
│ min_time_s_per_iteration                                                  │
│ min_train_timesteps_per_iteration                                       0 │
│ model/_disable_action_flattening                                    False │
│ model/_disable_preprocessor_api                                     False │
│ model/_time_major                                                   False │
│ model/_use_default_native_models                                       -1 │
│ model/always_check_shapes                                           False │
│ model/attention_dim                                                    64 │
│ model/attention_head_dim                                               32 │
│ model/attention_init_gru_gate_bias                                    2.0 │
│ model/attention_memory_inference                                       50 │
│ model/attention_memory_training                                        50 │
│ model/attention_num_heads                                               1 │
│ model/attention_num_transformer_units                                   1 │
│ model/attention_position_wise_mlp_dim                                  32 │
│ model/attention_use_n_prev_actions                                      0 │
│ model/attention_use_n_prev_rewards                                      0 │
│ model/conv_activation                                                relu │
│ model/conv_filters                                                        │
│ model/custom_action_dist                                                  │
│ model/custom_model                                                        │
│ model/custom_preprocessor                                                 │
│ model/dim                                                              84 │
│ model/encoder_latent_dim                                                  │
│ model/fcnet_activation                                               tanh │
│ model/fcnet_hiddens                                            [256, 256] │
│ model/framestack                                                     True │
│ model/free_log_std                                                  False │
│ model/grayscale                                                     False │
│ model/lstm_cell_size                                                  256 │
│ model/lstm_use_prev_action                                          False │
│ model/lstm_use_prev_action_reward                                      -1 │
│ model/lstm_use_prev_reward                                          False │
│ model/max_seq_len                                                      20 │
│ model/no_final_linear                                               False │
│ model/post_fcnet_activation                                          relu │
│ model/post_fcnet_hiddens                                               [] │
│ model/use_attention                                                 False │
│ model/use_lstm                                                      False │
│ model/vf_share_layers                                               False │
│ model/zero_mean                                                      True │
│ normalize_actions                                                    True │
│ num_consecutive_worker_failures_tolerance                             100 │
│ num_cpus_for_driver                                                     1 │
│ num_cpus_per_learner_worker                                             1 │
│ num_cpus_per_worker                                                     1 │
│ num_envs_per_worker                                                     1 │
│ num_gpus                                                                0 │
│ num_gpus_per_learner_worker                                             0 │
│ num_gpus_per_worker                                                     0 │
│ num_learner_workers                                                     0 │
│ num_sgd_iter                                                           30 │
│ num_workers                                                             1 │
│ observation_filter                                               NoFilter │
│ observation_fn                                                            │
│ observation_space                                                         │
│ offline_sampling                                                    False │
│ ope_split_batch_by_episode                                           True │
│ output                                                                    │
│ output_compress_columns                                ['obs', 'new_obs'] │
│ output_max_file_size                                             67108864 │
│ placement_strategy                                                   PACK │
│ policies/default_policy                              ...None, None, None) │
│ policies_to_train                                                         │
│ policy_map_cache                                                       -1 │
│ policy_map_capacity                                                   100 │
│ policy_mapping_fn                                    ...t 0x7febe6271360> │
│ policy_states_are_swappable                                         False │
│ postprocess_inputs                                                  False │
│ preprocessor_pref                                                deepmind │
│ recreate_failed_workers                                             False │
│ remote_env_batch_wait_ms                                                0 │
│ remote_worker_envs                                                  False │
│ render_env                                                          False │
│ replay_sequence_length                                                    │
│ restart_failed_sub_environments                                     False │
│ rl_module_spec                                                            │
│ rollout_fragment_length                                              auto │
│ sample_async                                                        False │
│ sample_collector                                     ...leListCollector'> │
│ sampler_perf_stats_ema_coef                                               │
│ seed                                                                      │
│ sgd_minibatch_size                                                    128 │
│ shuffle_buffer_size                                                     0 │
│ shuffle_sequences                                                    True │
│ simple_optimizer                                                       -1 │
│ sync_filters_on_rollout_workers_timeout_s                             60. │
│ synchronize_filters                                                    -1 │
│ tf_session_args/allow_soft_placement                                 True │
│ tf_session_args/device_count/CPU                                        1 │
│ tf_session_args/gpu_options/allow_growth                             True │
│ tf_session_args/inter_op_parallelism_threads                            2 │
│ tf_session_args/intra_op_parallelism_threads                            2 │
│ tf_session_args/log_device_placement                                False │
│ torch_compile_learner                                               False │
│ torch_compile_learner_dynamo_backend                             inductor │
│ torch_compile_learner_dynamo_mode                                         │
│ torch_compile_learner_what_to_compile                ...ile.FORWARD_TRAIN │
│ torch_compile_worker                                                False │
│ torch_compile_worker_dynamo_backend                                onnxrt │
│ torch_compile_worker_dynamo_mode                                          │
│ train_batch_size                                                     4000 │
│ update_worker_filter_stats                                           True │
│ use_critic                                                           True │
│ use_gae                                                              True │
│ use_kl_loss                                                          True │
│ use_worker_filter_stats                                              True │
│ validate_workers_after_construction                                  True │
│ vf_clip_param                                                         10. │
│ vf_loss_coeff                                                          1. │
│ vf_share_layers                                                        -1 │
│ worker_cls                                                             -1 │
│ worker_health_probe_timeout_s                                          60 │
│ worker_restore_timeout_s                                             1800 │
╰───────────────────────────────────────────────────────────────────────────╯

Training finished iteration 1 at 2023-08-17 11:19:15. Total running time: 12s
╭────────────────────────────────────────────────╮
│ Training result                                │
├────────────────────────────────────────────────┤
│ episodes_total                              80 │
│ num_env_steps_sampled                     4000 │
│ num_env_steps_trained                        0 │
│ sampler_results/episode_len_mean       49.6375 │
│ sampler_results/episode_reward_mean   -3.93069 │
╰────────────────────────────────────────────────╯

Training finished iteration 2 at 2023-08-17 11:19:23. Total running time: 20s
╭────────────────────────────────────────────────╮
│ Training result                                │
├────────────────────────────────────────────────┤
│ episodes_total                             251 │
│ num_env_steps_sampled                     8000 │
│ num_env_steps_trained                        0 │
│ sampler_results/episode_len_mean       23.5556 │
│ sampler_results/episode_reward_mean   -1.35414 │
╰────────────────────────────────────────────────╯

Training finished iteration 3 at 2023-08-17 11:19:31. Total running time: 28s
╭──────────────────────────────────────────────────╮
│ Training result                                  │
├──────────────────────────────────────────────────┤
│ episodes_total                               586 │
│ num_env_steps_sampled                      12000 │
│ num_env_steps_trained                          0 │
│ sampler_results/episode_len_mean         11.8627 │
│ sampler_results/episode_reward_mean   -0.0530459 │
╰──────────────────────────────────────────────────╯

Trial status: 1 RUNNING
Current time: 2023-08-17 11:19:33. Total running time: 30s
Logical resource usage: 2.0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                       status       iter     total time (s)      ts       reward     episode_reward_max     episode_reward_min     episode_len_mean     episodes_this_iter │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_SimpleCorridor_65ec2_00000   RUNNING         3            23.7367   12000   -0.0530459                1.55755               -3.55418              11.8627                    335 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Training finished iteration 4 at 2023-08-17 11:19:39. Total running time: 36s
╭────────────────────────────────────────────────╮
│ Training result                                │
├────────────────────────────────────────────────┤
│ episodes_total                            1058 │
│ num_env_steps_sampled                    16000 │
│ num_env_steps_trained                        0 │
│ sampler_results/episode_len_mean       8.51059 │
│ sampler_results/episode_reward_mean   0.246054 │
╰────────────────────────────────────────────────╯

Training saved a checkpoint for iteration 4 at: /home/alex/ray_results/PPO/PPO_SimpleCorridor_65ec2_00000_0_2023-08-17_11-19-02/checkpoint_000004

Training completed after 4 iterations at 2023-08-17 11:19:39. Total running time: 36s

Trial status: 1 TERMINATED
Current time: 2023-08-17 11:19:39. Total running time: 36s
Logical resource usage: 2.0/32 CPUs, 0/4 GPUs (0.0/1.0 accelerator_type:G)
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name                       status         iter     total time (s)      ts     reward     episode_reward_max     episode_reward_min     episode_len_mean     episodes_this_iter │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_SimpleCorridor_65ec2_00000   TERMINATED        4            31.9572   16000   0.246054                1.59666               -3.47835              8.51059                    472 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

(pid=1954982) DeprecationWarning: `DirectStepOptimizer` has been deprecated. This will raise an error in the future!
(PPO pid=1954927) 2023-08-17 11:19:07,243       WARNING algorithm_config.py:2558 -- Setting `exploration_config={}` because you set `_enable_rl_module_api=True`. When RLModule API are enabled, exploration_config can not be set. If you want to implement custom exploration behaviour, please modify the `forward_exploration` method of the RLModule at hand. On configs that have a default exploration config, this must be done with `config.exploration_config={}`. [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(PPO pid=1954927) 2023-08-17 11:19:07,248       WARNING deprecation.py:50 -- DeprecationWarning: `ValueNetworkMixin` has been deprecated. This will raise an error in the future!
(PPO pid=1954927) 2023-08-17 11:19:07,248       WARNING deprecation.py:50 -- DeprecationWarning: `LearningRateSchedule` has been deprecated. This will raise an error in the future!
(PPO pid=1954927) 2023-08-17 11:19:07,248       WARNING deprecation.py:50 -- DeprecationWarning: `EntropyCoeffSchedule` has been deprecated. This will raise an error in the future!
(PPO pid=1954927) 2023-08-17 11:19:07,248       WARNING deprecation.py:50 -- DeprecationWarning: `KLCoeffMixin` has been deprecated. This will raise an error in the future!
Versions / Dependencies

Python 3.10 on Ubuntu via WSL
Ray version is 2.6.3
Reproduction script

Reproduced with https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py
Issue Severity

Unclear
ray-project / ray

[RLLib] New API stack has several bugs in steps trained/sampled reporting. #38560

What happened + What you expected to happen

Versions / Dependencies

Reproduction script

Issue Severity