ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[RLlib] Providing `minibatch_size` to `APPOConfig.training` leads to an `AttributeError` #43464

Open grizzlybearg opened 4 months ago

grizzlybearg commented 4 months ago

What happened + What you expected to happen

I'd like to customize my APPOConfig, specifically the minibatch_size parameter of APPOConfig.training(). However, providing minibatch_size to the training() method raises the following error:

File "/opt/conda/envs/nuvoenv/lib/python3.11/site-packages/ray/rllib/algorithms/appo/appo.py", line 277, in init (APPO pid=60840) super().init(config, *args, **kwargs) (APPO pid=60840) File "/opt/conda/envs/nuvoenv/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 413, in init (APPO pid=60840) config = default_config.update_from_dict(config) (APPO pid=60840) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APPO pid=60840) File "/opt/conda/envs/nuvoenv/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm_config.py", line 690, in update_from_dict (APPO pid=60840) setattr(self, key, value) (APPO pid=60840) File "/opt/conda/envs/nuvoenv/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm_config.py", line 3500, in setattr (APPO pid=60840) super().setattr(key, value) (APPO pid=60840) AttributeError: property 'minibatch_size' of 'APPOConfig' object has no setter

I'd like clarification on whether this is by design, given that APPOConfig subclasses ImpalaConfig, which defines a minibatch_size property: https://github.com/ray-project/ray/blob/10009390b0ff61875b45cbab75052a89332b528e/rllib/algorithms/impala/impala.py#L455-L466

In addition, minibatch_size is also a parameter of the ImpalaConfig.training() method, so I can't tell why the minibatch_size key in kwargs is not forwarded to ImpalaConfig.training(), as shown in https://github.com/ray-project/ray/blob/10009390b0ff61875b45cbab75052a89332b528e/rllib/algorithms/appo/appo.py#L196. A minimal sketch of the failing path follows below.
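
For reference, the failure can be reproduced outside of Tune through the same dict round-trip shown in the traceback (a minimal sketch, assuming Ray 2.9.x; update_from_dict() is where PBT-style dict updates end up):

```python
from ray.rllib.algorithms.appo import APPOConfig

# update_from_dict() calls setattr() for every key in the dict and hits the
# read-only `minibatch_size` property inherited from ImpalaConfig.
APPOConfig().update_from_dict({"minibatch_size": 64})
# AttributeError: property 'minibatch_size' of 'APPOConfig' object has no setter
```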

Versions / Dependencies

Ray 2.9.3
Python 3.11.7

Reproduction script

```python
# Assumed imports (omitted in the original snippet): `sample` refers to Tune's
# search-space module (choice, loguniform, qlograndint, ...).
from ray.rllib.algorithms.appo import APPOConfig
from ray.tune.search import sample


class HPRanges:

    def __init__(self):
        self.train_batch_size = sample.choice([256, 512, 1024])
        self.gamma = sample.loguniform(0.8, 0.9997)
        self.num_sgd_iter = sample.choice([10, 15, 20])
        self.vf_clip_param = sample.loguniform(0.02, 0.2)
        self.lr = sample.loguniform(5e-6, 5e-3)
        self.kl_coeff = sample.loguniform(0.0005, 0.01)
        self.kl_target = sample.loguniform(0.0005, 0.003)
        self.lambda_ = sample.loguniform(0.90, 0.9999)
        self.clip_param = sample.loguniform(0.02, 0.2)
        self.grad_clip = sample.qlograndint(1, 30, 1)
        self.exploration_config = {
            # The Exploration class to use. In the simplest case, this is the name
            # (str) of any class present in the `rllib.utils.exploration` package.
            # You can also provide the python class directly or the full location
            # of your class (e.g. "ray.rllib.utils.exploration.epsilon_greedy.
            # EpsilonGreedy").
            "type": "StochasticSampling",
            # Add constructor kwargs here (if any).
        }

        # Impala & APPO specific and shared HPs
        # shared - overridden with APPO settings
        self.lr_schedule = [
            [0, 1e-1],
            [int(1e2), 1e-2],
            [int(1e3), 1e-3],
            [int(1e4), 1e-4],
            [int(1e5), 1e-5],
            [int(1e6), 1e-6],
            [int(1e7), 1e-7],
        ]

        self.entropy_coeff = self.lr_schedule  # sample.loguniform(5e-5, 1e-2)
        self.vf_loss_coeff = sample.loguniform(1e-3, 1e-1)

        # The factor by which to update the target policy network towards the
        # current policy network. Can range between 0 and 1, e.g.
        # updated_param = tau * current_param + (1 - tau) * target_param
        self.tau = sample.loguniform(0.001, 1.0)

        # The frequency with which to update the target policy and tune the KL
        # loss coefficients used during training. After setting this parameter,
        # the algorithm waits for at least
        # target_update_frequency * minibatch_size * num_sgd_iter samples to be
        # trained on by the learner group before updating the target networks
        # and tuning the KL loss coefficients. NOTE: This parameter is only
        # applicable when using the Learner API (_enable_new_api_stack=True).
        self.target_update_frequency = sample.choice([1, 2, 3])

        ############################
        self.num_multi_gpu_tower_stacks = 0  # APPO & Impala
        self.minibatch_buffer_size = 5  # APPO & Impala
        self.sgd_minibatch_size = sample.choice([64, 32, 128])  # int(batch_size / 2)

        self.replay_proportion = 4  # Enable experience replay
        self.replay_buffer_num_slots = 100  # APPO & Impala
        self.learner_queue_size = 50  # APPO & Impala

        self.learner_queue_timeout = 5000  # APPO & Impala
        self.timeout_s_sampler_manager = 100.0
        self.timeout_s_aggregator_manager = 100.0
        self.broadcast_interval = 1  # APPO & Impala

        self.num_aggregation_workers = 0  # 1 has 4 models
        self.max_requests_in_flight_per_aggregator_worker = (
            2 if self.num_aggregation_workers > 0 else 1
        )
        #############################
        # APPO specific settings:

        self.vtrace = True  # APPO & Impala
        self.use_critic_APPO = False if self.vtrace else True
        self.use_kl_loss = True
        self.use_gae_APPO = False if self.vtrace else True

        # Impala only - since APPOConfig subclasses ImpalaConfig, these can be
        # modified for APPO too
        self.vtrace_clip_rho_threshold = 1.0 if self.vtrace else None
        self.vtrace_clip_pg_rho_threshold = 1.0 if self.vtrace else None
        self.vtrace_drop_last_ts = True if self.vtrace else None

        ###################################
        self.opt_type = "rmsprop" if self.vtrace else "adam"  # APPO & Impala
        self.decay = 0.99 if self.opt_type == "rmsprop" else None
        self.momentum = 0.0 if self.opt_type == "rmsprop" else None
        self.epsilon = (
            sample.loguniform(1e-7, 1e-1) if self.opt_type == "rmsprop" else None
        )

        ###################################
        # Only supported for some algorithms (APPO, IMPALA) on the old API stack
        self._separate_vf_optimizer = False
        self._lr_vf = (
            sample.loguniform(5e-6, 0.003) if self._separate_vf_optimizer else None
        )
        self.after_train_step = None


class APPOLearnerHPs:

    def __init__(self):
        self.params = HPRanges()

    def config(self):
        kwargs = dict(
            # Impala-based configs
            num_sgd_iter=self.params.num_sgd_iter,  # type: ignore
            minibatch_size=self.params.sgd_minibatch_size,  # type: ignore
            entropy_coeff=self.params.entropy_coeff,  # type: ignore
            vf_loss_coeff=self.params.vf_loss_coeff,  # type: ignore
            lr_schedule=self.params.lr_schedule,  # type: ignore
            grad_clip=self.params.grad_clip,  # type: ignore
            minibatch_buffer_size=self.params.minibatch_buffer_size,
            num_multi_gpu_tower_stacks=self.params.num_multi_gpu_tower_stacks,
            vtrace_clip_rho_threshold=self.params.vtrace_clip_rho_threshold,
            vtrace_clip_pg_rho_threshold=self.params.vtrace_clip_pg_rho_threshold,
            _lr_vf=self.params._lr_vf,
            after_train_step=self.params.after_train_step,
            _separate_vf_optimizer=self.params._separate_vf_optimizer,
            # TODO: Mods to look into
            replay_proportion=self.params.replay_proportion,
            replay_buffer_num_slots=self.params.replay_buffer_num_slots,
            learner_queue_size=self.params.learner_queue_size,
            learner_queue_timeout=self.params.learner_queue_timeout,
            max_requests_in_flight_per_aggregator_worker=self.params.max_requests_in_flight_per_aggregator_worker,
            timeout_s_sampler_manager=self.params.timeout_s_sampler_manager,
            timeout_s_aggregator_manager=self.params.timeout_s_aggregator_manager,
            broadcast_interval=self.params.broadcast_interval,
            num_aggregation_workers=self.params.num_aggregation_workers,
            opt_type=self.params.opt_type,
            decay=self.params.decay,
            momentum=self.params.momentum,
            epsilon=self.params.epsilon,
        )
        return APPOConfig().training(
            vtrace=self.params.vtrace,
            use_critic=self.params.use_critic_APPO,
            use_gae=self.params.use_gae_APPO,
            lambda_=self.params.lambda_,  # type: ignore
            clip_param=self.params.clip_param,  # type: ignore
            use_kl_loss=self.params.use_kl_loss,  # type: ignore
            kl_coeff=self.params.kl_coeff,  # type: ignore
            kl_target=self.params.kl_target,  # type: ignore
            tau=self.params.tau,  # type: ignore
            target_update_frequency=self.params.target_update_frequency,  # type: ignore
            **kwargs,
        )
```

Issue Severity

Medium: It is a significant difficulty but I can work around it.

grizzlybearg commented 4 months ago

After further investigation, I've found that this error shows up if you use minibatch_size as part of the hyperparameter mutations dict with PBT or PB2; a sketch of that setup is below. I'm not sure whether this affects other search algorithms. #43467 would solve this.
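
For context, a minimal sketch of the kind of PBT setup that triggers it (the scheduler arguments and mutation ranges here are illustrative, not taken from my full script):

```python
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

# Putting `minibatch_size` in hyperparam_mutations makes PBT write the mutated
# value back into the algorithm's config dict; AlgorithmConfig.update_from_dict()
# then hits the read-only `minibatch_size` property and raises the
# AttributeError shown above.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    perturbation_interval=4,
    metric="episode_reward_mean",
    mode="max",
    hyperparam_mutations={
        "minibatch_size": tune.choice([32, 64, 128]),
        "lr": tune.loguniform(5e-6, 5e-3),
    },
)
```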