Some problems in experiments

woodenchild95 commented 2 weeks ago

Dear authors, I recently came across your LOZO work, and I think it is an outstanding piece of work. Both in terms of implementation and its efficiency, it has remarkable properties. While attempting to reproduce its experimental results, I encountered an issue and would like to inquire about the possible causes:

In my testing experiment with OPT-1.3B-sst2, I set v=1, meaning each sample in a batch has independent U and V sampling. The effect of v=1 has already been very good. However, I noticed that in the loss curve reported on the right side of Figure 4 on page 21 of the paper, v=1 almost does not converge. I would like to ask about the possible reasons for this and what the corresponding hyperparameters are. In my reproduction, I directly adopted the hyperparameter selection from mezo: lr=1e-6, perturbation e=1e-3, and rank=1, which achieves almost the same effect as mezo. Additionally, when rank=4, the performance even becomes worse. I would like to ask if you have encountered this phenomenon in your experiments. It seems related to the variance of the subspace projection onto the true gradient, but unfortunately, I did not observe any reduction in variance.

LiyuanCao commented 2 weeks ago

We appreciate your interest in our paper.

We would like to clarify a few points regarding the experiment depicted in the right panel of Figure 4 on page 21. The results shown there are from an ablation study conducted with rank = 2 and different values of 𝜈. Other hyperparameters remain consistent with those listed in Table 5. We apologize for any confusion this may have caused.

Regarding your suggestion that setting 𝜈 = 1 with rank = 1 should yield performance comparable to MeZO, we reran this experiment using the settings you specified. However, the results did not achieve the same level of performance as MeZO. Could you please provide a more detailed log of your experiments? This information would help us investigate the issue more thoroughly. In our experiments, we observed that setting 𝜈 too low often leads to insufficient subspace optimization, which generally results in suboptimal performance.

Additionally, you noted that increasing the rank leads to a drop in performance. This observation is consistent with our findings. As shown in Table 6 on page 22, our algorithm attains the best results with a rank of 2, while the performance declines when the rank is increased to 8.

Regarding this phenomenon, our explanation is that a higher rank does not necessarily result in reduced variance in the gradient estimator. Although a larger rank may lower the variance of the subspace projection onto the true gradient, it simultaneously increases the complexity of the inner optimization problem (as described in Eq. (14a) on page 6) due to its expanded scale. This increased complexity can counteract the expected variance reduction, thus not necessarily improving overall performance. Therefore, we believe that using a relatively smaller rank is optimal, as it both preserves the low-rank structure of the gradient estimator and ensures minimal memory overhead. If you wish to improve performance with a higher rank, we recommend experimenting with a larger value of 𝜈 and a reduced learning rate to better handle the increased complexity of the inner problem.

If you have additional questions, please feel free to open an issue on our GitHub repository. We are happy to assist further. Thank you again for your engagement!

woodenchild95 commented 1 week ago

@LiyuanCao Thank you for your prompt response. Regarding the experiment with v=1, in fact, each batch re-samples U and V anew. Here’s my test implementation:

‘’‘ def zo_perturb_parameters(self, random_seed=None, scaling_factor=1): """ Perturb the parameters with random vector z. Input:

random_seed: random seed for MeZO in-place perturbation (if it's None, we will use self.zo_random_seed)

scaling_factor: theta = theta + scaling_factor z eps """

# Set the random seed to ensure that we sample the same z for perturbation/update
torch.manual_seed(random_seed if random_seed is not None else self.zo_random_seed)

for name, param in self.named_parameters_to_optim:
    ### Mezo
    # z = torch.normal(mean=0, std=1, size=param.data.size(), device=param.data.device, dtype=param.data.dtype)

    ### LOZO
    if param.data.dim() == 2:
        ml, nl = param.data.size()
        u_size = torch.Size([ml, self.low_rank])
        v_size = torch.Size([self.low_rank, nl])
        ul = torch.normal(mean=0, std=1, size=u_size, device=param.data.device, dtype=param.data.dtype)
        vl = torch.normal(mean=0, std=1, size=v_size, device=param.data.device, dtype=param.data.dtype)
        z = torch.mm(ul, vl)
    elif param.data.dim() == 1:
        z = torch.normal(mean=0, std=1, size=param.data.size(), device=param.data.device, dtype=param.data.dtype)
    else:
        raise NotImplementedError("Dimension is larger than 2! Please consider high demension tensors.")

    param.data = param.data + scaling_factor * z * self.args.zo_eps / self.scalar

’‘’ Below is the segment related to LOZO extracted from my modifications to the MeZO code.

My test environment is: Pytorch 2.0.1, Cuda 11.7, Driver Version 560.94, Transformers 4.25.1 My test setups: lr = 1e-6, eps = 1e-3, and I use float16 type.

I am trying to further accelerate LOZO's convergence, but Im not sure that my implementation strictly follows your algorithm design. This setup seems to yield results similar to MeZO. The log file from MeZO doesn’t seem to have much additional information; the outputs mostly show parameters related to the OPT model. I used MeZO’s default settings, with modifications limited to the code parts mentioned above.

optsuite / LOZO

Some problems in experiments #1