ml-explore / mlx

MLX: An array framework for Apple silicon
https://ml-explore.github.io/mlx/
MIT License

[BUG] Implementing entropy in Adam returns a fixed loss without converging #1581

Closed thegodone closed 3 days ago

thegodone commented 1 week ago

Describe the bug
Trying to implement an entropy-based Adam variant (source here: https://pub.aimind.so/enhancing-adam-with-gradient-entropy-for-optimizer-regularization-c1f05248c980) returns an error.

To Reproduce
Code snippet:

from typing import Callable, List, Union

import mlx.core as mx
from mlx.optimizers import Adam


class AdamWithEntropy(Adam):
    def __init__(
        self,
        learning_rate: Union[float, Callable],
        betas: List[float] = [0.9, 0.999],
        eps: float = 1e-8,
        weight_decay: float = 0.01,
        beta_entropy: float = 0.0001,
    ):
        super().__init__(learning_rate=learning_rate, betas=betas, eps=eps)
        self.weight_decay = weight_decay
        self.beta_entropy = beta_entropy  # Factor for scaling entropy influence on LR
        self.eps = eps

    def compute_gradient_entropy(self, gradients):
        # Flatten and concatenate gradients
        flattened_grads = mx.concatenate([mx.flatten(grad) for grad in gradients if grad is not None])

        # Normalize gradients to obtain a probability distribution
        flattened_grads = mx.abs(flattened_grads) + self.eps
        flattened_grads /= mx.sum(flattened_grads)

        # Compute entropy
        entropy = -mx.sum((flattened_grads * mx.log(flattened_grads)))
        return entropy

    def apply_single(self, gradient: mx.array, parameter: mx.array, state: dict):
        # Step 1: Calculate initial learning rate
        lr = self.learning_rate.astype(gradient.dtype)

        # Step 2: Apply weight decay before applying Adam update
        parameter = parameter * (1 - lr * self.weight_decay)

        # Step 3: Core Adam update
        updated_param = super().apply_single(gradient, parameter, state)

        # Step 4: Compute entropy and adjust learning rate
        entropy = self.compute_gradient_entropy([gradient])  # For current batch/parameter
        adjusted_lr = lr * (1 + self.beta_entropy * mx.minimum(entropy, 0.1))  # Cap entropy effect

        # Apply entropy-adjusted learning rate to parameter update
        return updated_param * adjusted_lr
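
For reference, I drive the optimizer with the standard MLX training pattern; a minimal sketch of that loop is below (the model, data, and loss are just placeholders, not my real setup):

import mlx.core as mx
import mlx.nn as nn

# Placeholder model and data, only to show how the optimizer is exercised
model = nn.Linear(16, 1)
optimizer = AdamWithEntropy(learning_rate=1e-3)

def loss_fn(model, x, y):
    return nn.losses.mse_loss(model(x), y)

loss_and_grad_fn = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((32, 16))
y = mx.random.normal((32, 1))

for step in range(100):
    loss, grads = loss_and_grad_fn(model, x, y)
    optimizer.update(model, grads)
    mx.eval(model.parameters(), optimizer.state)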

I fixed the original error (it needed mx.minimum instead of Python's min for pure-eval compilation), but this optimizer modification does not converge at all: the loss stays at its initial value no matter how many epochs it runs. Can someone help?
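
For reference, the difference in a small sketch (assuming entropy is a scalar mx.array):

import mlx.core as mx

entropy = mx.array(0.5)

# mx.minimum stays in the lazy graph, so it is fine inside compiled /
# pure-eval functions
capped = mx.minimum(entropy, 0.1)

# Python's builtin min() has to turn the array comparison into a Python bool,
# which forces an evaluation; this is what broke pure-eval compilation for me
# capped = min(entropy, 0.1)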


angeloskath commented 3 days ago

I can't access the link since it is behind a paywall, but looking at your code above, I think you need to adjust the step rather than the new parameter, which means you probably need to update it as follows:

# Step 3: Core Adam update
updated_param = super().apply_single(gradient, parameter, state)
delta = updated_param - parameter

...

return parameter + delta * adjusted_lr
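
Putting it together, a minimal sketch of apply_single with that change (untested; here only the multiplicative entropy factor is applied to the step, since Adam already folds the learning rate into delta):

def apply_single(self, gradient: mx.array, parameter: mx.array, state: dict):
    lr = self.learning_rate.astype(gradient.dtype)

    # Decoupled weight decay, applied before the Adam step
    parameter = parameter * (1 - lr * self.weight_decay)

    # Core Adam update, then isolate the step (delta) it produced
    updated_param = super().apply_single(gradient, parameter, state)
    delta = updated_param - parameter

    # Entropy-based scale factor; Adam already applied lr inside delta,
    # so only the (1 + beta_entropy * entropy) term is used here
    entropy = self.compute_gradient_entropy([gradient])
    scale = 1 + self.beta_entropy * mx.minimum(entropy, 0.1)

    # Scale the step, not the updated parameter
    return parameter + delta * scale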

In addition to the above, I think this isn't necessarily an MLX bug, so I will close this issue. If you have any reason to believe there is an MLX bug preventing convergence, feel free to reopen it.

thegodone commented 3 days ago

Thank you, it converges now (it is not very good, but it works!). Thanks again!