kyegomez / Sophia

Effortless plug-and-play optimizer to cut model training costs by 50%. A new optimizer that is 2x faster than Adam on LLMs.

Using Megatron to Train GPT-3 #21

Open Kingsleyandher opened 1 year ago

Kingsleyandher commented 1 year ago

Hello, I hit an error when using the Sophia optimizer to train GPT-3 with Megatron. The problem is that the gradient passed into the optimizer is not in a requires_grad = True state, so the second derivative cannot be computed. Do you know how to solve this?

File "/root/miniconda3/envs/torch18/lib/python3.7/site-packages/torch/autograd/__init__.py", line 277, in grad allow_unused, accumulate_grad=False) # Calls into the C++ engine to run the backward pass RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.

Upvote & Fund

Fund with Polar

Kingsleyandher commented 1 year ago
class HutchinsonEstimator(HessianEstimator):
    def estimate(self, p, grad):
        # Hutchinson's trick: for u ~ N(0, I), u * (H u) is an unbiased
        # estimate of the Hessian diagonal for parameter p.
        u = torch.randn_like(grad)
        grad_dot_u = torch.sum(grad * u)
        print(f"grad_dot_u requires grad: {grad_dot_u.requires_grad}")  # -> False

        # ↓  RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn.
        hessian_vector_product = torch.autograd.grad(
            grad_dot_u, p, retain_graph=True)[0]
        return u * hessian_vector_product
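
The torch.autograd.grad(grad_dot_u, p, ...) call inside estimate can only succeed if the grad argument still has a grad_fn. Megatron's standard path (loss.backward() followed by reading p.grad) produces detached gradients, which reproduces the error above. Below is a minimal sketch of a workaround, assuming the estimate(p, grad) signature pasted above; the surrounding wiring is hypothetical and not Megatron's or Sophia's actual API:

import torch

def sophia_hessian_update(model, loss, estimator):
    """Illustrative only: the point is that the gradients handed to
    estimator.estimate() must be created with create_graph=True."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First-order gradients that still carry a grad_fn.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    hessian_estimates = []
    for p, g in zip(params, grads):
        p.grad = g.detach()  # detached copy for the normal first-order update
        hessian_estimates.append(estimator.estimate(p, g))  # second derivative now works
    return hessian_estimates
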
Kingsleyandher commented 1 year ago

This problem looks the same as #7.

liuslnlp commented 1 year ago

Hello @Kingsleyandher, I am running into the same issue. Have you solved it?