vye16 / slahmr

MIT License
441 stars 50 forks

NaN Values when optimizing over larger batch sizes or odd pose videos #31

Closed smandava98 closed 12 months ago

smandava98 commented 12 months ago

Hi!

I've been experimenting with ways to make the motion chunk optimization step faster when running inference on a single video (it takes 1–2+ hours on an A10 GPU, whereas without the motion chunk step it takes 10 minutes). I tried a batch size of 10. The results are still to be determined, but I've noticed that I sometimes get NaN values in the forward pass:

File "/home/ubuntu/slahmr/slahmr/optim/losses.py", line 258, in forward
    cur_loss = self.init_motion_prior_loss(
  File "/home/ubuntu/mambaforge/envs/slahmr/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/slahmr/slahmr/optim/losses.py", line 550, in forward
    loss = -self.init_motion_prior["gmm"].log_prob(init_state)
  File "/home/ubuntu/mambaforge/envs/slahmr/lib/python3.10/site-packages/torch/distributions/mixture_same_family.py", line 150, in log_prob
    self._validate_sample(x)
  File "/home/ubuntu/mambaforge/envs/slahmr/lib/python3.10/site-packages/torch/distributions/distribution.py", line 294, in _validate_sample
    raise ValueError(
ValueError: Expected value argument (Tensor of shape (12, 138)) to be within the support (IndependentConstraint(Real(), 1)) of the distribution MixtureSameFamily(
  Categorical(probs: torch.Size([12]), logits: torch.Size([12])),
  MultivariateNormal(loc: torch.Size([12, 138]), covariance_matrix: torch.Size([12, 138, 138]))), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='cuda:0',
       grad_fn=<CatBackward0>)

This happens when I either increase the batch size or use a video where someone starts in an unnatural pose (e.g., hanging upside down). Is this because the optimization is failing to converge? Is there a way to remedy this? If increasing the batch size is not possible, what are the recommended ways to make this run faster on a single video? The motion chunk optimization step takes a huge amount of time on a good GPU for a single video.

Note: without the motion chunk optimization step, all videos run fine, so this issue only arises at that final step.
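One pragmatic workaround for the crash itself (a sketch, not part of the slahmr codebase; `safe_gmm_nll` and `log_prob_fn` are hypothetical names) is to mask non-finite rows out of the batch before the GMM `log_prob` call, since `_validate_sample` raises on any NaN, so a few diverged tracks don't abort the whole step:

```python
import numpy as np

def safe_gmm_nll(log_prob_fn, init_state):
    """Drop rows containing NaN/Inf before evaluating the GMM log-prob.

    Hypothetical guard, not slahmr's actual code: the distribution's
    sample validation raises on any non-finite value, so we filter
    invalid rows out of the batch first and average over the rest.
    """
    valid = np.isfinite(init_state).all(axis=-1)  # shape (batch,)
    if not valid.any():
        raise RuntimeError("all rows are non-finite; optimization diverged")
    return -log_prob_fn(init_state[valid]).mean()
```

This only masks the symptom; the NaNs still indicate that those sequences have diverged and their chunks will need re-initialization.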

Thanks a lot!

vye16 commented 12 months ago

This happens because the optimization is already failing at that point. Because we don’t know the global displacement at all initially, optimization in the latent space of the motion prior needs to be slowly guided toward the final solution. In these cases, you can try starting with better initializations in the global frame if you want to speed up optimization, but it’s still something we are looking into as well.
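One generic way to act on that diagnosis (a sketch under assumptions, not slahmr's actual optimization loop; `step_fn`, the halving schedule, and `max_retries` are all hypothetical) is to watch the chunk loss for non-finite values and restart the chunk with a smaller step size instead of letting NaNs propagate into later chunks:

```python
import math

def optimize_chunk(step_fn, n_iters, lr, max_retries=2):
    """Run one chunk's optimization, restarting with a halved learning
    rate whenever the loss goes non-finite (hypothetical sketch)."""
    for attempt in range(max_retries + 1):
        diverged = False
        for it in range(n_iters):
            loss = step_fn(lr)  # one optimization step; returns scalar loss
            if not math.isfinite(loss):
                diverged = True
                break
        if not diverged:
            return lr  # the learning rate that finished cleanly
        lr *= 0.5  # back off and retry the chunk from its initialization
    raise RuntimeError("optimization diverged even after reducing lr")
```

The idea is simply to fail fast and retry conservatively rather than carry NaNs forward; a real restart would also need to reset the chunk's parameters to their initialization.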


smandava98 commented 12 months ago

Thanks for your reply. If I only care about having the motion fit the motion prior, but don't need it optimized in the global space, where can I comment out that processing step to reduce the processing time?

I would still want to use the HuMoR motion prior and the GMM loss, but skip the world-scene optimization.

vye16 commented 12 months ago

The motion prior captures both the local pose and the displacement in the global frame. If you only want to optimize the local body pose in each frame independently, you can just optimize with the first two stages and skip the fitting stage with HuMoR entirely.


smandava98 commented 12 months ago

Ah. I experimented with that, and the motion prior fitting does a significantly better job than going without it. My only issue is how long it takes, even on a good GPU (1–2+ hours per video). I will try experimenting with better initializations then. Thanks for the help!

smandava98 commented 12 months ago

If I skip the optimization step (i.e., the motion optimizer chunks) entirely, would the final values be the latest npz file under the "smooth_fit" folder?

Also, if I skip the motion optimizer chunks, will I still have the outputs in world coordinates (just unoptimized), or will they be in local-frame coordinates?

vye16 commented 12 months ago

Yes, they'll be under smooth_fit, and yes, they'll be in the world coordinates, just unoptimized.
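For picking up those results, a minimal sketch (the directory layout is assumed from this thread; `load_latest_smooth_fit` is a hypothetical helper, not a slahmr function) is to sort the .npz files under "smooth_fit" and load the most recent one:

```python
import glob
import os

import numpy as np

def load_latest_smooth_fit(result_dir):
    """Load the latest .npz under result_dir/smooth_fit (hypothetical
    helper; assumes filenames sort in iteration order)."""
    paths = sorted(glob.glob(os.path.join(result_dir, "smooth_fit", "*.npz")))
    if not paths:
        raise FileNotFoundError("no smooth_fit results found in " + result_dir)
    return dict(np.load(paths[-1]))
```

Sorting lexicographically works here only if the checkpoints use zero-padded iteration numbers in their filenames, which is an assumption worth checking against your own output directory.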
