Runtime Error with input size 96

princeton-vl / pytorch_stacked_hourglass

Pytorch implementation of the ECCV 2016 paper "Stacked Hourglass Networks for Human Pose Estimation"

BSD 3-Clause "New" or "Revised" License


Runtime Error with input size 96 #21

Closed matteorr closed 3 years ago

matteorr commented 3 years ago

When training a network from scratch with images of input size 96, I get the following error trace:

     84         low3 = self.low3(low2)
     85         up2  = self.up2(low3)
---> 86         return up1 + up2
RuntimeError: The size of tensor a (3) must match the size of tensor b (2) at non-singleton dimension 3

Here is a small snippet to reproduce the error:

import torch
from models.posenet import PoseNet

pose_net = PoseNet(nstack=4, inp_dim=96, oup_dim=16)
inp = torch.rand(16, 96, 96, 3)
oup = pose_net(inp)

If my understanding is correct, this happens because, with input size 96 and depth 4, the max-pooling operations leave up1 with size 3 and up2 with size 2 (since up2 is an upsampling of low3, which has size 1). What would you suggest to solve this?
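The arithmetic behind that mismatch can be traced with a short sketch. It assumes (this is not stated in the thread) that the layers before the first hourglass reduce the spatial size by a factor of 4, that each hourglass level uses 2x2 max pooling with floor rounding, and that upsampling doubles the size. `trace_mismatches` and its parameters are hypothetical names used only for illustration.

```python
def trace_mismatches(image_size, depth, stem_stride=4):
    """List (level, up1_size, up2_size) for every hourglass level whose skip
    branch (up1) and upsampled branch (up2) end up with different sizes.

    Assumptions: the stem divides the image size by `stem_stride`, pooling
    halves the size with floor rounding, and upsampling doubles it.
    """
    size = image_size // stem_stride   # feature-map size entering the hourglass
    mismatches = []
    for level in range(depth, 0, -1):  # level == depth is the outermost call
        pooled = size // 2             # 2x2 max pooling, floor rounding
        upsampled = pooled * 2         # 2x upsampling of the low-resolution branch
        if upsampled != size:
            mismatches.append((level, size, upsampled))
        size = pooled
    return mismatches


print(trace_mismatches(96, depth=4))   # [(1, 3, 2)]
print(trace_mismatches(256, depth=4))  # []
```

Under these assumptions, a 96-pixel input reaches the innermost level with size 3, which pools to 1 and upsamples back to 2, matching the sizes in the error message.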

I understand you might say to just use a larger image size, but could you please provide some indications on how to fix the problem from an architecture point of view?

Thanks!

crockwell commented 3 years ago

While I can't speak exactly to the details of your problem, here's the way I would approach it:

If you look in models/layers.py, the Hourglass class has an argument "n" that determines how many times the hourglass recurses, with each recursive level operating at a lower resolution than the one above it. You may want to decrease it if you are using a resolution smaller than 256x256.

Pooling may behave oddly if your dimensions aren't multiples of 16, which could also cause problems. Or certain output dimensions might be off depending on the input dimension. You'd have to experiment with it if that turned out to be the problem. Maybe slightly changing the image size is the best solution, as you've guessed :)
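Both options (a smaller "n", or a slightly different image size) can be estimated ahead of time. The sketch below is only illustrative: it assumes the layers before the hourglass shrink the image by a factor of 4 and that each hourglass level halves the size, so the exact multiple your setup needs may differ. The function names and the `stem_stride` parameter are hypothetical.

```python
def max_safe_depth(image_size, stem_stride=4):
    """Largest hourglass depth for which every pooling step halves the
    feature map exactly (no odd sizes along the way)."""
    size = image_size // stem_stride  # assumed size entering the hourglass
    depth = 0
    while size > 1 and size % 2 == 0:
        size //= 2
        depth += 1
    return depth


def nearest_compatible_sizes(image_size, depth=4, stem_stride=4):
    """Closest image sizes (below or equal, above or equal) that are a
    multiple of stem_stride * 2**depth. A lower value of 0 means no
    smaller compatible size exists."""
    step = stem_stride * 2 ** depth
    lower = (image_size // step) * step
    upper = lower if lower == image_size else lower + step
    return lower, upper


print(max_safe_depth(96))            # 3
print(nearest_compatible_sizes(96))  # (64, 128)
```

With these assumptions, a 96-pixel input would either need a depth of 3 or a resize to 64 or 128 pixels to keep the depth at 4.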

matteorr commented 3 years ago

Thanks for the quick reply! Brief follow-ups:

Are you talking about this parameter, which is hard-coded to 4?
Could you please elaborate on the impact of changing the number of recursive calls?
I don't see any mention of it in the original paper. Do you know of any study that looked into it?

Thanks a lot again! Feel free to close after your reply.

crockwell commented 3 years ago

Yes, that parameter. By decreasing the number of recursive calls, you decrease the "depth" of the hourglass. In other words, looking at Fig. 3 in the paper, the middle (lowest-resolution) part of the hourglass would no longer be used. You can think of each recursive call as another layer in this diagram. Does that make sense?
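To make the effect of the depth concrete, here is a small sketch that lists the resolutions an hourglass visits for a given "n", assuming each recursive call halves the feature map once. The function name and the 64-pixel feature-map size are illustrative assumptions, not values taken from the thread.

```python
def hourglass_resolutions(feature_size, n):
    """Resolutions visited by an hourglass of depth n: the input resolution
    followed by one halved resolution per recursive call."""
    return [feature_size // 2 ** k for k in range(n + 1)]


print(hourglass_resolutions(64, 4))  # [64, 32, 16, 8, 4]
print(hourglass_resolutions(64, 3))  # [64, 32, 16, 8]
```

Dropping one recursive call removes the last (lowest) resolution from the list, which is the "middle" of the hourglass in the diagram.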

I don't think there are ablations in the paper. A shallower hourglass can probably be assumed to be less effective at high resolution, but in your case it may be the best alternative.

matteorr commented 3 years ago

Makes sense! I'll post updates here if I find out something interesting. Thanks again!