higgsfield / Capsule-Network-Tutorial

PyTorch easy-to-follow Capsule Network tutorial

RuntimeError: index_select() and issue about DataParallel #2


ChengHuang-CH commented 6 years ago

First of all, thanks: it's definitely an easy-to-follow CapsNet tutorial for me as a beginner. However, I found an error after running the code:

RuntimeError: index_select(): argument 'index' must be Variable, not torch.cuda.LongTensor

I solved this issue the same way as https://github.com/gram-ai/capsule-networks/issues/13. In the Decoder class, the offending line is:

masked = masked.index_select(dim=0, index=max_length_indices.squeeze(1).data)

The trailing ".data" should be removed: in PyTorch 0.3, index_select() expects a Variable as its index, and .data unwraps the Variable to a raw tensor.

Then I successfully trained on a single GPU following this tutorial. However, when I tried to train the net on two GPUs following the PyTorch data parallelism tutorial:

if USE_CUDA:
    print("Let's use %d GPUs" % torch.cuda.device_count())
    capsule_net = nn.DataParallel(capsule_net).cuda()

it produced the error AttributeError: 'DataParallel' object has no attribute 'loss'.

I'm confused; if there is a good solution, please let me know. Thanks!

(I use Python 2.7.12 and PyTorch 0.3.0.post4.)

ChengHuang-CH commented 6 years ago

Haha, I am very excited to be here again, since I have solved the problems. Here I will share the solution for DataParallel and my experience with the new PyTorch 0.4.0 on Windows 10.

Firstly, I solved the DataParallel problem: AttributeError: 'DataParallel' object has no attribute 'loss'. The solution came from the topic "How to reach model attributes wrapped by nn.DataParallel?", so I revised the code as follows:

USE_CUDA = True
Use_Dataparallel = False  # start in single-GPU mode when using CUDA

# ...{other code}

# activate DataParallel mode when more than one GPU is available:
if USE_CUDA:
    if torch.cuda.device_count() > 1:
        print("Let's use %d GPUs" % torch.cuda.device_count())
        Use_Dataparallel = True  # switch to multi-GPU mode
        capsule_net = nn.DataParallel(capsule_net).cuda()

# ...{other code}

if Use_Dataparallel:
    # use .module to reach the loss attribute wrapped by nn.DataParallel
    loss = capsule_net.module.loss(inputs, output, target, reconstructions)
else:
    loss = capsule_net.loss(inputs, output, target, reconstructions)  # single-GPU mode
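A slightly more compact alternative (my own sketch, not from the tutorial) is a small helper that unwraps the model whether or not it is wrapped in nn.DataParallel, so the call site does not need the flag:

# hypothetical helper, not part of the tutorial code
def unwrap(model):
    return model.module if isinstance(model, nn.DataParallel) else model

loss = unwrap(capsule_net).loss(inputs, output, target, reconstructions)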

Secondly, I tested this code with the officially released PyTorch 0.4.0 on Windows 10; there are a few things to pay attention to:

(1) A multiprocessing error specific to Windows (see the Windows FAQ):

RuntimeError:
    An attempt has been made to start a new process before the
    current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()

So all code should be put under the if __name__ == '__main__': guard, except the four network definition classes; a minimal sketch of that layout follows.
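Roughly, and as a sketch only (class and variable names here are illustrative, assuming the tutorial's MNIST DataLoader setup):

import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# the network definition classes stay at module level so that
# Windows worker processes can import them
class CapsNet(nn.Module):
    ...

if __name__ == '__main__':
    # everything that spawns DataLoader worker processes goes under the guard
    dataset = datasets.MNIST('./data', download=True,
                             transform=transforms.ToTensor())
    loader = DataLoader(dataset, batch_size=100, shuffle=True, num_workers=2)
    capsule_net = CapsNet()
    # ... training loop ...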

(2) Error about 'torch.sparse'

target= torch.sparse.torch.eye(10).index_select(dim=0, index=target)
AttributeError: module 'torch.sparse' has no attribute 'torch'

According to a similar question, it works after replacing torch.sparse.torch.eye(10) with torch.eye(10).
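The revised line (the same line from the error message above, with torch.eye substituted) reads:

# one-hot encode the batch of labels
target = torch.eye(10).index_select(dim=0, index=target)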

(3) A UserWarning to use tensor.item() instead of .data[0]

UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
train_loss += loss.data[0]  # change loss.data[0] to loss.item() in PyTorch 0.4.0

so it works after being revised as follows: train_loss += loss.item()
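As a minimal standalone check of the difference (my own sketch, for PyTorch 0.4.0):

import torch

loss = torch.tensor(0.5)  # a 0-dim tensor, as a loss value is in 0.4.0
value = loss.item()       # a plain Python float: 0.5
# loss.data[0] here triggers the 0-dim-index warning (an error in later versions)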

(These tests are based on Windows 10 + Python 3.6 + PyTorch 0.4.0.)