snuspl / nimble

Lightweight and Parallel Deep Learning Framework

Questions about compatible version of torchvision #4

Open jp7c5 opened 3 years ago

jp7c5 commented 3 years ago

Hello. Thanks for sharing this project.

I was able to install Nimble following the installation guide; the resulting torch version is `1.4.0a0+61ec0ca`. To use torch with torchvision, I installed torchvision (the CUDA 10.2 build) with `pip install torchvision==0.5.0 -f https://download.pytorch.org/whl/cu102/torch_stable.html`. Since this reinstalls a different version of PyTorch, I then removed PyTorch and rebuilt Nimble. I'm not sure whether this method is correct, but I can import both `torch==1.4.0a0+61ec0ca` and `torchvision==0.5.0`.

However, I'm getting an error that seems to be related to torchvision. For example, after `import torch`, calling `torch.ops.torchvision.nms` raises `RuntimeError: No such operator torchvision::nms`.

Since the example code in the README uses torchvision, could you let me know how to install a version of torchvision that is compatible with Nimble?

gyeongin commented 3 years ago

When we build PyTorch from source, we should also build torchvision from source, for exactly the reason you've mentioned: pip-installing torchvision reinstalls a different version of PyTorch.

You should:

  1. clone the torchvision repo
  2. check out the v0.5.0 tag (because torchvision v0.5.0 is the latest version compatible with PyTorch v1.4.1)
  3. run `python setup.py install`, as sketched below
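
A minimal shell sketch of these steps, assuming the upstream torchvision repository URL and that you run the build inside the same environment that holds the Nimble-built PyTorch:

```bash
# Clone torchvision and pin it to the release that matches the bundled PyTorch
git clone https://github.com/pytorch/vision.git
cd vision
git checkout v0.5.0

# Build and install against the PyTorch already installed in this environment
python setup.py install
```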

Note that running torchvision's NMS operation under Nimble will be problematic. Nimble is built for optimized GPU task scheduling, so any PyTorch module passed to Nimble must perform all of its computation on the GPU. torchvision's NMS implementation does not satisfy this constraint: it performs some of its logic on the CPU.

You can try one of these two options:

  1. Carve out the "GPU-only", "static" part(s) of your PyTorch module, apply Nimble to those parts separately, and wire the resulting Nimble modules together with the rest of your PyTorch module (see the sketch after this list).
  2. Adopt a GPU-only NMS implementation. TensorRT's batchedNMS and NMS plugins could be a good choice.
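
As a rough sketch of option 1, assuming the `torch.cuda.Nimble` wrapper and `prepare` call from the README (the backbone/NMS split here is a hypothetical example, not your actual model):

```python
import torch
import torchvision

# Hypothetical split: a GPU-only "static" backbone that Nimble can handle,
# plus NMS post-processing that stays outside of Nimble.
backbone = torchvision.models.resnet18().cuda().eval()

dummy_input = torch.randn(1, 3, 224, 224).cuda()
nimble_backbone = torch.cuda.Nimble(backbone)      # wrapper from the README
nimble_backbone.prepare(dummy_input, training=False)

def run(image, boxes, scores):
    # The GPU-only static part runs through Nimble...
    features = nimble_backbone(image)
    # ...while NMS stays outside, as ordinary torchvision code.
    keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
    return features, keep
```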
jp7c5 commented 3 years ago

Thanks for the quick reply.

Following your suggestion, I built torchvision from source and, surprisingly, the NMS-related error no longer shows up. However, I'm now getting the following error: `AttributeError: module 'torch.distributed' has no attribute 'init_process_group'`. I saw #1 , so is this expected given the current status?

Without the distributed setting, the default training code runs smoothly. While applying Nimble to this single-GPU setup, I noticed that the model to be wrapped by Nimble has to follow a strict input and output format (mostly consisting of torch Tensors). I don't know whether this is a hard requirement, but if not, relaxing this condition would make Nimble easier to use :)
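
To illustrate, here is a rough sketch of the kind of adapter I mean; `DictModel` is just a toy stand-in, and I'm assuming the `torch.cuda.Nimble` wrapper from the README and that tuple-of-Tensor outputs are accepted:

```python
import torch

# Toy module that returns a dict -- this does not match the Tensor-only
# input/output format that Nimble seems to expect.
class DictModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 8, kernel_size=3)

    def forward(self, x):
        y = self.conv(x)
        return {"features": y, "pooled": y.mean(dim=(2, 3))}

# Adapter that unpacks the dict into a tuple of Tensors so the wrapped
# module matches the stricter format.
class TensorOnly(torch.nn.Module):
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, x):
        out = self.inner(x)
        return out["features"], out["pooled"]

model = TensorOnly(DictModel()).cuda().eval()
dummy = torch.randn(1, 3, 32, 32).cuda()

nimble_model = torch.cuda.Nimble(model)       # assumed API from the README
nimble_model.prepare(dummy, training=False)   # assumed API from the README
features, pooled = nimble_model(dummy)
```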