vfdev-5 opened this issue 3 years ago
Hi there, trying to refactor it after a little digging into this, following this approach: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb
We can support this; cc @vfdev-5
something like the following:
// @ydcjeff
@sayantan1410 I updated the issue description, adding a few initial steps on how I would tackle this issue.
@vfdev-5 I will start working as per the description and will let you know if I face any problems.
0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in Colab" one template, for example the vision classification template, install torch_xla manually (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend: `python main.py --nproc_per_node 8 --backend xla-tpu`. If everything is done correctly, training should run.
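The launch command above can be sanity-checked without a TPU by looking at the flags it passes. Below is a minimal sketch of the CLI shape that command assumes; the real template's `main.py` argument handling may differ, and the `choices` list here is an assumption:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the template's CLI; the real main.py may differ.
    parser = argparse.ArgumentParser(description="launch template training")
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="number of processes to spawn (8 for a Colab TPU)")
    parser.add_argument("--backend", default=None,
                        choices=["nccl", "gloo", "xla-tpu"],
                        help="distributed backend; xla-tpu is required on TPUs")
    return parser.parse_args(argv)

args = parse_args(["--nproc_per_node", "8", "--backend", "xla-tpu"])
print(args.backend, args.nproc_per_node)  # xla-tpu 8
```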
@vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?
From the code-generator, exporting to Colab as I explained. You can't test this locally since you need TPUs.
@vfdev-5 Got it, Thank you.
@vfdev-5 Hello, I am facing an issue running the Colab notebook. I tried to install torch_xla manually and then start the training. Here's the link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing. The main problem is that whenever I run `!pip install -r requirements.txt`, it uninstalls torch 1.9 and reinstalls another version. But it does not work without installing requirements.txt either.
Can you please check it and let me know what I am missing?
@sayantan1410 can you please update your Colab to show where you call `!pip install -r requirements.txt` and the output it gives? By the way, I also forgot to mention in the description that we need to set the accelerator to TPU (looks like you already set it).
If you check the content of `requirements.txt`:

```
torch>=1.10.1
torchvision>=0.11.2
pytorch-ignite>=0.4.7
pyyaml
```

so it is expected that pip reinstalls torch.
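To see why pip replaces the Colab-provided torch, compare the installed version against the `>=` pin. Here is a small illustrative check under simplified assumptions — `needs_reinstall` is a made-up helper handling only the `name>=x.y.z` form, not pip's actual resolver:

```python
import re

def needs_reinstall(requirement: str, installed: str) -> bool:
    """Return True if an installed version fails a simple 'name>=x.y.z' pin."""
    m = re.match(r"[A-Za-z0-9_.-]+>=([0-9.]+)$", requirement)
    if not m:
        return False  # unpinned entries like "pyyaml" accept any version
    minimum = tuple(int(p) for p in m.group(1).split("."))
    have = tuple(int(p) for p in installed.split("."))
    return have < minimum

# Colab's TPU runtime ships a torch 1.9.x wheel, which fails the
# torch>=1.10.1 pin, so pip uninstalls it and installs a newer
# (non-XLA) build -- breaking torch_xla.
print(needs_reinstall("torch>=1.10.1", "1.9.0"))  # True
```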
You need to temporarily update it like below:

```diff
- torch>=1.10.1
+ torch
- torchvision>=0.11.2
+ torchvision
  pytorch-ignite>=0.4.7
  pyyaml
```
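The same edit can be applied programmatically in a Colab cell before installing, by stripping the version specifiers from the torch lines. This is an illustrative sketch; `relax_pins` is a hypothetical helper, not part of the template:

```python
import re

def relax_pins(lines, names=("torch", "torchvision")):
    """Drop version specifiers for the given packages; keep other lines as-is."""
    out = []
    for line in lines:
        line = line.strip()
        # Package name is everything before the first specifier character.
        pkg = re.split(r"[<>=!~]", line, maxsplit=1)[0]
        out.append(pkg if pkg in names else line)
    return out

reqs = ["torch>=1.10.1", "torchvision>=0.11.2", "pytorch-ignite>=0.4.7", "pyyaml"]
print("\n".join(relax_pins(reqs)))
```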
@vfdev-5 Okay, I will try removing the version pins for those two packages from requirements.txt. And setting the accelerator to TPU was mentioned in the Colab notebook you linked in the description.
@vfdev-5 Hey, I was trying to change requirements.txt from there, but I cannot edit it. So I tried installing the other libraries manually, but the same problem persists. The Colab link is the same as before. Let me know if there is another way to change requirements.txt.
@sayantan1410 the issue in Colab is not with the dependencies but with the way you start the training. Please read the ignite docs on `idist.Parallel`, and also see steps 1 and 2 of this issue description: you have to use another backend, xla-tpu instead of None.
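The difference between backend None and xla-tpu can be illustrated with a plain-Python sketch of the launch decision. This is a hypothetical simplification for the thread, not ignite's actual `idist.Parallel` implementation:

```python
def launch_plan(backend, nproc_per_node=1):
    """Sketch of how the backend flag changes how training is launched."""
    if backend is None:
        # No backend: run serially in the current process (what the failing
        # notebook was effectively doing).
        return {"mode": "serial", "processes": 1}
    if backend == "xla-tpu":
        # On Colab TPUs, one process per TPU core is spawned via torch_xla.
        return {"mode": "spawn", "processes": nproc_per_node}
    if backend in ("nccl", "gloo"):
        # GPU/CPU distributed backends also spawn worker processes.
        return {"mode": "spawn", "processes": nproc_per_node}
    raise ValueError(f"unsupported backend: {backend!r}")

print(launch_plan("xla-tpu", 8))  # {'mode': 'spawn', 'processes': 8}
print(launch_plan(None))          # {'mode': 'serial', 'processes': 1}
```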
@vfdev-5 Hey, I tried to run the code but am getting an "Aborted: Session 96f4ae2c056673d1 is not found" error. Can I start working on the UI in the meantime while I try to solve the Colab issue? Link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing
> Can I start working on the UI in the meantime while I am trying to solve the colab issue?
yes, that's the final goal of the issue. The work with colab is a step 0 to check if things could work.
@vfdev-5 > "Aborted: Session 96f4ae2c056673d1 is not found"
Can you please check once why is this coming ?
Looks like an internal issue with TPUs on Colab; try "Factory reset runtime" and see if the issue persists. If you have a Kaggle account, you can also check the same code on their TPUs.
@vfdev-5 I tried "Factory reset runtime", but that did not work. I will try running it on Kaggle notebooks. Also, for the UI update part, I have done something like this: should I populate the dropdown by creating a "backend.json" and loading it in `TabTemplates.vue`, or is there a better way? Also, what is the next step?
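One way the suggested "backend.json" could look, generated here in Python for concreteness. The file name comes from the comment above; every field name and label below is a guess, since the schema actually used by the code-generator UI is not shown in this thread:

```python
import json

# Hypothetical backend.json content for the backend dropdown; "xla-tpu" is
# the new entry this issue asks for.
backends = [
    {"label": "None (serial)", "value": None},
    {"label": "nccl (GPU)", "value": "nccl"},
    {"label": "gloo (CPU)", "value": "gloo"},
    {"label": "xla-tpu (TPU)", "value": "xla-tpu"},
]
text = json.dumps({"backends": backends}, indent=2)
print(text)
```

A Vue component such as `TabTemplates.vue` could then import this file and bind the list to a select element, but the wiring depends on the app's existing store.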
Clear and concise description of the problem

It would be good to provide an option to select the accelerator as TPU instead of GPU. We can also auto-select the TPU accelerator if opened with Colab, and add torch_xla installation steps.

What to do:

0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in Colab" one template, for example the vision classification template, install torch_xla manually (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend: `python main.py --nproc_per_node 8 --backend xla-tpu`. If everything is done correctly, training should run.
1) Update the UI.

Suggested solution

Alternative

Additional context