vfdev-5 opened this issue 3 years ago
Hi there, trying to refactor it after a little digging into this, following this approach: https://colab.research.google.com/github/pytorch/xla/blob/master/contrib/colab/getting-started.ipynb
We can support this; cc @vfdev-5
something like the following:
// @ydcjeff
@sayantan1410 I updated the issue description, adding a few initial steps on how I would tackle this issue.
@vfdev-5 I will start working as per the description and will let you know if I face any problems.
0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in Colab" one template, for example the vision classification template, install torch_xla manually (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend: `python main.py --nproc_per_node 8 --backend xla-tpu`. If everything is done correctly, training should run.
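The launch command above can be sanity-checked without a TPU by looking at the flags it passes. Below is a minimal sketch of the CLI shape that command assumes; the real template's `main.py` argument handling may differ, and the `choices` list here is an assumption:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical sketch of the template's CLI; the real main.py may differ.
    parser = argparse.ArgumentParser(description="launch template training")
    parser.add_argument("--nproc_per_node", type=int, default=1,
                        help="number of processes to spawn (8 for a Colab TPU)")
    parser.add_argument("--backend", default=None,
                        choices=["nccl", "gloo", "xla-tpu"],
                        help="distributed backend; xla-tpu is required on TPUs")
    return parser.parse_args(argv)

args = parse_args(["--nproc_per_node", "8", "--backend", "xla-tpu"])
print(args.backend, args.nproc_per_node)  # xla-tpu 8
```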
@vfdev-5 Hello, should I do this from the code-generator website or by running it locally, or do both work?
From the code-generator, exporting to Colab as I explained. You can't test this locally since you need TPUs.
@vfdev-5 Got it, Thank you.
@vfdev-5 Hello, I am facing an issue running the Colab notebook. I tried to install torch_xla manually and then start the training. Here's the link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing. The main problem is that whenever I run `!pip install -r requirements.txt`, it uninstalls torch 1.9 and reinstalls another version. But it does not work without installing requirements.txt either.
Can you please check it and let me know what I am missing?
@sayantan1410 can you please update your Colab to show where you call `!pip install -r requirements.txt` and the output it gives? By the way, I also forgot to mention in the description that we need to set the accelerator to TPU (looks like you already set it).
If you check the content of `requirements.txt`:

```
torch>=1.10.1
torchvision>=0.11.2
pytorch-ignite>=0.4.7
pyyaml
```

so it is expected that pip reinstalls torch.
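To see why pip replaces the Colab-provided torch, compare the installed version against the `>=` pin. Here is a small illustrative check under simplified assumptions — `needs_reinstall` is a made-up helper handling only the `name>=x.y.z` form, not pip's actual resolver:

```python
import re

def needs_reinstall(requirement: str, installed: str) -> bool:
    """Return True if an installed version fails a simple 'name>=x.y.z' pin."""
    m = re.match(r"[A-Za-z0-9_.-]+>=([0-9.]+)$", requirement)
    if not m:
        return False  # unpinned entries like "pyyaml" accept any version
    minimum = tuple(int(p) for p in m.group(1).split("."))
    have = tuple(int(p) for p in installed.split("."))
    return have < minimum

# Colab's TPU runtime ships a torch 1.9.x wheel, which fails the
# torch>=1.10.1 pin, so pip uninstalls it and installs a newer
# (non-XLA) build -- breaking torch_xla.
print(needs_reinstall("torch>=1.10.1", "1.9.0"))  # True
```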
You need to temporarily update it like below:

```diff
- torch>=1.10.1
+ torch
- torchvision>=0.11.2
+ torchvision
  pytorch-ignite>=0.4.7
  pyyaml
```
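The same edit can be applied programmatically in a Colab cell before installing, by stripping the version specifiers from the torch lines. This is an illustrative sketch; `relax_pins` is a hypothetical helper, not part of the template:

```python
import re

def relax_pins(lines, names=("torch", "torchvision")):
    """Drop version specifiers for the given packages; keep other lines as-is."""
    out = []
    for line in lines:
        line = line.strip()
        # Package name is everything before the first specifier character.
        pkg = re.split(r"[<>=!~]", line, maxsplit=1)[0]
        out.append(pkg if pkg in names else line)
    return out

reqs = ["torch>=1.10.1", "torchvision>=0.11.2", "pytorch-ignite>=0.4.7", "pyyaml"]
print("\n".join(relax_pins(reqs)))
```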
@vfdev-5 Okay, I will try removing the version pins for those two packages from requirements.txt. And setting the accelerator to TPU was mentioned in the Colab notebook you linked in the description.
@vfdev-5 Hey, I was trying to change requirements.txt from there, but I cannot edit it. So I tried installing the other libraries manually, but the same problem persists. The Colab link is the same as before. Let me know if there is another way to change requirements.txt.
@sayantan1410 the issue in Colab is not with the dependencies but with the way you start the training. Please read the ignite docs on `idist.Parallel`, and also see steps 1 and 2 of this issue description: you have to use another backend, xla-tpu instead of None.
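The difference between backend None and xla-tpu can be illustrated with a plain-Python sketch of the launch decision. This is a hypothetical simplification for the thread, not ignite's actual `idist.Parallel` implementation:

```python
def launch_plan(backend, nproc_per_node=1):
    """Sketch of how the backend flag changes how training is launched."""
    if backend is None:
        # No backend: run serially in the current process (what the failing
        # notebook was effectively doing).
        return {"mode": "serial", "processes": 1}
    if backend == "xla-tpu":
        # On Colab TPUs, one process per TPU core is spawned via torch_xla.
        return {"mode": "spawn", "processes": nproc_per_node}
    if backend in ("nccl", "gloo"):
        # GPU/CPU distributed backends also spawn worker processes.
        return {"mode": "spawn", "processes": nproc_per_node}
    raise ValueError(f"unsupported backend: {backend!r}")

print(launch_plan("xla-tpu", 8))  # {'mode': 'spawn', 'processes': 8}
print(launch_plan(None))          # {'mode': 'serial', 'processes': 1}
```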
@vfdev-5 Hey, I tried to run the code but am getting an "Aborted: Session 96f4ae2c056673d1 is not found" error. Can I start working on the UI in the meantime while I try to solve the Colab issue? Link to the notebook: https://colab.research.google.com/drive/15tlo1Js4vCXSDB5yqLQJ9byvwvtoEuEU?usp=sharing
> Can I start working on the UI in the meantime while I am trying to solve the colab issue?
yes, that's the final goal of the issue. The work with colab is a step 0 to check if things could work.
@vfdev-5 > "Aborted: Session 96f4ae2c056673d1 is not found"
Can you please check once why is this coming ?
Looks like an internal issue with TPUs on Colab; try "Factory reset runtime" and see if the issue persists. If you have a Kaggle account, you can also check the same code on their TPUs.
@vfdev-5 I tried "Factory reset runtime", but that did not work. I will try running it on Kaggle notebooks. Also, for the UI update part, I have done something like this: should I populate the dropdown by creating a "backend.json" and loading it in `TabTemplates.vue`, or is there a better way? Also, what is the next step?
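One way the suggested "backend.json" could look, generated here in Python for concreteness. The file name comes from the comment above; every field name and label below is a guess, since the schema actually used by the code-generator UI is not shown in this thread:

```python
import json

# Hypothetical backend.json content for the backend dropdown; "xla-tpu" is
# the new entry this issue asks for.
backends = [
    {"label": "None (serial)", "value": None},
    {"label": "nccl (GPU)", "value": "nccl"},
    {"label": "gloo (CPU)", "value": "gloo"},
    {"label": "xla-tpu (TPU)", "value": "xla-tpu"},
]
text = json.dumps({"backends": backends}, indent=2)
print(text)
```

A Vue component such as `TabTemplates.vue` could then import this file and bind the list to a select element, but the wiring depends on the app's existing store.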
Clear and concise description of the problem

It would be good to provide an option to select the accelerator as TPU instead of GPU. We can also auto-select the TPU accelerator if opened with Colab, and add torch_xla installation steps.

What to do:

0) Try a template with TPUs. Choose the distributed training option with 8 processes and the spawning option. "Open in Colab" one template, for example the vision classification template, install torch_xla manually (see https://colab.research.google.com/drive/1E9zJrptnLJ_PKhmaP5Vhb6DTVRvyrKHx) and run the code with the xla-tpu backend: `python main.py --nproc_per_node 8 --backend xla-tpu`. If everything is done correctly, training should run.
1) Update the UI.

Suggested solution

Alternative

Additional context