pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

The project hits ProcessExitedException on Colab with no obvious error message #2725

Closed: sugangsky closed this issue 3 years ago

sugangsky commented 3 years ago

❓ Questions and Help

This is my project on Colab: [here](https://colab.research.google.com/drive/1mgtHa_qPtUr2jSDk1sXK0bXpKEmI51Lf?usp=sharing). It finished with a ProcessExitedException, and its error message is not clear.

@taylanbil

taylanbil commented 3 years ago

Any way for us to reproduce this? Seems it would not run on our side due to dataset not being present.

Does this happen with 1 core? `xmp.spawn(map_fn, args=(params,), nprocs=1, start_method='fork')`
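Since the complaint is that ProcessExitedException hides the real error, one possible approach is to print the traceback from inside the worker before it exits. The sketch below is only an illustration: `with_traceback` and `run_single_core` are hypothetical helper names, not part of torch_xla, and the `xmp.spawn` call is copied from the suggestion above. It assumes the worker has the usual `map_fn(index, params)` signature. If the process is killed by the operating system (for example for running out of memory), no Python traceback will appear at all.

```python
import functools
import sys
import traceback


def with_traceback(fn):
    """Hypothetical helper: print the full traceback in the worker, then re-raise."""
    @functools.wraps(fn)
    def wrapper(index, *args):
        try:
            return fn(index, *args)
        except Exception:
            # Print inside the child process, where the traceback still exists.
            print(f"[worker {index}] failed:", file=sys.stderr)
            traceback.print_exc()
            sys.stderr.flush()
            raise
    return wrapper


def run_single_core(map_fn, params):
    """Hypothetical launcher: one process, as suggested above.

    Requires torch_xla and a TPU runtime, so the import is done lazily.
    """
    import torch_xla.distributed.xla_multiprocessing as xmp
    xmp.spawn(with_traceback(map_fn), args=(params,), nprocs=1, start_method='fork')
```

With `start_method='fork'` the wrapped function does not need to be picklable, which is why a closure works here.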

sugangsky commented 3 years ago

If I set nprocs=1 and num_workers=1, the Colab runtime crashes and restarts.

sugangsky commented 3 years ago

The Colab crash log: [image]

taylanbil commented 3 years ago

Hard to debug without a repro, but you have a high batch size, which makes me think the host may be running out of memory.
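To get a feel for whether batch size alone could exhaust host memory, a rough back-of-the-envelope estimate may help. This is a sketch under stated assumptions, not a measurement: it assumes float32 inputs, counts only the input batches that DataLoader workers hold in flight (`prefetch_factor` batches per worker, PyTorch's default being 2), and ignores the model, labels, and Python overhead. The function names and all numbers in the usage lines are hypothetical; the actual batch size and input shape from the notebook are not stated in this thread.

```python
def batch_bytes(batch_size, sample_shape, bytes_per_element=4):
    """Bytes needed for one input batch (float32 assumed by default)."""
    elements = 1
    for dim in sample_shape:
        elements *= dim
    return batch_size * elements * bytes_per_element


def host_buffer_bytes(batch_size, sample_shape, num_workers,
                      prefetch_factor=2, nprocs=1):
    """Rough lower bound on host RAM held by prefetched input batches."""
    # Each worker keeps up to `prefetch_factor` batches ready; with no
    # workers, assume at least one batch is materialized at a time.
    in_flight = max(1, num_workers) * prefetch_factor
    return nprocs * in_flight * batch_bytes(batch_size, sample_shape)


# Hypothetical numbers: batches of 512 RGB images at 224x224.
one_batch = batch_bytes(512, (3, 224, 224))
total = host_buffer_bytes(512, (3, 224, 224), num_workers=4, nprocs=8)
print(f"one batch: {one_batch / 2**20:.0f} MiB")
print(f"prefetched across 8 processes: {total / 2**30:.1f} GiB")
```

Under these assumptions the prefetched inputs alone can reach the same order of magnitude as the RAM of a typical hosted notebook, which is why lowering the batch size (or the number of workers) is a cheap first experiment.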

sugangsky commented 3 years ago

After testing, I am sure the exception is raised at this line: [image]

taylanbil commented 3 years ago

Are you using the latest nightly version?

Can you try lowering your batch size on 1 core?

ademyanchuk commented 3 years ago

I would like to add a bit. I can't reproduce the tutorial example Colab notebook, "PyTorch on Cloud TPUs: MultiCore Training AlexNet on Fashion MNIST". With the stable release and 8 cores, I get 600 seconds for training and 30 seconds for evaluation. The nightly build crashes with the same "ProcessExitedException".

I can also see this stack trace, perhaps from C++:

Failed to connect to client mesh master: 457eb84c329c:60271 Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/meshservice.cc:316 : Check failed: impl->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))

If it is worth mentioning, I am using Colab Pro.

taylanbil commented 3 years ago

@ademyanchuk Your issue, even though it shows the same symptom, is probably different. We think we have resolved that one; could you try again?

ademyanchuk commented 3 years ago

Hi @taylanbil, thank you for the quick reply. I quickly checked with the official ResNet18 example (which I am using as a template for my project). It now works without any issues on the "nightly" build. Thanks again. Cheers))

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.