sugangsky closed this issue 3 years ago
Any way for us to reproduce this? Seems it would not run on our side due to dataset not being present.
Does this happen with 1 core? `xmp.spawn(map_fn, args=(params,), nprocs=1, start_method='fork')`
If I set nprocs=1 and num_workers=1, the Colab runtime crashes and restarts.
The Colab crash log
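A minimal sketch of the single-core run suggested above, with a wrapper that prints the child's own traceback before it exits, so the failure is not reduced to a bare ProcessExitedException. The `map_fn` body, the `run_logged` wrapper, `main`, and the `params` dict are hypothetical stand-ins, not the code from the notebook; only the `xmp.spawn` call is taken from this thread.

```python
import traceback


def map_fn(index, params):
    """Hypothetical stand-in for the notebook's per-process training function."""
    if params["batch_size"] <= 0:
        raise ValueError("batch_size must be positive")
    return index


def run_logged(index, params):
    """Call map_fn and print the full traceback inside the child before re-raising."""
    try:
        return map_fn(index, params)
    except Exception:
        traceback.print_exc()
        raise


def main(params):
    # Imported lazily so this file can still be loaded on a machine without torch_xla.
    import torch_xla.distributed.xla_multiprocessing as xmp

    # Same call as suggested in the thread, with the logging wrapper swapped in.
    xmp.spawn(run_logged, args=(params,), nprocs=1, start_method="fork")
```

On a TPU runtime this would be started with something like `main({"batch_size": 8})`. Note that a hard crash of the runtime itself (for example, the host being killed for running out of memory) would not produce a Python traceback, so this only helps when the failure is a Python-level exception.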
Hard to debug without a repro, but you have a high batch size, which makes me think the host may be running out of memory.
After testing, I am sure the exception is raised at this line.
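To make the memory concern concrete, here is a rough back-of-the-envelope estimate of how much host RAM the input batches alone could occupy. It is only a sketch under stated assumptions: the function name and all parameters are hypothetical, it counts only input tensors (not labels, the model, or the dataset held in memory), and it assumes each loader worker keeps `prefetch_factor` batches in flight, which matches the PyTorch `DataLoader` default of 2 per worker.

```python
def estimate_host_batch_bytes(batch_size, sample_shape, bytes_per_element=4,
                              num_workers=1, prefetch_factor=2, nprocs=1):
    """Rough number of bytes of input batches held in host RAM at one time.

    Hypothetical helper: assumes float32 inputs by default and that every
    process runs its own loader with num_workers workers.
    """
    elements = batch_size
    for dim in sample_shape:
        elements *= dim
    # Each worker may hold prefetch_factor batches; treat 0 workers as 1 loader.
    batches_in_flight = max(1, num_workers) * prefetch_factor
    return elements * bytes_per_element * batches_in_flight * nprocs


# Example with made-up values: compare one core against eight cores.
one_core = estimate_host_batch_bytes(1024, (3, 224, 224), nprocs=1)
eight_cores = estimate_host_batch_bytes(1024, (3, 224, 224), nprocs=8)
print(f"1 core: {one_core / 2**30:.2f} GiB, 8 cores: {eight_cores / 2**30:.2f} GiB")
```

The estimate scales linearly with batch size, so halving the batch size halves it. If the number is anywhere near the RAM available to the Colab host, lowering the batch size is a cheap first experiment.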
Are you using the latest nightly version?
Can you try lowering your batch size on 1 core?
I would like to add a bit. I can't reproduce the tutorial Colab notebook "PyTorch on Cloud TPUs: MultiCore Training AlexNet on Fashion MNIST". The stable release with 8 cores gives me 600 seconds of training and 30 seconds of evaluation, and the "nightly build" crashes with the same "ProcessExitedException".
I can also see this stack trace, perhaps from C++:
Failed to connect to client mesh master: 457eb84c329c:60271 Exception in device=TPU:7: tensorflow/compiler/xla/xla_client/meshservice.cc:316 : Check failed: impl->channel->WaitForConnected( std::chrono::system_clock::now() + std::chrono::seconds(connect_wait_seconds))
If it is worth mentioning, I am using Colab Pro.
@ademyanchuk your issue, even though it has the same symptom, is probably different. We think we have resolved that one; could you try again?
Hi @taylanbil, thank you for the quick reply. I quickly checked the official ResNet18 example (which I am using as a template for my project). It now works without any issues on the "nightly" build. Thanks again. Cheers))
❓ Questions and Help
This is my project on Colab: [here](https://colab.research.google.com/drive/1mgtHa_qPtUr2jSDk1sXK0bXpKEmI51Lf?usp=sharing). It finished with ProcessExitedException, and its error message is not clear.
@taylanbil