Closed TY-cc closed 3 weeks ago
And I use python nodectl.py -c 2pc.json start --node_id node:1 &> node1.log &
to start node0 and node1. It reports:
[2024-09-26 14:29:27,927] [MainProcess] Traceback (most recent call last):
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 326, in Run
ret_objs = fn(self, *args, **kwargs)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 589, in builtin_spu_run
rt.run(spu_exec)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/api.py", line 44, in run
return self._vm.Run(executable.SerializeToString())
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:427] Get data timeout, key=root-0:P2P-10510:1->0
stacktrace:
#0 yacl::link::Context::RecvInternal()+0x773196ae946b
#1 yacl::link::Context::Recv()+0x773196aeab96
#2 spu::mpc::cheetah::CheetahDot::Impl::doDotOLESenderRecvStep()+0x773195e4050e
#3 spu::mpc::cheetah::CheetahDot::Impl::doDotOLE()+0x773195e457ac
#4 spu::mpc::cheetah::CheetahDot::Impl::DotOLE()+0x773195e45da1
#5 spu::mpc::cheetah::CheetahDot::DotOLE()+0x773195e45ef2
#6 std::_Function_handler<>::_M_invoke()+0x773195e11d68
#7 std::__future_base::_State_baseV2::_M_do_set()+0x773195c43b52
#8 (unknown)+0x7731d1699ee8
It might be a node crash due to OOM. You can try with a smaller data volume or use a server with more RAM.
Why can it run successfully the first time?
If you interrupt the training program, the runtime keeps running. Please wait for the runtime to finish the training tasks and then run your training program again. Alternatively, kill and relaunch the runtime, then run your training program.
It reports a new problem:
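The kill/relaunch step can be scripted. The following is only a minimal sketch, assuming the nodes are started with the nodectl.py command shown earlier in this thread; the helper names (node_command, launch_node, stop_node) and the default file names are hypothetical and not part of SPU.

```python
import subprocess
import sys


def node_command(node_id, config="2pc.json", script="nodectl.py"):
    """Build the argv for one node, mirroring the command used in this thread."""
    return [sys.executable, script, "-c", config, "start", "--node_id", node_id]


def launch_node(node_id, log_path, config="2pc.json"):
    """Start one node in the background, sending stdout and stderr to log_path."""
    with open(log_path, "w") as log:
        # The child process keeps its own copy of the log file descriptor.
        return subprocess.Popen(
            node_command(node_id, config),
            stdout=log,
            stderr=subprocess.STDOUT,
        )


def stop_node(proc, timeout=10.0):
    """Ask a node process to exit; force-kill it if it does not stop in time."""
    proc.terminate()
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    return proc.returncode
```

After an interrupted training run, calling stop_node on each node process and then launch_node again gives the next run a fresh runtime. This only works for processes started from the same script; nodes launched from a shell have to be stopped from that shell.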
Traceback (most recent call last):
File "/home/whty/CC/cc_test/spu_nn_examples.py", line 162, in <module>
ppd.init(conf["nodes"], conf["devices"])
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 1178, in init
_CONTEXT = HostContext(nodes_def, devices_def)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 1098, in __init__
self.devices[name] = SPU(
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 1013, in __init__
results = [future.result() for future in futures]
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 1013, in <listcomp>
results = [future.result() for future in futures]
File "/home/whty/anaconda3/envs/spu/lib/python3.10/concurrent/futures/_base.py", line 451, in result
return self.__get_result()
File "/home/whty/anaconda3/envs/spu/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/home/whty/anaconda3/envs/spu/lib/python3.10/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 247, in run
return self._call(self._stub.Run, fn, *args, **kwargs)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 236, in _call
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 214, in rebuild_messages
return b''.join([msg for msg in msgs])
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 214, in <listcomp>
return b''.join([msg for msg in msgs])
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/spu/utils/distributed_impl.py", line 236, in <genexpr>
rsp_data = rebuild_messages(rsp_itr.data for rsp_itr in rsp_gen)
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/grpc/_channel.py", line 543, in __next__
return self._next()
File "/home/whty/anaconda3/envs/spu/lib/python3.10/site-packages/grpc/_channel.py", line 969, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:61320: Failed to connect to remote host: connect: Connection refused (111)"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:127.0.0.1:61320: Failed to connect to remote host: connect: Connection refused (111)", grpc_status:14, created_time:"2024-09-26T16:41:19.208832718+08:00"}"
Make sure you have all nodes up.
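One way to check this before calling ppd.init is to try a TCP connection to each node address. The following is a minimal sketch using only the Python standard library; it assumes conf["nodes"] maps node ids to "host:port" strings (an assumption about the config layout), and the helper names are hypothetical.

```python
import socket


def parse_address(addr):
    """Split a 'host:port' string into (host, port)."""
    host, _, port = addr.rpartition(":")
    return host, int(port)


def is_listening(addr, timeout=1.0):
    """Return True if a TCP connection to addr succeeds within timeout."""
    host, port = parse_address(addr)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def unreachable_nodes(nodes, timeout=1.0):
    """Return the ids of nodes whose address does not accept connections."""
    return [
        node_id
        for node_id, addr in nodes.items()
        if not is_listening(addr, timeout)
    ]
```

For example, unreachable_nodes(conf["nodes"]) returning a non-empty list before ppd.init would point at the same "Connection refused" condition as the traceback above, without waiting for the gRPC call to fail.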
Yes, the nodes had not started. Thanks for your reply!
Another question: why is the AUC different?
In plaintext, auc=0.9927939731411726
In SPU, auc=0.9954143465443825
But the two are equal in the example code.
Issue Type
Support
Modules Involved
SPU runtime
Have you reproduced the bug with SPU HEAD?
Yes
Have you searched existing issues?
Yes
SPU Version
0.9.3
OS Platform and Distribution
Linux Ubuntu 22.04
Python Version
Python 3.10.4
Compiler Version
GCC 11.4
Current Behavior?
I am using SPU for neural network training. The first time I ran the program, it completed normally. The second time, I forcibly interrupted it before training finished. The third time I ran it, it reported a RuntimeError. The code is as follows:
Standalone code to reproduce the issue
Relevant log output