tsung-wei-huang / DtCraft

A High-performance Cluster Computing Engine
https://tsung-wei-huang.github.io/DtCraft/
MIT License

Ver 0.2.2 error in single-host distributed mode #10

Open frank-y-liu opened 6 years ago

frank-y-liu commented 6 years ago

Updated to ver. 0.2.2. Local mode works fine, but I encountered the following error in single-host distributed mode. Error message in the master:

frankliu@arlz010:~/Projects/DtCraft$ sudo ./bin/dtc-master           
I  6016 2018-05-29 20:22:09 master.cpp:200] Master @127.0.0.1 [agent:9909|graph:9910|webui:9912]                                          
I  6016 2018-05-29 20:22:42 master.cpp:271] Agent 0 connected @127.0.0.1 [cpu:0|mem:135081242624|disk:17340207104]                        
I  6016 2018-05-29 20:23:20 master.cpp:300] Graph 0 connected @arlz010 [vertex:4|stream:6|container:4]                                    
W  6016 2018-05-29 20:23:20 master.cpp:303] Graph 0 doesn't fit with available resources                                                  
I  6016 2018-05-29 20:23:20 master.cpp:142] Graph 0 is removed from the master                                                            
I  6016 2018-05-29 20:23:30 master.cpp:300] Graph 1 connected @arlz010 [vertex:2|stream:2|container:2]                                    
W  6016 2018-05-29 20:23:30 master.cpp:303] Graph 1 doesn't fit with available resources                                                  
I  6016 2018-05-29 20:23:30 master.cpp:142] Graph 1 is removed from the master       

Error message in the submission window:

frankliu@arlz010:~/Projects/DtCraft$ sbin/submit.sh --master=127.0.0.1 /home/frankliu/Projects/DtCraft/example/hello_world                
I  1920 2018-05-29 20:26:33 executor.cpp:159] Executor @arlz010 [stdout:38211|stderr:38045]                                               
I  1920 2018-05-29 20:26:33 executor.cpp:161] Submit graph to master @127.0.0.1:9910                                                      
I 55040 2018-05-29 20:26:33 executor.cpp:173] Solution received      
[Graph 4]                         
+----+-----+------+-----------+-------------------+                  
|Task|Agent|Status|Elapsed (s)|Memory (peak/limit)|                  
+----+-----+------+-----------+-------------------+                  
Graph finished with 1 error(s): Resource request doesn't fit in cluster

OS: Ubuntu 17.10

frank-y-liu commented 6 years ago

Addendum: the master branch works fine.

tsung-wei-huang commented 6 years ago

Yes, please try the master branch. Note that we have added support for cgroup, so you now need elevated privileges to launch the master and agents. Please follow the instructions here to start the cluster.

It looks like no CPU is configured for the agent (the master log reports cpu:0), which is why you get the "Resource request doesn't fit in cluster" error. This should be resolved in the master branch.
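
To spot this condition in a longer master log, the "Agent N connected" lines can be scanned for agents that report cpu:0. The following is a minimal sketch, not part of DtCraft: the regular expression is inferred only from the log lines quoted in this thread, and the helper names `parse_agent_line` and `agents_without_cpu` are hypothetical.

```python
import re

# Pattern inferred from the "Agent N connected" lines quoted in this thread
# (an assumption about the log format, not a documented DtCraft interface).
AGENT_LINE = re.compile(
    r"Agent (?P<id>\d+) connected @(?P<host>\S+) "
    r"\[cpu:(?P<cpu>\d+)\|mem:(?P<mem>\d+)\|disk:(?P<disk>\d+)\]"
)


def parse_agent_line(line):
    """Return the resources an agent reported as a dict, or None if the line does not match."""
    m = AGENT_LINE.search(line)
    if m is None:
        return None
    return {
        "id": int(m["id"]),
        "host": m["host"],
        "cpu": int(m["cpu"]),
        "mem": int(m["mem"]),
        "disk": int(m["disk"]),
    }


def agents_without_cpu(log_text):
    """Return every agent record in the log text whose reported cpu value is 0."""
    records = (parse_agent_line(line) for line in log_text.splitlines())
    return [r for r in records if r is not None and r["cpu"] == 0]
```

Running `agents_without_cpu` on the master output above would flag Agent 0, which matches the "doesn't fit with available resources" warnings that follow it.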

frank-y-liu commented 6 years ago

Thanks for getting back on this. I tried the master branch in the upstream repo, but I still have the same problem. Error message from dtc-master:

I 14208 2018-05-30 14:13:06 master.cpp:200] Master @127.0.0.1 [agent:9909|graph:9910|webui:9912]                      
I 14208 2018-05-30 14:13:50 master.cpp:271] Agent 0 connected @127.0.0.1 [cpu:0|mem:135081242624|disk:17343295488]    
I 14208 2018-05-30 14:14:24 master.cpp:300] Graph 0 connected @arlz009 [vertex:2|stream:2|container:2]                
W 14208 2018-05-30 14:14:24 master.cpp:303] Graph 0 doesn't fit with available resources                              
I 14208 2018-05-30 14:14:24 master.cpp:142] Graph 0 is removed from the master                 

Any suggestions on how to turn on debug output?

frank-y-liu commented 6 years ago

Adding the log messages from dtc-agent:

I 46976 2018-05-30 14:13:50 agent.cpp:135] Agent @127.0.0.1 [frontier:9913]                                           
I 46976 2018-05-30 14:13:50 agent.cpp:138] cg-subsys.memory "/sys/fs/cgroup/memory/dtc" [limit:135081242624]          
I 46976 2018-05-30 14:13:50 agent.cpp:139] cg-subsys.cpuset "/sys/fs/cgroup/cpuset/dtc" [cpus:0]                      
I 46976 2018-05-30 14:13:50 agent.cpp:140] cg-subsys.blkio "/sys/fs/cgroup/blkio/dtc" [weight:500]          

Does this mean the dtc-agent didn't get any CPUs allocated?

tsung-wei-huang commented 6 years ago

Could you please cat /sys/fs/cgroup/cpuset/dtc/cpuset.cpus and let me know what you have?
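
For reference, the kernel writes cpuset.cpus as a comma-separated list of CPU numbers and ranges (for example "0-3,8"), and an empty file means no CPUs are assigned to that cgroup. Below is a minimal sketch for reading and counting them. It assumes the plain list format only, and the helper names `parse_cpu_list` and `cpus_in_cgroup` are hypothetical rather than anything DtCraft provides.

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a CPU list such as "0-3,8" into [0, 1, 2, 3, 8]; empty input gives []."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or stray comma: nothing to add
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def cpus_in_cgroup(path="/sys/fs/cgroup/cpuset/dtc/cpuset.cpus"):
    """Read a cpuset.cpus file (default path taken from this thread) and return its CPU numbers."""
    return parse_cpu_list(Path(path).read_text())
```

Note that a file containing "0" means one CPU (CPU number 0), while an empty file means none. Whether the `cpus:0` in the agent log is a count or a CPU list is not clear from this thread, so the file contents are the more reliable signal.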

tsung-wei-huang commented 6 years ago

I have fixed a minor bug in the cgroup code that might have caused this problem. Please update to the latest master branch and try again. Let me know if the problem still exists.