ADAM-CT opened this issue 4 years ago (status: Open)
Not sure I have enough information here to debug.
What model are you running? How did you generate the model if it's not one of the standard models we provide in this repository? What does your configuration file look like?
I used vgg16
/runtime/image_classification/models/vgg16/gpus=16/hybrid_conf.json:
{
    "module_to_stage_map": [0, 1, 2, 3, 4, 5, 5],
    "stage_to_rank_map": {
        "0": [0, 1, 2, 3, 4, 5, 6, 7],
        "1": [8, 9],
        "2": [10, 11],
        "3": [12],
        "4": [13, 14],
        "5": [15]
    }
}
Hmm, that hybrid file seems different from what's checked in in this repository.
{
"module_to_stage_map": [0, 1, 1],
"stage_to_rank_map": {
"0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
"1": [15]
},
"stage_to_depth_map": {
"0": 1,
"1": 0
}
}
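For reference, the consistency between the two maps can be checked with a short script; this is a minimal sketch only (field names follow the JSON above, and the file path and world size are placeholders):

import json

# Minimal sketch: check that a hybrid_conf.json is internally consistent
# and covers the intended number of ranks. Path and world size are examples.
def check_hybrid_conf(path, world_size):
    with open(path) as f:
        conf = json.load(f)
    num_stages = max(conf["module_to_stage_map"]) + 1
    rank_map = conf["stage_to_rank_map"]
    assert len(rank_map) == num_stages, "stage count mismatch"
    all_ranks = sorted(r for ranks in rank_map.values() for r in ranks)
    assert all_ranks == list(range(world_size)), "ranks do not cover world_size"
    if "stage_to_depth_map" not in conf:
        print("warning: no stage_to_depth_map in", path)

check_hybrid_conf("hybrid_conf.json", world_size=16)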
CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir ../../data/
python optimizer_graph_hierarchical.py
-f ../profiler/image_classification/profiles/vgg16/graph.txt
-n 8 2
--activation_compression_ratio 1
-o vgg16_partitioned
-b 4294967296 1879048192
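(For what it's worth, the two -b values look like memory budgets in bytes; that reading is an assumption, but the numbers do correspond to round GiB figures:)

# Assumption: the -b arguments are memory budgets in bytes.
GiB = 1024 ** 3
print(4 * GiB)           # 4294967296, the first -b value
print(int(1.75 * GiB))   # 1879048192, the second -b value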
python convert_graph_to_model.py
-f vgg16_partitioned/gpus=16.txt
-n VGG16Partitioned
-a vgg16
-o ../runtime/image_classification/models/vgg16/gpus=16
--stage_to_num_ranks 0:8,1:2,2:4,3:1,4:1
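To see how the generated maps relate to --stage_to_num_ranks, here is a rough sketch that expands such a string into a stage_to_rank_map by handing out consecutive ranks; this only illustrates the expected output, not necessarily how convert_graph_to_model.py builds it:

# Illustration only: expand a --stage_to_num_ranks string into a
# stage_to_rank_map by assigning consecutive ranks to each stage.
def expand_stage_to_num_ranks(spec):
    stage_to_rank_map = {}
    next_rank = 0
    for entry in spec.split(","):
        stage, count = entry.split(":")
        stage_to_rank_map[stage] = list(range(next_rank, next_rank + int(count)))
        next_rank += int(count)
    return stage_to_rank_map

print(expand_stage_to_num_ranks("0:8,1:2,2:4,3:1,4:1"))
# {'0': [0, ..., 7], '1': [8, 9], '2': [10, 11, 12, 13], '3': [14], '4': [15]}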
I checked the generated configuration file again and again; the hybrid config file is:
{ "module_to_stage_map": [0, 1, 2, 3, 4, 4], "stage_to_rank_map": {"0": [0, 1, 2, 3, 4, 5, 6, 7], "1": [8, 9], "2": [10, 11, 12, 13], "3": [14], "4": [15]} }
I don't know why it's different from yours. But I also checked alexnet/gpus=16/hybrid_conf.json, and there was no "stage_to_depth_map" there either:
{ "module_to_stage_map": [0, 1, 1], "stage_to_rank_map": {"0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], "1": [15]} }
Any updates on this? I can't find a "stage_to_depth_map" myself either
Same question here, I only find "module_to_stage_map" and "stage_to_rank_map" in config files.
server1: 8 GPUs, server2: 8 GPUs. PyTorch's built-in distributed training succeeds, but running PipeDream produces an error. The error information is as follows:
Finished initializing process group; backend: gloo, rank: 14, world_size: 16
Replicating stage: ranks=2, module_size=3151872.000
Send ranks: {'out4': [15], 'target': [15]}
Receive ranks: {'out3': [12], 'target': [12]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 10008 minibatches
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 355, in train
    r.run_forward()
  File "../runtime.py", line 498, in run_forward
    self.receive_tensors_forward()
  File "../runtime.py", line 426, in receive_tensors_forward
    backward=False)
  File "../communication.py", line 592, in recv
    index = self.get_messaging_index(sending=False)
  File "../communication.py", line 496, in get_messaging_index
    self.fwd_messaging_scheduling_row][
IndexError: list index out of range
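One thing worth ruling out before digging into communication.py: that the config file actually loaded at runtime covers exactly world_size ranks, and how many replicas each stage has. A rough sanity-check sketch (the path and world size below are placeholders):

import json

# Rough sanity check: print per-stage replica counts and verify that the
# config's ranks match the launch world_size. Path/world size are examples.
def describe_config(conf_path, world_size):
    with open(conf_path) as f:
        conf = json.load(f)
    all_ranks = sorted(r for ranks in conf["stage_to_rank_map"].values() for r in ranks)
    assert all_ranks == list(range(world_size)), \
        "config ranks do not match world_size"
    for stage in sorted(conf["stage_to_rank_map"], key=int):
        ranks = conf["stage_to_rank_map"][stage]
        print(f"stage {stage}: {len(ranks)} replica(s), ranks {ranks}")

describe_config("hybrid_conf.json", world_size=16)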