msr-fiddle / pipedream


Multi-machine distribution problem #34

Open · ADAM-CT opened 4 years ago

ADAM-CT commented 4 years ago

Setup: server1 with 8 GPUs, server2 with 8 GPUs. PyTorch's built-in distributed training succeeds on this setup, but PipeDream fails with the following error:

Finished initializing process group; backend: gloo, rank: 14, world_size: 16
Replicating stage: ranks=2, module_size=3151872.000
Send ranks: {'out4': [15], 'target': [15]}
Receive ranks: {'out3': [12], 'target': [12]}
Setting up process groups for broadcasts...
Letting in 1 warm-up minibatches
Running training for 10008 minibatches
Traceback (most recent call last):
  File "main_with_runtime.py", line 578, in <module>
    main()
  File "main_with_runtime.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime.py", line 355, in train
    r.run_forward()
  File "../runtime.py", line 498, in run_forward
    self.receive_tensors_forward()
  File "../runtime.py", line 426, in receive_tensors_forward
    backward=False)
  File "../communication.py", line 592, in recv
    index = self.get_messaging_index(sending=False)
  File "../communication.py", line 496, in get_messaging_index
    self.fwd_messaging_scheduling_row][
IndexError: list index out of range
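
The crash is an out-of-range lookup into the runtime's forward messaging schedule, which appears to be indexed by a rank's position within its (possibly replicated) stage. A toy illustration of that failure mode, with hypothetical names rather than PipeDream's actual data structures:

    # Toy illustration only: a forward messaging schedule indexed by
    # rank-within-stage. If the schedule was built for fewer replicas than
    # the config assigns to the neighboring stage, the lookup overruns the list.
    fwd_messaging_schedule = [[0], [1]]  # rows built for 2 replicas
    rank_within_stage = 2                # a 3rd replica tries to look itself up

    row = fwd_messaging_schedule[rank_within_stage]  # IndexError: list index out of range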

deepakn94 commented 4 years ago

Not sure I have enough information here to debug.

What model are you running? How did you generate the model if it's not one of the standard models we provide in this repository? What does your configuration file look like?

ADAM-CT commented 4 years ago

I used vgg16.

/runtime/image_classification/models/vgg16/gpus=16/hybrid_conf.json:

{
    "module_to_stage_map": [0, 1, 2, 3, 4, 5, 5],
    "stage_to_rank_map": {
        "0": [0, 1, 2, 3, 4, 5, 6, 7],
        "1": [8, 9],
        "2": [10, 11],
        "3": [12],
        "4": [13, 14],
        "5": [15]
    }
}
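
The basic consistency of such a file can be sanity-checked mechanically. A minimal sketch (my own ad-hoc helper, not part of the repository), assuming only that every stage in module_to_stage_map needs an entry in stage_to_rank_map and that the rank lists should tile 0..world_size-1:

    import json

    def check_hybrid_conf(path, world_size):
        """Sanity-check a hybrid config file (ad-hoc helper, not from the repo)."""
        with open(path) as f:
            conf = json.load(f)
        stages = set(conf["module_to_stage_map"])
        assert stages == {int(s) for s in conf["stage_to_rank_map"]}, \
            "every stage must have a rank list"
        ranks = sorted(r for rs in conf["stage_to_rank_map"].values() for r in rs)
        assert ranks == list(range(world_size)), \
            "rank lists must cover 0..world_size-1 exactly once"
        return conf

    check_hybrid_conf("hybrid_conf.json", world_size=16)  # path is illustrative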

deepakn94 commented 4 years ago

Hmm, that hybrid file seems different from what's checked into this repository:

{
    "module_to_stage_map": [0, 1, 1],
    "stage_to_rank_map": {
        "0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
        "1": [15]
    },
    "stage_to_depth_map": {
        "0": 1,
        "1": 0
    }
}

ADAM-CT commented 4 years ago

I executed the code in strict accordance with the README:

step1:

        CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir ../../data/

step2:

        python optimizer_graph_hierarchical.py \
            -f ../profiler/image_classification/profiles/vgg16/graph.txt \
            -n 8 2 \
            --activation_compression_ratio 1 \
            -o vgg16_partitioned \
            -b 4294967296 1879048192
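
In the hierarchical optimizer invocation above, the two values after -n and the two values after -b pair up level by level; under that (assumed) reading, the command describes 8 GPUs per server times 2 servers, i.e. the 16 ranks the later steps expect:

    import math

    # Assumed pairing of the hierarchical optimizer's arguments, level by level;
    # the variable names here are my own, not the script's.
    workers_per_level = [8, 2]                 # -n: 8 GPUs per server, 2 servers
    bandwidths = [4294967296, 1879048192]      # -b: bytes/sec within / across servers

    total_ranks = math.prod(workers_per_level)
    print(total_ranks)  # 16 -> matches the gpus=16 output files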

step3:

        python convert_graph_to_model.py \
            -f vgg16_partitioned/gpus=16.txt \
            -n VGG16Partitioned \
            -a vgg16 \
            -o ../runtime/image_classification/models/vgg16/gpus=16 \
            --stage_to_num_ranks 0:8,1:2,2:4,3:1,4:1
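
For reference, the --stage_to_num_ranks string corresponds directly to the stage_to_rank_map in the generated file shown below; a minimal re-derivation (my own sketch, assuming ranks are handed out to stages in order):

    def stage_to_rank_map(spec):
        """Expand a '0:8,1:2,...' spec into contiguous rank lists (illustrative sketch)."""
        mapping, next_rank = {}, 0
        for item in spec.split(","):
            stage, num = item.split(":")
            mapping[stage] = list(range(next_rank, next_rank + int(num)))
            next_rank += int(num)
        return mapping

    print(stage_to_rank_map("0:8,1:2,2:4,3:1,4:1"))
    # {'0': [0, 1, 2, 3, 4, 5, 6, 7], '1': [8, 9],
    #  '2': [10, 11, 12, 13], '3': [14], '4': [15]}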

I checked the generated hybrid config file again and again:

{ "module_to_stage_map": [0, 1, 2, 3, 4, 4], "stage_to_rank_map": {"0": [0, 1, 2, 3, 4, 5, 6, 7], "1": [8, 9], "2": [10, 11, 12, 13], "3": [14], "4": [15]} }

I don't know why it's different from yours. But I also checked alexnet's gpus=16/hybrid_conf.json, and there was no "stage_to_depth_map" there either:

{
    "module_to_stage_map": [0, 1, 1],
    "stage_to_rank_map": {
        "0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
        "1": [15]
    }
}
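
In deepakn94's snippet above, the depth values look like each stage's distance from the final stage (stage 0 → 1, stage 1 → 0). If that reading is right (an assumption on my part, not confirmed project behavior), the missing map can be derived from module_to_stage_map:

    def stage_to_depth_map(module_to_stage_map):
        """Assumed semantics: depth = number of stages after this one."""
        num_stages = max(module_to_stage_map) + 1
        return {str(s): num_stages - 1 - s for s in range(num_stages)}

    print(stage_to_depth_map([0, 1, 1]))
    # {'0': 1, '1': 0}  -- matches the values in the checked-in example above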

siddharth9820 commented 3 years ago

Any updates on this? I can't find a "stage_to_depth_map" myself either.

lllukehuang commented 2 years ago

Same question here; I only find "module_to_stage_map" and "stage_to_rank_map" in the config files.