hku-systems / vpipe


TypeError in module cpm, mp_size is set to default NoneType #4

Closed Hyaloid closed 1 year ago

Hyaloid commented 1 year ago

Hi, I'm trying to reproduce the cpm module. When I execute python -m launch --nnodes 1 --node_rank 3 --nproc_per_node 4 main_with_runtime.py --data_dir /usr/vpipe/cpm/data/miniimagenet/train --master_addr 172.20.21.6 --module medium_4 --checkpoint_dir output --partition medium_4/vpipe.json --sync_mode asp --distributed_backend gloo -b 2 --lr 0.000600 --lr_policy polynomial --weight-decay 0.000000 --epochs 20 --print-freq 100 --verbose 0 --num_ranks_in_server 4 --config_path medium_4/mp_conf.json, I get this error:

Traceback (most recent call last):
  File "main_with_runtime.py", line 671, in <module>
    main()
  File "main_with_runtime.py", line 274, in main
    world_size = sum(len(v) for v in configuration_maps['stage_to_rank_map'].values()) * mp_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Traceback (most recent call last):
  File "main_with_runtime.py", line 671, in <module>
    main()
  File "main_with_runtime.py", line 274, in main
    world_size = sum(len(v) for v in configuration_maps['stage_to_rank_map'].values()) * mp_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Traceback (most recent call last):
  File "main_with_runtime.py", line 671, in <module>
    main()
  File "main_with_runtime.py", line 274, in main
    world_size = sum(len(v) for v in configuration_maps['stage_to_rank_map'].values()) * mp_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Traceback (most recent call last):
  File "main_with_runtime.py", line 671, in <module>
    main()
  File "main_with_runtime.py", line 274, in main
    world_size = sum(len(v) for v in configuration_maps['stage_to_rank_map'].values()) * mp_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/vpipe/cpm/cpm/launch.py", line 173, in <module>
    main()
  File "/usr/vpipe/cpm/cpm/launch.py", line 169, in main
    cmd=cmd)

It seems that mp_conf.json should contain 4 keys, including mp_size and stage_to_depth_map, but the original mp_conf.json has only 2 keys (module_to_stage_map, stage_to_rank_map). Should I add the key mp_size to mp_conf.json, or do something else?
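
For reference, here is a minimal sketch of a check that would surface the missing keys before the multiplication fails. The four key names come from this thread. The helper `missing_keys` and the sample config contents are hypothetical and are not taken from the vpipe code base.

```python
import json

# Key names mentioned in this thread; the full schema is an assumption.
REQUIRED_KEYS = (
    "module_to_stage_map",
    "stage_to_rank_map",
    "mp_size",
    "stage_to_depth_map",
)


def missing_keys(config):
    """Return the required keys that are absent from config or set to None."""
    return [key for key in REQUIRED_KEYS if config.get(key) is None]


# Hypothetical config with only the two keys present in the original file.
config = json.loads(
    '{"module_to_stage_map": [0, 1], "stage_to_rank_map": {"0": [0], "1": [1]}}'
)

print(missing_keys(config))  # ['mp_size', 'stage_to_depth_map']
```

Running a check like this right after loading the file would report the absent keys by name instead of failing later with a TypeError.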

Any help would be much appreciated.

SimonZsx commented 1 year ago

Yes, cpm is an experimental project to extend VPipe to 3D parallelism. You can refer to

https://github.com/hku-systems/vpipe/blob/main/cpm/cpm/large_8/mp_conf.json

Not all configuration files are well tested.

Hyaloid commented 1 year ago

@SimonZsx Thanks a lot! And what does mp_size represent?

SimonZsx commented 1 year ago

Hi, it's about the model/tensor parallel dimension. You can refer to the Megatron-LM paper.
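
To illustrate how mp_size enters the failing line, here is a small sketch that mirrors the expression from the traceback: the number of ranks across all pipeline stages multiplied by mp_size. The function name `world_size`, the explicit None check, and the sample `stage_to_rank_map` values are assumptions for illustration, not the project's actual code or configuration.

```python
def world_size(stage_to_rank_map, mp_size):
    """Total process count: ranks across all pipeline stages times mp_size."""
    if mp_size is None:
        # Fail with a readable message instead of "int * NoneType".
        raise ValueError("mp_size is missing from the config")
    pipeline_ranks = sum(len(ranks) for ranks in stage_to_rank_map.values())
    return pipeline_ranks * mp_size


# Hypothetical map: 4 pipeline stages with one rank each.
stage_to_rank_map = {"0": [0], "1": [1], "2": [2], "3": [3]}

print(world_size(stage_to_rank_map, 1))  # 4
print(world_size(stage_to_rank_map, 2))  # 8
```

With mp_size set to 1 the world size equals the number of pipeline ranks, and each increase in mp_size multiplies the number of processes accordingly.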

SimonZsx commented 1 year ago

The issue is fixed by updating the README, so I will close it.