Some process groups are initialized with their ranks in NumPy arrays and others with plain lists. The NumPy integer types cause an error when the profiler tries to serialize dist_info to JSON:
[rank0]: prof.step()
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 727, in step
[rank0]: self._transit_action(prev_action, self.current_action)
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 744, in _transit_action
[rank0]: action()
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/site-packages/torch/profiler/profiler.py", line 177, in start_trace
[rank0]: self.add_metadata_json("distributedInfo", json.dumps(dist_info))
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/__init__.py", line 231, in dumps
[rank0]: return _default_encoder.encode(obj)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 200, in encode
[rank0]: chunks = self.iterencode(o, _one_shot=True)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 258, in iterencode
[rank0]: return _iterencode(o, 0)
[rank0]: ^^^^^^^^^^^^^^^^^
[rank0]: File "/soft/applications/conda/2024-04-29/mconda3/lib/python3.11/json/encoder.py", line 180, in default
[rank0]: raise TypeError(f'Object of type {o.__class__.__name__} '
[rank0]: TypeError: Object of type int64 is not JSON serializable
You can inspect the data type of the process group ranks with dist.distributed_c10d._get_all_pg_configs():
    import torch.distributed as dist

    # Print the type of the first rank in each process group config
    configs = dist.distributed_c10d._get_all_pg_configs()
    for config in configs:
        print(f"{config['ranks'][0].__class__.__name__}")
I could get around the issue by modifying nanotron.distributed.new_group and adding
    if isinstance(ranks, np.ndarray):
        ranks = ranks.tolist()
However, I am not sure this is the right long-term fix. Looking around, I see that ParallelContext.create_new_group has a type hint of np.ndarray for the ranks but sometimes receives a list as well. Ideally the ranks should always arrive in the same format and respect the type hints. Passing regular Python integers to torch.distributed.new_group is probably the better idea, so that things like the profiler do not break; the Torch documentation also gives the type of ranks as list[int]. An alternative to the modification proposed above would be to add an assert here that the input is a list, and then change the callers so that only lists are passed to nanotron.distributed.new_group.
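To make the "always pass list[int]" idea concrete, here is a minimal sketch of a normalizing helper. The name normalize_ranks is hypothetical and not part of nanotron or PyTorch, and FakeInt64 is only a stand-in for a NumPy integer scalar so the snippet runs without NumPy installed. It assumes the ranks are a flat (1-D) sequence.

```python
import json
import operator
from typing import Iterable, List


def normalize_ranks(ranks: Iterable) -> List[int]:
    """Return ranks as a list of built-in ints (hypothetical helper)."""
    # operator.index accepts any integer-like object that implements
    # __index__ (NumPy integer scalars do) and raises TypeError for
    # floats, so a non-integer rank fails loudly instead of being truncated.
    return [operator.index(r) for r in ranks]


class FakeInt64:
    """Stand-in for an integer-like scalar that json cannot serialize."""

    def __init__(self, value: int) -> None:
        self.value = value

    def __index__(self) -> int:
        return self.value


if __name__ == "__main__":
    raw = [FakeInt64(0), FakeInt64(1)]
    try:
        json.dumps({"ranks": raw})
    except TypeError as err:
        print(f"before: {err}")
    print("after:", json.dumps({"ranks": normalize_ranks(raw)}))
```

Calling something like this at the top of nanotron.distributed.new_group would handle both lists and 1-D arrays in one place, whereas the assert variant would push the conversion out to every caller.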
Python: 3.11.8
PyTorch: 2.3.0
nanotron: up to date main branch
Config: created by examples/bench_llama_7b.py with profiler enabled: