microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Fix latest pytorch '_get_socket_with_port' import error #5654

Closed Yejing-Lai closed 1 week ago

Yejing-Lai commented 2 weeks ago

The latest PyTorch deleted the '_get_socket_with_port' API, replacing it with 'get_free_port'.

Fixes: #5603

Yejing-Lai commented 2 weeks ago

Hi @mrwyattii. Please kindly review~ Thanks!

delock commented 2 weeks ago

@mrwyattii FYI, this fixes a breakage caused by a PyTorch API change that affects DeepSpeed when running with PyTorch nightly. Thanks!

Yejing-Lai commented 2 weeks ago

Hi @mrwyattii. The test failure looks like an HTTP error rather than a problem with this change. Could you please rerun the CI? Thanks!

loadams commented 2 weeks ago

@delock - thanks, do you know what version this was added in, so we can know what the minimum pytorch version supported by this new code is?

adk9 commented 2 weeks ago

AFAICT, _get_socket_with_port was only recently removed from torch.distributed.elastic.agent.server.api, but get_free_port and get_socket_with_port have existed in torch.distributed.elastic.utils.distributed for a while, going back at least 3 years. So we shouldn't need to pin a minimum PyTorch version for this.

loadams commented 1 week ago

Sounds good, do you want to approve and we can get this merged @adk9 ?