intel / torch-ccl

oneCCL Bindings for PyTorch*

Enhancement: Secure Data Transmission for all_reduce in TDX-based Distributed ML Training #61

Open antchainmappic opened 8 months ago

antchainmappic commented 8 months ago

Dear oneCCL Team,

We are reaching out to request an enhancement to Intel oneCCL that targets secure data transmission for distributed machine learning (ML) training workloads. Specifically, we are looking for built-in encryption within oneCCL's all_reduce operation, which is critical for secure gradient sharing across nodes equipped with Intel Trust Domain Extensions (TDX).

Use Case: Our ML training workflows use PyTorch's Distributed Data Parallel (DDP) running on a cluster of TDX-enabled nodes, as sketched below. While TDX provides a robust isolated execution environment, securing the data exchanged during all_reduce operations between TDX machines is essential for maintaining the confidentiality of sensitive gradient information.
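For reference, here is a minimal sketch of this setup, assuming the standard oneccl_bindings_for_pytorch usage with rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE) supplied by the launcher such as mpirun or torchrun:

```python
# Minimal sketch of the workflow in question: PyTorch DDP over the oneCCL
# ("ccl") backend. Today, the gradient all_reduce issued during backward()
# travels between nodes without library-level encryption.
import torch
import torch.distributed as dist
import torch.nn as nn
import oneccl_bindings_for_pytorch  # noqa: F401  registers the "ccl" backend

def main():
    # RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT are expected to be set
    # by the launcher before this point.
    dist.init_process_group(backend="ccl")

    model = nn.Linear(1024, 1024)
    ddp_model = nn.parallel.DistributedDataParallel(model)

    loss = ddp_model(torch.randn(32, 1024)).sum()
    loss.backward()  # gradient buckets are all_reduced across nodes here

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The gradient buckets exchanged during that backward() pass are exactly the payloads we would like to see protected in flight.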

Requirement: The feature should enable encryption (preferably conforming to standard protocols such as TLS) of the data payloads communicated across nodes during all_reduce. Because the reduction itself must operate on plaintext gradient values at each step, encrypting the data at the application layer is impractical; the protection belongs in the library's transport layer. The goal is to ensure that in-flight data is protected, complementing the in-use protection that TDX already provides inside each trust domain.

Justification:

- Guards against interception of sensitive data during distributed training
- Transparently fortifies existing ML workflows without altering user code
- Helps maintain the security posture promised by TDX throughout the data lifecycle

We understand that performance is critical, so we suggest offering this as an optional toggle, with secure transmission enabled only when the user requests it. A hypothetical sketch of such a toggle follows.
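To make the opt-in shape concrete, below is one way such a toggle could be surfaced. None of the CCL_SECURE_TRANSPORT / CCL_TLS_* names exist in oneCCL today; they are placeholders for whatever interface the team prefers (environment variable, init-time argument, or config file).

```python
# HYPOTHETICAL: none of these knobs exist in oneCCL today. This only
# illustrates the opt-in behavior being requested.
import os
import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401

# Placeholder names for an opt-in secure transport:
os.environ["CCL_SECURE_TRANSPORT"] = "tls"             # hypothetical; off by default
os.environ["CCL_TLS_CERT_FILE"] = "/etc/ccl/node.crt"  # hypothetical per-node cert
os.environ["CCL_TLS_KEY_FILE"] = "/etc/ccl/node.key"   # hypothetical private key
os.environ["CCL_TLS_CA_FILE"] = "/etc/ccl/ca.crt"      # hypothetical CA for peer verification

# Everything below is unchanged user code: with the toggle unset, behavior
# (and performance) would be identical to today's plaintext transport.
dist.init_process_group(backend="ccl")
```

An environment-variable interface would sit naturally alongside existing knobs such as CCL_ATL_TRANSPORT and would require no changes to user training code.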

Looking forward to your thoughts on this proposal. Thanks for your commitment to advancing collective communications.

Best regards