rapidsai / wholegraph

WholeGraph - large scale Graph Neural Networks
https://docs.rapids.ai/api/cugraph/stable/wholegraph/
Apache License 2.0

Add horovodrun launch agent for Wholegraph #200

Closed Tomcli closed 3 months ago

Tomcli commented 4 months ago

We have many users running the Kubeflow training operator who are also interested in using Wholegraph. Many of our MPIJob users still use horovodrun as the startup command, so we want to add horovodrun as one of the Wholegraph launch agents so that they can run Wholegraph on top of Kubeflow.

The new function is similar to the existing MPI launch agent: the horovod library is imported only on demand. The horovod.tensorflow module is used solely for Horovod initialization because of an issue with horovod.torch (see https://github.com/horovod/horovod/issues/4009). After Horovod is initialized, the program continues to run normal PyTorch code within each rank, just as it does with the mpi4py launcher.
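To illustrate the on-demand import described above, here is a minimal sketch. It is not the code in this PR: the names `launch_with_horovod` and `RankInfo` are hypothetical, and the optional `hvd` argument exists only so the logic can be exercised without Horovod installed. The only Horovod calls assumed are `init()`, `rank()`, `size()`, `local_rank()`, and `local_size()` from `horovod.tensorflow`.

```python
import importlib
from typing import NamedTuple


class RankInfo(NamedTuple):
    """Rank layout reported by Horovod (hypothetical helper type)."""
    rank: int
    size: int
    local_rank: int
    local_size: int


def launch_with_horovod(hvd=None) -> RankInfo:
    """Initialize Horovod and return this process's rank layout.

    Hypothetical sketch: horovod.tensorflow is imported only when this
    function is called, so users of other launch agents do not need
    Horovod installed. Passing `hvd` explicitly is an assumption made
    here to allow testing with a stand-in module.
    """
    if hvd is None:
        # Deferred import: resolved only when this launcher is selected.
        hvd = importlib.import_module("horovod.tensorflow")
    hvd.init()
    return RankInfo(hvd.rank(), hvd.size(), hvd.local_rank(), hvd.local_size())
```

A launcher built this way would hand the returned ranks to the rest of the Wholegraph setup, after which each rank runs ordinary PyTorch code.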

fixes #201

copy-pr-bot[bot] commented 4 months ago

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


Tomcli commented 4 months ago

/label feature request

BradReesWork commented 3 months ago

/okay to test

linhu-nv commented 3 months ago

Hi @Tomcli, it seems there are some code style issues in your code, which are causing CI to fail. We recommend running the pre-commit tool to check code style before committing, as described here: https://docs.rapids.ai/api/cuspatial/stable/developer_guide/contributing_guide/ . Could you check with pre-commit and then commit again? Thanks. If that is too much trouble, I can open a PR and commit your code myself.

Tomcli commented 3 months ago

Thank you @linhu-nv for providing the link to the contributing guide. I fixed the license check and verified with my local pre-commit check.

linhu-nv commented 3 months ago

No problem, @Tomcli. @BradReesWork, could you please kick off the CI again? Thanks.

BradReesWork commented 3 months ago

/okay to test

BradReesWork commented 3 months ago

/merge