tensorchord / envd

🏕️ Reproducible development environment
https://envd.tensorchord.ai/
Apache License 2.0

feat(distributed): Support distributed training debug #1355

Open gaocegege opened 1 year ago

gaocegege commented 1 year ago

Description

The critical user journey (CUJ) looks like:

envd run --image xx --replicas 20

This opens a single interactive shell. Any command the user types there runs in all replicas.

The STDOUT of every replica is then aggregated in the terminal.
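
To make the intended behaviour concrete, here is a minimal sketch of the fan-out and aggregation step. It is not envd code: the replicas are simulated as local subprocesses, and the names `run_on_replica`, `fan_out`, and the `REPLICA_ID` environment variable are all hypothetical. A real implementation would presumably execute the command inside each replica's container instead.

```python
from __future__ import annotations

import os
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_on_replica(replica_id: int, command: list[str]) -> tuple[int, str]:
    """Run `command` in one simulated replica and return (replica_id, stdout)."""
    # REPLICA_ID is a hypothetical variable used only to tell replicas apart.
    env = dict(os.environ, REPLICA_ID=str(replica_id))
    result = subprocess.run(command, env=env, capture_output=True, text=True)
    return replica_id, result.stdout


def fan_out(command: list[str], replicas: int) -> list[str]:
    """Run `command` on all replicas concurrently and aggregate their STDOUT.

    Each output line is prefixed with the replica it came from. Lines are
    grouped per replica, in replica order, because `pool.map` preserves the
    order of its inputs.
    """
    lines: list[str] = []
    with ThreadPoolExecutor(max_workers=replicas) as pool:
        results = pool.map(lambda i: run_on_replica(i, command), range(replicas))
        for replica_id, stdout in results:
            for line in stdout.splitlines():
                lines.append(f"[replica {replica_id}] {line}")
    return lines


if __name__ == "__main__":
    # Stand-in for a command typed into the interactive shell.
    cmd = [sys.executable, "-c", "import os; print('hello from', os.environ['REPLICA_ID'])"]
    for line in fan_out(cmd, replicas=3):
        print(line)
```

Prefixing each line with its replica keeps the aggregated output attributable, which matters when debugging a single misbehaving worker. Whether output should be grouped per replica, as here, or interleaved as it arrives is an open design choice that the proposal does not settle.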


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

gaocegege commented 1 year ago

It is blocked by #1283