kubeflow/mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
430 stars, 216 forks
issues
Introduce ManagedBy field in RunPolicy
#650
mszadkow
opened
1 week ago
3
How can the file at tensorflow-benchmarks.yaml run an MPI job?
#649
luancaarvalho
opened
2 weeks ago
3
What scale can mpi-operator support?
#648
yxzhao6
opened
1 month ago
3
Worker pods not cleaned up upon `MPIJobEvicted` event
#647
shaowei-su
opened
1 month ago
0
Add support for the managedBy field
#646
mimowo
opened
1 month ago
6
Question: Is the network traffic of AllReduce (e.g., ML gradients) encrypted between workers?
#645
jsyqrt
closed
3 months ago
10
ttlSecondsAfterFinished for MPIJob, not only launcher
#644
hy00nc
opened
4 months ago
6
"cleanPodPolicy: All" does not clean up launcher pod
#643
hy00nc
opened
4 months ago
1
Connection reset
#642
bbenshab
closed
4 months ago
4
How can an MPIJob worker in mpi-operator get the hostname of the launcher?
#641
Oneal65
closed
4 months ago
2
fix #639 provide NCCL tests example
#640
samos123
opened
5 months ago
1
NCCL tests example
#639
samos123
opened
5 months ago
1
Update image tag with 0.5
#638
tenzen-y
closed
5 months ago
2
Upgrade golang and controller-gen
#637
tenzen-y
closed
5 months ago
2
Upgrade golang and controller-gen
#636
alculquicondor
closed
5 months ago
9
Replace original pointer methods with ptr libs
#635
tenzen-y
closed
5 months ago
6
Introduce resource multiplication
#634
tenzen-y
closed
5 months ago
4
Upgrade K8s dependencies to v1.29
#633
tenzen-y
closed
5 months ago
12
Promote @tenzen-y to approver
#632
terrytangyuan
closed
5 months ago
2
Prepare for release 0.5.0
#631
alculquicondor
closed
5 months ago
5
Remove unnecessary RBAC rule for mpijobs-admin***
#630
vishvajit79
opened
6 months ago
2
Bump google.golang.org/protobuf from 1.31.0 to 1.33.0
#629
dependabot[bot]
closed
6 months ago
2
Fix: no overwrite when run launcher as worker
#628
kuizhiqing
closed
7 months ago
1
Deprecated pointer, use ptr instead
#627
kuizhiqing
closed
7 months ago
2
make namespace parsing and informers pluggable
#626
emsixteeen
opened
7 months ago
9
removing klog.Fatalf in favor of a shutdown request
#625
emsixteeen
closed
8 months ago
6
adding Mac .DS_Store to gitignore
#624
emsixteeen
closed
8 months ago
1
update auto gen file year to verify generate
#623
kuizhiqing
closed
8 months ago
2
Fix: add ns filter to podLister
#622
kuizhiqing
closed
8 months ago
3
Wrong host info in discover_hosts.sh
#621
kuizhiqing
closed
8 months ago
0
Running in a subset of namespaces
#620
emsixteeen
opened
8 months ago
8
Fails mpi-operator early if access to list or watch objects is denied
#619
emsixteeen
closed
8 months ago
8
adding timeout for cache sync
#618
emsixteeen
closed
8 months ago
14
fix the condition
#617
wang-mask
opened
8 months ago
12
change1 mv to cp
#616
wang-mask
closed
8 months ago
3
The operator still creates the launcher when launcherCreationPolicy is "WaitForWorkersReady" and suspend is "true"
#615
wang-mask
opened
8 months ago
0
"make generate" command run failed
#614
wang-mask
closed
8 months ago
0
Replace the plain pod workers with Indexed Job
#613
tenzen-y
opened
9 months ago
4
run worker process in launcher pod
#612
kuizhiqing
closed
7 months ago
31
Work with DeepSpeed for large scale training
#611
kuizhiqing
opened
9 months ago
28
add deepspeed example
#610
kuizhiqing
opened
9 months ago
5
Bump golang.org/x/crypto from 0.14.0 to 0.17.0
#609
dependabot[bot]
closed
9 months ago
2
When WaitForWorkersReady is enabled in MPI operator, MPI operator and gang scheduler are in a deadlock
#608
yzhao-2023
opened
9 months ago
4
the object has been modified; please apply your changes to the latest version and try again
#607
gl-001
opened
10 months ago
8
fix bug about status absence when worker pod spec is invalid
#606
congpeiqing
opened
10 months ago
1
which is the latest mpi job definition between mpi-operator and training operator
#605
sxwl-donggang
closed
10 months ago
4
Can't get mpijob status when pod template is invalid
#604
congpeiqing
opened
10 months ago
9
Bumping opentelemetry libraries
#603
tenzen-y
closed
10 months ago
2
Bump go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from 0.35.0 to 0.46.0
#602
dependabot[bot]
closed
10 months ago
4
Fix invalid link for horovod cpu-only example Dockerfile
#601
lianghao208
closed
11 months ago
2