kubeflow/mpi-operator
Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
417 stars, 209 forks
issues
MPI Job fails on EKS with > 2 instances of 128 core per instance
#594
AymenFJA
closed
8 months ago
2
Port conflicts will occur when multiple pods are dispatched to the same node under hostnetwork.
#593
Saturnoul
opened
9 months ago
3
pod priority was assigned to 0 though the priorityclassname of the podgroup had been assigned
#592
Robin7831
closed
10 months ago
2
Add the ability to disable worker/launcher pod name suffix
#591
AymenFJA
closed
10 months ago
2
Add PITS Global Data Recovery Services as an adopter
#590
pheianox
closed
10 months ago
2
update volcano scheduler to 1.8.0; ignore vendor
#589
lowang-bh
closed
9 months ago
2
OpenMPI 4.1.5
#588
bdevcich
opened
10 months ago
11
Add actions pipeline to publish images
#587
tenzen-y
opened
10 months ago
1
[feature]upgrade volcano to v1.8.0
#586
lowang-bh
closed
9 months ago
3
Upgrade scheduler-plugins to v0.26.7
#585
tenzen-y
closed
11 months ago
4
Upgrade K8s dependencies to v0.27.4
#584
tenzen-y
closed
11 months ago
3
Upgrade Go version to v1.20
#583
tenzen-y
closed
11 months ago
2
MPIJobs with Kubernetes Python SDK
#582
AymenFJA
closed
11 months ago
4
Connection dropped after 24 hours
#581
sheevy
closed
11 months ago
2
Expected ssh contract that must be followed by images to use this operator
#580
aavbsouza
closed
11 months ago
3
add custom setup.py to install mpijob module
#579
vsoch
closed
11 months ago
6
python setup.py doesn't appear to install?
#578
vsoch
closed
11 months ago
9
Bump google.golang.org/grpc from 1.47.0 to 1.53.0
#577
dependabot[bot]
closed
12 months ago
1
Increase the timeout for E2E tests
#576
sheevy
closed
1 year ago
8
Allow to change registry via a variable
#575
sheevy
closed
9 months ago
7
Multiple MPI jobs via multiple launchers?
#574
AymenFJA
closed
1 year ago
6
Update version of debian for Docker images
#573
sheevy
opened
1 year ago
7
strange backup in hack/python-sdk/gen-sdk.sh
#572
lowang-bh
opened
1 year ago
2
merge kubeflow/common.v1 to mpi-operator
#571
lowang-bh
closed
12 months ago
7
e2e test failed sometime
#570
lowang-bh
closed
1 year ago
4
add volcano gang-schedule integration and e2e test
#569
lowang-bh
closed
1 year ago
3
Add integration and e2e test for the volcano integration
#568
tenzen-y
closed
1 year ago
3
(integration) deepspeed_mpi specific container, deepspeed_config for MPI with nodetaints
#567
ghost
closed
11 months ago
3
add volcano gang-scheduler pg min resource calculation
#566
lowang-bh
closed
1 year ago
4
Add support for linux/arm64 and linux/ppc64le for MPICH
#565
sheevy
opened
1 year ago
7
Copy APIs from common repo into here
#564
tenzen-y
closed
12 months ago
3
Release 0.5.0
#563
tenzen-y
closed
2 months ago
11
MPICH support
#562
sheevy
closed
1 year ago
14
Fix a bug that the PodGroupCtrl can not list priorityclass
#561
tenzen-y
closed
1 year ago
2
How can I deploy distributed training on Kubernetes clusters with torch.distributed.launch
#560
ThomaswellY
opened
1 year ago
3
Bumping controller-gen to fix unknown field error
#559
tenzen-y
closed
1 year ago
1
questions about applying for nodes and gpus
#558
ThomaswellY
opened
1 year ago
9
Commonize function newCleanPodPolicy()
#557
tenzen-y
closed
1 year ago
2
Pass a kubernetes version to E2E
#556
tenzen-y
closed
1 year ago
1
Run E2E with various kubernetes versions
#555
tenzen-y
closed
1 year ago
1
Replace https://bootstrap.pypa.io/get-pip.py with https://bootstrap.pypa.io/pip/3.6/get-pip.py in horovod example Dockerfile
#554
yeahdongcn
closed
1 year ago
2
Pod scheduling conundrum
#553
sheevy
closed
1 year ago
8
Seeing Launcher have two status: {'active': 1, 'failed': 1}
#552
kkolli
closed
1 year ago
2
Fix broken README link
#551
benash
closed
1 year ago
2
(add) mpi_job_duration_histogram metric with linearBuckets
#550
ghost
closed
1 year ago
2
(add) deepspeed_mpi specific container, deepspeed_config for MPI with nodetaints
#549
ghost
closed
1 year ago
13
Setting Intel MPI architectures separately
#548
alculquicondor
closed
1 year ago
4
Add written permission to the dep-manifests dir
#547
tenzen-y
closed
1 year ago
1
Running make scheduler-plugins-chart fails on second run
#546
alculquicondor
closed
1 year ago
3
Add versioned labels to images
#545
alculquicondor
closed
1 year ago
8