kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0
306 stars 143 forks source link

Test case "TestDeletePodsAndServices" error #228

Closed leileiwan closed 4 years ago

leileiwan commented 4 years ago

hi, In the file "pkg/controller.v1/pytorch/job_test.go", the test case "TestDeletePodsAndServices " may have something error,

func TestDeletePodsAndServices(t *testing.T) {
  ...
  if len(fakeServiceControl.DeleteServiceName) != tc.expectedPodDeletions {
    t.Errorf("%s: unexpected number of service deletes.  Expected %d, saw %d\n", 
        tc.description, tc.expectedPodDeletions, len(fakeServiceControl.DeleteServiceName))
   }
  ...
}

Because,now only master pod have service,so I belive that the number of service deletes should be len(masterPods),not include the number of worker pods.

gaocegege commented 4 years ago

SGTM. But why did the CI pass without the change?

/cc @johnugeorge

/kind bug

leileiwan commented 4 years ago

@gaocegege
Because in this test case, we set the service for worker pod. Then it will run the function "deletePodsAndServices " which will delete services of all pods with function "DeleteService"(FakeServiceControl). In the "FakeServiceControl ",we delete a service just append it ,so the test case can pass coincidentally. Even the CI can pass this test case, but it‘s confusing. I suggest to change it. (1) not create service for worker pod (2) delete service just append master service (3) expect deleted service equal with len(master pods)

func TestDeletePodsAndServices(t *testing.T) {
  ...
  testutil.SetServices(serviceIndexer, tc.job, testutil.LabelWorker, tc.activeWorkerServices, t)
  testutil.SetServices(serviceIndexer, tc.job, testutil.LabelMaster, tc.activeMasterServices, t)
  ...
  if len(fakeServiceControl.DeleteServiceName) != tc.expectedPodDeletions {
    t.Errorf("%s: unexpected number of service deletes.  Expected %d, saw %d\n", 
        tc.description, tc.expectedPodDeletions, len(fakeServiceControl.DeleteServiceName))
   }
  ...
}
func (pc *PyTorchController) deletePodsAndServices(job *pyv1.PyTorchJob, pods []*v1.Pod) error {
    if len(pods) == 0 {
        return nil
    }

    // Delete nothing when the cleanPodPolicy is None or Running.
    if *job.Spec.CleanPodPolicy == common.CleanPodPolicyNone ||
        *job.Spec.CleanPodPolicy == common.CleanPodPolicyRunning {
        return nil
    }

    for _, pod := range pods {
        if err := pc.PodControl.DeletePod(pod.Namespace, pod.Name, job); err != nil {
            return err
        }
        // Pod and service have the same name, thus the service could be deleted using pod's name.
        if err := pc.ServiceControl.DeleteService(pod.Namespace, pod.Name, job); err != nil {
            return err
        }
    }
    return nil
}
func (f *FakeServiceControl) DeleteService(namespace string, serviceID string, object runtime.Object) error {
    f.Lock()
    defer f.Unlock()
    f.DeleteServiceName = append(f.DeleteServiceName, serviceID)
    if f.Err != nil {
        return f.Err
    }
    return nil
}
gaocegege commented 4 years ago

Gotcha. Make sense.

@leileiwan Would you like to help us with this issue? Or I can fix it.

leileiwan commented 4 years ago

@gaocegege Thanks for you give me a chance to fix it,I will try it.

gaocegege commented 4 years ago

@leileiwan

Do not be so polite.

Thanks for your contribution! :tada: :+1:

johnugeorge commented 4 years ago

Great. Thanks @leileiwan for your contribution