robolaunch / robot-operator

Kubernetes Robot Operator for ROS/2 Based Robots
https://robolaunch.github.io/robot-operator/
Apache License 2.0

BuildManager Fails When Selecting Cluster for Steps #112

Closed tunahanertekin closed 1 year ago

tunahanertekin commented 1 year ago

What happened?

Every build step has a .selector field that determines whether the step is executed on the current cluster. BuildManager does not honor this field reliably. For example, it sometimes creates a step's job on a physical instance even though the step's selector targets only a cloud instance.
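To make the expected behavior concrete, here is a minimal sketch of label-based step selection. It is not the operator's code: the Step type, the SelectorMatches helper, the "every key/value pair must match" semantics, and the robolaunch.io/physical-instance label are all assumptions made for illustration.

```go
package main

import "fmt"

// Step is a hypothetical, trimmed-down build step holding only the
// fields needed for this illustration.
type Step struct {
	Name     string
	Selector map[string]string
}

// SelectorMatches reports whether every key/value pair in selector is
// present in clusterLabels. An empty selector matches every cluster.
// (Assumed semantics, not taken from the operator.)
func SelectorMatches(selector, clusterLabels map[string]string) bool {
	for k, v := range selector {
		if got, ok := clusterLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	step := Step{
		Name:     "rosdep-cloudy",
		Selector: map[string]string{"robolaunch.io/cloud-instance": "robot-cloud-02"},
	}

	// Labels of the two clusters; the physical label key is hypothetical.
	cloud := map[string]string{"robolaunch.io/cloud-instance": "robot-cloud-02"}
	physical := map[string]string{"robolaunch.io/physical-instance": "robot-cloudy-01"}

	fmt.Println(SelectorMatches(step.Selector, cloud))    // true
	fmt.Println(SelectorMatches(step.Selector, physical)) // false
}
```

Under these assumptions, a job for rosdep-cloudy would be created only on the cluster carrying the matching cloud-instance label.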

What did you expect to happen?

I expect each step's job to be created only on the instances that the step targets.

How can we reproduce it (as minimally and precisely as possible)?

It can be reproduced by applying a BuildManager like this:

apiVersion: types.kubefed.io/v1beta1
kind: FederatedBuildManager
metadata:
  name: build-cloudy
  namespace: my-fleet
spec:
  template:
    metadata:
      labels:
        robolaunch.io/target-robot: cloudy
      name: build-cloudy
      namespace: my-fleet
    spec:
      steps:
      - name: rosdep-cloudy
        workspace: cloudy-ws
        command:  "sleep 3"
        selector:
          robolaunch.io/cloud-instance: robot-cloud-02
  placement:
    clusters:
    - name: robot-cloudy-01
    - name: robot-cloud-02

Kubernetes version

All Kubernetes versions

Container network interface (CNI) and version

No response

tunahanertekin commented 1 year ago

Apparently, this issue has not been resolved properly.

When a BuildManager is deployed with the following manifest:

apiVersion: robot.roboscale.io/v1alpha1
kind: BuildManager
metadata:
  name: build-cloudy
  namespace: my-fleet
spec:
  steps:
  - command: |
      cd $WORKSPACES_PATH/cloudy-ws && \
      source /opt/ros/humble/setup.bash && \
      apt-get update && \
      rosdep update && \
      rosdep install --from-path src --ignore-src -y -r
    instances:
    - robot-cloud-02
    name: rosdep-cloudy
    workspace: cloudy-ws
  - command: |
      apt-get update && \
      apt-get install -y ros-humble-image-transport-plugins ros-humble-rqt-image-view
    instances:
    - robot-cloud-02
    name: compress-pkgs
    workspace: cloudy-ws
  - command: |
      cd $WORKSPACES_PATH/cloudy-ws && \
      source /opt/ros/humble/setup.bash && \
      colcon build
    instances:
    - robot-cloud-02
    name: build-cloudy
    workspace: cloudy-ws
  - command: |
      cd $WORKSPACES_PATH/cloudy-ws && \
      source /opt/ros/humble/setup.bash && \
      colcon build
    instances:
    - robot-cloud-02
    name: build-2-cloudy
    workspace: cloudy-ws
  - command: |
      cd $WORKSPACES_PATH/physical-ws && \
      source /opt/ros/humble/setup.bash && \
      apt-get update && rosdep update && \
      rosdep install --from-path src --ignore-src -y -r
    instances:
    - cloudy-mini-agv
    name: rosdep-physical
    workspace: physical-ws
  - command: |
      apt-get update && \
      apt-get install -y ros-humble-image-transport-plugins ros-humble-realsense2-camera
    instances:
    - cloudy-mini-agv
    name: camera-pkgs
    workspace: physical-ws
  - command: |
      cd $WORKSPACES_PATH/physical-ws && \
      source /opt/ros/humble/setup.bash && \
      colcon build
    instances:
    - cloudy-mini-agv
    name: build-physical
    workspace: physical-ws
  - command: |
      cd $WORKSPACES_PATH/physical-ws && \
      rosdep update && \
      source install/setup.bash && \
      source install/local_setup.bash && \
      ros2 run micro_ros_setup create_agent_ws.sh && \
      ros2 run micro_ros_setup build_agent.sh
    instances:
    - cloudy-mini-agv
    name: micro-ros-physical
    workspace: physical-ws
status:
  active: true
  phase: BuildingRobot
  scriptConfigMapStatus:
    created: true
    reference:
      apiVersion: v1
      kind: ConfigMap
      name: build-cloudy-scripts
      namespace: my-fleet
      resourceVersion: "6359"
      uid: 748cca81-cc36-4c8d-9d81-94aeaf28316a
  steps:
  - resource:
      created: true
      phase: Active
      reference:
        apiVersion: batch/v1
        kind: Job
        name: build-cloudy-rosdep-physical
        namespace: my-fleet
        resourceVersion: "6455"
        uid: 07a478e7-45bd-4d54-98f9-a11183c5878e
    step:
      command: |
        cd $WORKSPACES_PATH/physical-ws && \
        source /opt/ros/humble/setup.bash && \
        apt-get update && rosdep update && \
        rosdep install --from-path src --ignore-src -y -r
      instances:
      - cloudy-mini-agv
      name: rosdep-physical
      workspace: physical-ws
  - resource:
      created: false
      reference: {}
    step:
      command: |
        apt-get update && \
        apt-get install -y ros-humble-image-transport-plugins ros-humble-realsense2-camera
      instances:
      - cloudy-mini-agv
      name: camera-pkgs
      workspace: physical-ws
  - resource:
      created: false
      reference: {}
    step:
      command: |
        cd $WORKSPACES_PATH/physical-ws && \
        source /opt/ros/humble/setup.bash && \
        colcon build
      instances:
      - cloudy-mini-agv
      name: build-physical
      workspace: physical-ws
  - resource:
      created: false
      reference: {}
    step:
      command: |
        cd $WORKSPACES_PATH/physical-ws && \
        rosdep update && \
        source install/setup.bash && \
        source install/local_setup.bash && \
        ros2 run micro_ros_setup create_agent_ws.sh && \
        ros2 run micro_ros_setup build_agent.sh
      instances:
      - cloudy-mini-agv
      name: micro-ros-physical
      workspace: physical-ws

BuildManager creates these jobs and pods:

$ kubectl get jobs -n my-fleet
NAME                           COMPLETIONS   DURATION   AGE
build-cloudy-rosdep-cloudy     0/1           4m37s      4m37s
build-cloudy-rosdep-physical   0/1           4m37s      4m37s
$ kubectl get pods -n my-fleet
NAME                                 READY   STATUS      RESTARTS   AGE
build-cloudy-rosdep-physical-pgqpr   0/1     Error       0          5m30s
build-cloudy-rosdep-physical-vlp89   0/1     Error       0          4m32s
build-cloudy-rosdep-cloudy-9htpl     0/1     Error       0          5m30s
build-cloudy-rosdep-cloudy-75khk     1/1     Running     0          25s

The problem is that the BuildManager in the manifest above should not create a job named build-cloudy-rosdep-cloudy, since the current instance is cloudy-mini-agv and the rosdep-cloudy step targets robot-cloud-02. The BuildManager's status is generated correctly and does not contain the wrong step. The root cause of this error is being investigated.

tunahanertekin commented 1 year ago

~The root cause can be found in this function:~

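// ContainsInstance reports whether instance is listed in instances.
// An empty list is treated as "no restriction" and matches every instance.
func ContainsInstance(instances []string, instance string) bool {
    if len(instances) == 0 {
        return true
    }

    for _, v := range instances {
        if v == instance {
            return true
        }
    }
    return false
}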

~By default, a newly initialized Step object has an empty instances list, so the function always returns true for it. Irrelevant steps can be executed if they are processed before their instances are set.~
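Although the hypothesis above was struck through, the empty-list behavior of ContainsInstance is easy to check in isolation. The sketch below copies the function from this comment and calls it with instance names from the manifest; the main function is only illustrative.

```go
package main

import "fmt"

// ContainsInstance is copied from the comment above.
func ContainsInstance(instances []string, instance string) bool {
	if len(instances) == 0 {
		return true
	}
	for _, v := range instances {
		if v == instance {
			return true
		}
	}
	return false
}

func main() {
	// A step whose instances are not set yet matches any instance.
	fmt.Println(ContainsInstance(nil, "cloudy-mini-agv")) // true

	// Once instances are set, only listed instances match.
	fmt.Println(ContainsInstance([]string{"robot-cloud-02"}, "cloudy-mini-agv"))  // false
	fmt.Println(ContainsInstance([]string{"cloudy-mini-agv"}, "cloudy-mini-agv")) // true
}
```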

tunahanertekin commented 1 year ago

When attempting to delete builder jobs, the operator updates the instance status and puts the first step into the status even if that step is not configured to run on the current instance. The root cause and the fix are in func reconcileDeleteBuilderJobs (solved in f136f9ab0de5afe4453e89b639cb7314dca836de).
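As an illustration of the kind of guard that avoids this class of bug, the sketch below filters steps by the current instance before they would be recorded in the status. It is not the code from the commit: the Step type, the runsOn method, and StepsForInstance are hypothetical names, and only the step and instance names come from the manifest above.

```go
package main

import "fmt"

// Step is a hypothetical, trimmed-down version of a BuildManager step.
type Step struct {
	Name      string
	Instances []string
}

// runsOn mirrors the semantics of ContainsInstance: an empty list
// means the step may run on every instance.
func (s Step) runsOn(instance string) bool {
	if len(s.Instances) == 0 {
		return true
	}
	for _, v := range s.Instances {
		if v == instance {
			return true
		}
	}
	return false
}

// StepsForInstance returns, in their original order, only the steps
// that target the given instance.
func StepsForInstance(steps []Step, instance string) []Step {
	var out []Step
	for _, s := range steps {
		if s.runsOn(instance) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	steps := []Step{
		{Name: "rosdep-cloudy", Instances: []string{"robot-cloud-02"}},
		{Name: "compress-pkgs", Instances: []string{"robot-cloud-02"}},
		{Name: "build-cloudy", Instances: []string{"robot-cloud-02"}},
		{Name: "build-2-cloudy", Instances: []string{"robot-cloud-02"}},
		{Name: "rosdep-physical", Instances: []string{"cloudy-mini-agv"}},
		{Name: "camera-pkgs", Instances: []string{"cloudy-mini-agv"}},
		{Name: "build-physical", Instances: []string{"cloudy-mini-agv"}},
		{Name: "micro-ros-physical", Instances: []string{"cloudy-mini-agv"}},
	}

	// Prints the four physical-ws steps, matching the status shown earlier.
	for _, s := range StepsForInstance(steps, "cloudy-mini-agv") {
		fmt.Println(s.Name)
	}
}
```

With a filter like this applied on every code path that writes steps to the status, rosdep-cloudy would never be considered on cloudy-mini-agv, so no build-cloudy-rosdep-cloudy job would be created there.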