vmware-tanzu-labs / educates-training-platform

A platform for hosting interactive workshop environments in Kubernetes, or on top of a local container runtime.
https://docs.educates.dev
Apache License 2.0

Workshop environment gets stuck in STARTING state. #435

Closed GrahamDumpleton closed 2 weeks ago

GrahamDumpleton commented 2 weeks ago

Describe the bug

A workshop environment has been observed getting stuck in the STARTING state from the perspective of the training portal. Unfortunately, logs for the period when this occurred were not available.

A workshop environment is only progressed to RUNNING state by the training portal when it has received a kopf.Event for the WorkshopEnvironment resource, and the status.educates.workshop.uid field exists.
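As a rough illustration of the condition described above, the check on the status field could look something like the following. This is a minimal sketch, not the training portal's actual code: the helper names `workshop_uid` and `should_progress_to_running` are hypothetical, and `body` is assumed to behave like a nested mapping.

```python
def workshop_uid(body):
    """Return status.educates.workshop.uid, or None if any level is missing."""
    node = body.get("status") or {}
    for key in ("educates", "workshop"):
        node = node.get(key) or {}
    return node.get("uid")


def should_progress_to_running(body):
    """True once the workshop details have been recorded in the status."""
    return workshop_uid(body) is not None
```

A guard of this kind depends only on the content of the resource, not on the type of the event that delivered it, so it gives the same answer whichever event type carries the updated status.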

The handler for receiving this event is coded as:

@kopf.on.event(
    f"training.{settings.OPERATOR_API_GROUP}",
    "v1beta1",
    "workshopenvironments",
    when=lambda event, labels, **_: event["type"] in (None, "MODIFIED")
    and labels.get(f"training.{settings.OPERATOR_API_GROUP}/portal.name", "")
    == settings.PORTAL_NAME,
)
def workshop_environment_event(
    event, meta, body, **_
):  # pylint: disable=unused-argument
    """This is the entrypoint for handling event notifications for the
    WorkshopEnvironment resource. We watch for these so we know when the
    details of a workshop are added to the status of the workshop environment
    resource signalling that the workshop environment has been created. When
    this is seen, use that to progress the state of the workshop environment
    to running.

    """

Of note, the handler filters on event type and only does something if type is None or MODIFIED.

The None value is believed to occur when the process starts up and the resource already existed before the watch for events was registered. The MODIFIED string value indicates that the resource has been modified.

The idea is that one would get the MODIFIED event when the resource status is updated by the session manager operator to add the details of the workshop to the workshop environment resource.

An event handler can also receive ADDED and DELETED events.

We definitely don't care about DELETED, since nothing can be done at that point anyway.

As to ADDED, the speculation is that it should not be filtered out and should also be processed. What may be occurring is that the ADDED and MODIFIED events arrived so close together in time that the kopf framework collapsed the two into a single event, which it labelled as ADDED, rather than passing through two separate events.

In other words, the speculation is that this is a timing issue, and the process was handling events slowly enough for kopf to coalesce them.
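To make the speculated failure concrete, the sketch below compares the filter as quoted in this issue with a variant that also accepts ADDED. It is plain Python with no kopf dependency. The values of `OPERATOR_API_GROUP` and `PORTAL_NAME`, the function names, and the sample event are all assumptions made for illustration; whether kopf really delivers such a merged event is exactly the unconfirmed part.

```python
# Assumed values for illustration only; the real ones come from settings.
OPERATOR_API_GROUP = "educates.dev"
PORTAL_NAME = "example-portal"
PORTAL_LABEL = f"training.{OPERATOR_API_GROUP}/portal.name"


def current_filter(event, labels, **_):
    """Filter as quoted in this issue: ADDED events are dropped."""
    return (
        event["type"] in (None, "MODIFIED")
        and labels.get(PORTAL_LABEL, "") == PORTAL_NAME
    )


def proposed_filter(event, labels, **_):
    """Same filter, but ADDED events are also passed to the handler."""
    return (
        event["type"] in (None, "ADDED", "MODIFIED")
        and labels.get(PORTAL_LABEL, "") == PORTAL_NAME
    )


# Hypothetical single event of the kind speculated above: typed ADDED,
# but already carrying the status written by the session manager.
coalesced_event = {
    "type": "ADDED",
    "object": {
        "metadata": {"labels": {PORTAL_LABEL: PORTAL_NAME}},
        "status": {"educates": {"workshop": {"uid": "hypothetical-uid"}}},
    },
}
labels = coalesced_event["object"]["metadata"]["labels"]

print(current_filter(event=coalesced_event, labels=labels))   # False
print(proposed_filter(event=coalesced_event, labels=labels))  # True
```

If no later MODIFIED event arrives for the resource, the first filter never lets the handler see the populated status, which would match the stuck STARTING state.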

Additional information

Due to https://github.com/vmware-tanzu-labs/educates-training-platform/issues/434 the workshop environment could not be refreshed from the training portal admin pages.

The training portal code also has:

@kopf.on.event(
    f"training.{settings.OPERATOR_API_GROUP}",
    "v1beta1",
    "trainingportals",
    when=lambda event, name, uid, annotations, **_: name == settings.PORTAL_NAME
    and uid == settings.PORTAL_UID
    and event["type"] in (None, "MODIFIED"),
)
@resources_lock
@transaction.atomic
def training_portal_event(event, name, body, **_):

For good measure, it is suggested that this handler also process the event when its type is ADDED.
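A sketch of what that suggestion might look like for the training portal filter, written as a named function rather than a lambda so it can be exercised on its own. The function name `is_own_portal_event` and the `PORTAL_NAME` and `PORTAL_UID` values are hypothetical placeholders, not taken from the project.

```python
# Placeholder values for illustration; the real ones come from settings.
PORTAL_NAME = "example-portal"
PORTAL_UID = "11111111-2222-3333-4444-555555555555"


def is_own_portal_event(event, name, uid, **_):
    """Accept None, ADDED and MODIFIED events for this portal's own resource."""
    return (
        name == PORTAL_NAME
        and uid == PORTAL_UID
        and event["type"] in (None, "ADDED", "MODIFIED")
    )
```

A function like this could be passed as the `when=` argument in place of the lambda, assuming the callback is invoked with the same keyword arguments the existing lambda already relies on.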

All other instances of event handlers across the training portal, secrets manager and session manager always process ADDED as well as MODIFIED.