lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Define a structured JSON schema for transporting kernel status data to backend.ai-webui #679

Open rapsealk opened 2 years ago

rapsealk commented 2 years ago

Is your feature request related to a problem? Please describe. We need to define a structured JSON schema for transporting kernel status data to backend.ai-webui. Although we already send kernel status data to clients, it is hard to interpret on the client side because the structure is not unified: each kind of data is currently sent in its own raw format, and every one has a different shape.

Here is some example code: https://github.com/lablup/backend.ai-webui/blob/79eed187adf1b506f05e8417d4f223dabcc57ef7/src/components/backend-ai-session-list.ts#L1443-L1587

In the code above, the layout depends on 4 types of data: kernel, session, scheduler, and error. The problem is that each of these has a different shape.

// Kernel
{
  "kernel": {
    "exit_code": 0,
  }
}

// Session
{
  "session": {
    "status": "...",
  }
}

// Scheduler
{
  "scheduler": {
    "msg": "...",
    "retries": 0,
    "last_retry": "...",
    "passed_predicates": [
      {
        "name": "...",
      },
    ],
    "failed_predicates": [
      {
        "name": "...",
        "msg": "...",
      },
    ],
  },
}

It is even worse in the case of errors, which come in 2 types:

// Type 1
{
  "error": {
    "name": "...",
    "agent_id": "...",
    "repr": "...",
  }
}

// Type 2
{
  "error": {
    "collection": [
      {
        "name": "...",
        "agent_id": "...",
        "repr": "...",
      },
    ],
  }
}
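
Until the schema is unified, a consumer could fold the two error shapes into one list. The following is a minimal sketch in Python, assuming the status data arrives as a dict shaped like the examples in this issue. The helper name `normalize_errors` is hypothetical and is not part of Backend.AI.

```python
from typing import Any


def normalize_errors(status_data: dict[str, Any]) -> list[dict[str, Any]]:
    """Return errors as a flat list regardless of which shape was sent.

    Hypothetical helper; field names follow the examples in this issue.
    """
    error = status_data.get("error")
    if not error:
        return []
    # Type 2: several errors wrapped in a "collection" list.
    if "collection" in error:
        return list(error["collection"])
    # Type 1: a single error object.
    return [error]


# Sample inputs with made-up values, one per error shape.
single = {"error": {"name": "E1", "agent_id": "a1", "repr": "r1"}}
multi = {
    "error": {
        "collection": [
            {"name": "E1", "agent_id": "a1", "repr": "r1"},
            {"name": "E2", "agent_id": "a2", "repr": "r2"},
        ]
    }
}

for sample in (single, multi):
    print([e["name"] for e in normalize_errors(sample)])
```

With this in place, rendering code only ever has to loop over a list, whichever shape the server sent.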

Describe the solution you'd like Define a unified JSON schema for transporting and interpreting kernel status data.

Describe alternatives you've considered

Additional context

soheeeeP commented 2 years ago

@rapsealk How about defining a JSON schema with predicate_check and predicates_detail? I'd like to ask whether the schema below goes in the right direction for this issue.

lizable commented 2 years ago

@rapsealk First of all, thank you so much for bringing up this topic, which could be very helpful for rendering HTML templates in WebUI. FWIW, we have two options, and each has pros and cons, obviously. I have written them up below; it would be appreciated if you picked one and gave your reasons.

The first option is to put a DTO (data transfer object) layer between the client and the responses for all kernel statuses (kernel, session, scheduler, error (type 1), error (type 2)). This requires no refactoring of the responses scattered all over the manager component, so we would not spend much time finding them and converting them to a unified data schema. There are drawbacks to this option, of course. For example, we need to decide where to put the DTO layer, which is closely tied to who will maintain it. If we settle on the server side (in the manager component), it would be handled there, at the cost of extra maintenance. If we settle on the client side (in WebUI), WebUI needs to account for the extensibility of the data so that updates to the server-side data structure cause fewer errors, which requires a thorough understanding (and probably the philosophy) of the server-side context.
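
To make the DTO option concrete, here is a minimal sketch in Python of what such a layer might look like if it lived on the server side. It assumes the raw status data is a dict shaped like the examples in the issue description. The `StatusEntry` class, its fields, and `to_status_entries` are hypothetical names chosen for illustration and are not an existing Backend.AI API.

```python
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class StatusEntry:
    """One uniform record per status item (hypothetical DTO)."""

    category: str  # "kernel", "session", "scheduler", or "error"
    message: str = ""
    details: dict[str, Any] = field(default_factory=dict)


def to_status_entries(raw: dict[str, Any]) -> list[StatusEntry]:
    """Convert the differently shaped raw sections into uniform entries."""
    entries: list[StatusEntry] = []
    if "kernel" in raw:
        kernel = raw["kernel"]
        message = f"exit_code={kernel.get('exit_code')}"
        entries.append(StatusEntry("kernel", message, dict(kernel)))
    if "session" in raw:
        session = raw["session"]
        entries.append(StatusEntry("session", str(session.get("status", "")), dict(session)))
    if "scheduler" in raw:
        scheduler = raw["scheduler"]
        entries.append(StatusEntry("scheduler", str(scheduler.get("msg", "")), dict(scheduler)))
    if "error" in raw:
        error = raw["error"]
        # Accept both error shapes: a single object or a "collection" list.
        items = error["collection"] if "collection" in error else [error]
        for item in items:
            entries.append(StatusEntry("error", str(item.get("repr", "")), dict(item)))
    return entries


# Sample input with made-up values.
raw_status = {
    "kernel": {"exit_code": 0},
    "scheduler": {"msg": "pending", "retries": 1},
    "error": {"name": "E1", "agent_id": "a1", "repr": "r1"},
}
for entry in to_status_entries(raw_status):
    print(asdict(entry))
```

The existing responses stay untouched; only the adapter has to change when a raw shape changes, which is the maintenance cost mentioned above.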

The other option is to define a good, extensible data structure on the server side by refactoring every response related to kernel status. This would be a bit painful: for now, all of the kernel status responses are scattered across functions and wrapped without any unified, detailed layer, and the assignee of this issue would have to merge and generalize them into one structured type (with extension in mind, of course). But it is worth doing: if we manage it, no kernel status would be ignored for not fitting the current status, and the client side would need no nested conditional HTML templates. As I mentioned just above, what matters is whether we can unify the scattered data and plan for future extensibility, which requires a rich understanding of what every kernel status means and how each status is received from the agent or other components.

rapsealk commented 2 years ago

@soheeeeP Thank you for the suggestion and for showing interest. However, I'm sorry to say that this issue has not been settled yet (as you can see above), so it is hard to say which direction would be right. We need more discussion to reach a consensus on how to approach this issue. We will let you know when it is clearly ready to be started.