trento-project / trento

An open cloud-native web console improving on the work day of SAP Applications administrators.
https://www.trento-project.io
Apache License 2.0

Information missing or wrong information displayed in console after full shutdown #515

Closed abravosuse closed 2 years ago

abravosuse commented 2 years ago

This issue has been noticed in the test environment that has been set up for the QA Team.

Prerequisite: we have performed a full shutdown of SAP environment env1 (prd) following the steps indicated at https://confluence.suse.com/display/SAP/SAP+landscape+for+Trento+testing+in+Azure#SAPlandscapeforTrentotestinginAzure-FullStop.

Steps to reproduce the problem (once the environment is down):

  1. Open the Trento Console URL.
  2. Click on SAP Systems in the Menu section (or in the "Go to view" link in the SAP Systems section) -> Missing: System NWP is not listed. If Trento is to be used as a monitoring tool, the system should be listed, and when you click on the SID to review the system details, you should see all instances in status Stopped.
  3. Click on HANA Databases in the Menu section and expand database HDP: -> Wrong: HANA primary is showing on vmhdbprd01 with status SOK
  4. Click on HDP to review the database details -> Wrong: Instance 10 showing on vmhdbprd01 with status SAPControl-Green (instance is stopped)
  5. Click on Pacemaker Clusters in the Menu section, filter by Cluster type = HANA scale-up and Tag = env1 → Pass
  6. Click on the cluster name or ID to review the cluster details:
     -> Wrong: SAPHanaSR health state = 4 (should be 0 or 1)
     -> Wrong: HANA secondary sync state = green SOK (should be SFAIL)
     -> Wrong: no stopped resources (all resources are stopped)
     -> Wrong: Details for node vmhdbprd01 shows all resources as active (all resources are inactive)
     -> Wrong: Details for node vmhdbprd02 shows all resources active (all resources are inactive)
  7. Click on Hosts in the Menu section and filter by Tag = env1. -> Wrong: only two hosts showing (vmdrbdprd02 and vmhdbprd01): all hosts should be listed with a message such as "host not reachable"
  8. Click on each host to review its status. -> Wrong: All processes in vmhdbprd01 (HANA primary node) showing in status green Running
abravosuse commented 2 years ago

Funny enough, now that the environment is down, the health section in the cluster detail view shows which checks are in Passing, Warning or Critical status. Before, this section didn't show any checks, despite all checks being selected (https://github.com/trento-project/trento/issues/507).

As a matter of fact, that's the case for the two clusters that are still up and running: checks are selected but nothing is shown in the Health section.

Zaoliang commented 2 years ago

Funny enough, now that the environment is down, the health section in the cluster detail view shows which checks are in Passing, Warning or Critical status. Before, this section didn't show any checks, despite all checks being selected (#507).

As a matter of fact, that's the case for the two clusters that are still up and running: checks are selected but nothing is shown in the Health section.

Yes, I can confirm this delayed behavior. I waited more than an hour for any change in the console. I think it is a major issue if Trento Server cannot detect the status of hosts and agents in time. If Trento Server is not able to detect the status of an agent, it should show an 'unknown' or 'red' status in the console.

arbulu89 commented 2 years ago

@abravosuse I have tried to follow your steps. First of all, keep in mind that if you stop Trento, the data obviously cannot be updated. I don't know when you started Trento again. Using the latest code (https://github.com/trento-project/trento/commit/9523ce6b536ac2f480bb0f9079f9af50d0786f14), I get the following:

-> Missing: System NWP is not listed. If Trento is to be used as a monitoring tool, the system should be listed, and when you click on the SID to review the system details, you should see all instances in status Stopped.

This is actually what I see: image

The layout is not shown properly on the database side, because handling system replication made things more difficult. Anyway, the SAP System page is under construction, and it will be disabled temporarily until we have something better.

  • Click on HANA Databases in the Menu section and expand database HDP: -> Wrong: HANA primary is showing on vmhdbprd01 with status SOK

What should we show here?

  • Click on HDP to review the database details -> Wrong: Instance 10 showing on vmhdbprd01 with status SAPControl-Green (instance is stopped)

I see it as stopped: image

  • Click on Pacemaker Clusters in the Menu section, filter by Cluster type = HANA scale-up and Tag = env1 → Pass

  • Click on the cluster name or ID to review the cluster details -> Wrong: SAPHanaSR health state = 4 (should be 0 or 1) -> Wrong: HANA secondary sync state = green SOK (should be SFAIL) -> Wrong: no stopped resources (all resources are stopped) -> Wrong: Details for node vmhdbprd01 shows all resources as active (all resources are inactive) -> Wrong: Details for node vmhdbprd02 shows all resources active (all resources are inactive)

More doubts here. What should we show? Here the cluster is in maintenance. In this state, the resources are not working, so we don't get updates. If the cluster is stopped, again, we cannot get any information, so the only thing we could show is that the cluster is not running. All of these requests should be refined. The maintenance and cluster-stopped modes have not been implemented, so saying that they work wrongly is not really true, hehe.

  • Click on Hosts in the Menu section and filter by Tag = env1. -> Wrong: only two hosts showing (vmdrbdprd02 and vmhdbprd01): all hosts should be listed with a message such as "host not reachable"

This works pretty well in my case. Only node 1 is not working, and it is reported as an error: image

In summary, I would dispute that most of the comments in this issue are real bugs. Basically, they are things that are not even implemented. So, a question for the stakeholders: @lee-martin @stefanotorresi @abravosuse

abravosuse commented 2 years ago

@arbulu89 what we are trying to see with this test is the reaction of the Trento console when Trento Server can reach neither an agent nor its host via SSH (this can be the situation due to a network issue or, like in this case, because the server crashed and is down). In my opinion, in this situation the Hosts view should show that the host is in Critical status, and when you click on the detail view you should get a message saying that Trento Server can reach neither the Agent nor the host itself. That would urge the admin to go and see what the heck is happening with that host.

Then, any service provided by that particular host (SAP instance, HANA instance, cluster resource) should show in status unknown. In certain cases (NetWeaver instance, cluster resources) I said the status should be stopped. But I realize now that's wrong. Trento cannot know the status of the service; therefore, the most it can say about it is that its status is unknown. Anything else would be wrong, right?

@lee-martin @stefanotorresi what do you think?
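
The "unknown" rule proposed in this comment could be sketched roughly as follows. This is only an illustration in Python, not code from the Trento repository: the `Service` type, the `displayed_status` function and the `reachable_hosts` set are all hypothetical names, and it assumes the Server keeps the last status reported for each service.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Service:
    # Hypothetical model of anything provided by a host:
    # an SAP instance, a HANA instance or a cluster resource.
    name: str
    host: str
    last_known_status: str  # last status reported while the host was reachable


def displayed_status(service: Service, reachable_hosts: Set[str]) -> str:
    """Return the status the console would display for a service."""
    # If the host cannot be reached, the last report cannot be trusted,
    # so the only honest answer is "unknown".
    if service.host not in reachable_hosts:
        return "unknown"
    return service.last_known_status
```

With this rule, a HANA instance last seen as SAPControl-Green on vmhdbprd01 would be shown as unknown as soon as vmhdbprd01 drops out of the reachable set, instead of staying green.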

stefanotorresi commented 2 years ago
  • What should we show when the cluster is in maintenance?

A warning.

Should we stop the checks?

Yes. I don't think it makes sense to check the configuration of a cluster that is in maintenance, because that means that someone is probably working on it.

Should we put some health state?

Yes, a warning. 😉
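
The three answers above could be combined into something like the following sketch. It is written in Python purely for illustration and is not taken from the Trento codebase: `cluster_health` and its parameters are hypothetical, and mapping a failed check to "critical" is an assumption made only to keep the example complete.

```python
from typing import Callable, List, Tuple


def cluster_health(in_maintenance: bool,
                   checks: List[Callable[[], bool]]) -> Tuple[str, int]:
    """Return (health state, number of checks executed) for a cluster."""
    if in_maintenance:
        # Someone is probably working on the cluster: do not run any
        # checks and report a warning.
        return "warning", 0
    results = [check() for check in checks]
    # Assumption for this sketch: any failed check makes the state critical.
    state = "passing" if all(results) else "critical"
    return state, len(results)
```

The second element of the returned tuple makes it easy to confirm that no check is executed while the cluster is in maintenance.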

stefanotorresi commented 2 years ago

when Trento Server can reach neither an agent nor its host via SSH (this can be the situation due to a network issue or, like in this case, because the server crashed and is down). In my opinion, in this situation the Hosts view should show that the host is in Critical status, and when you click on the detail view you should get a message saying that Trento Server can reach neither the Agent nor the host itself. That would urge the admin to go and see what the heck is happening with that host.

We already discussed this: no, this is not a critical situation, it's only a warning, because the fact that Trento Runner fails to execute the checks for whatever reason doesn't mean that the workloads in the hosts are not up. The source of truth of the hosts' health is the Agent: only if the Agent stops reporting to the Server, then it's a critical condition.
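
A minimal sketch of this rule, in Python for illustration only. The function `host_health`, the `HEARTBEAT_TIMEOUT` value and the status strings are hypothetical and not taken from the Trento codebase; the sketch assumes the Server records the time of the last Agent report and whether the Runner managed to execute the checks.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical threshold: how long the Server waits for an Agent report
# before treating the Agent as silent. Not an actual Trento setting.
HEARTBEAT_TIMEOUT = timedelta(seconds=30)


def host_health(last_agent_report: Optional[datetime],
                runner_checks_ok: bool,
                now: datetime) -> str:
    """Derive a host health state from the two signals discussed above."""
    # The Agent is the source of truth: no recent report means critical.
    if last_agent_report is None or now - last_agent_report > HEARTBEAT_TIMEOUT:
        return "critical"
    # The Runner failing to execute checks is only a warning, because the
    # workloads on the host may still be up.
    if not runner_checks_ok:
        return "warning"
    return "passing"
```

The Agent condition is evaluated first, so a silent Agent is reported as critical regardless of what the Runner could or could not do.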

abravosuse commented 2 years ago

v0.6.0 (QA testing cycle #2): essentially the same results as in cycle 1 with the following differences:

  1. Click on HANA Databases in the Menu section and expand database HDP: -> Wrong: HANA primary is not listed at all (and should be listed as a registered database with status unknown)
  2. Click on Hosts in the Menu section and filter by Tag = env1. -> Wrong: in this cycle no hosts with tag env1 are listed
abravosuse commented 2 years ago

v0.7.0. Much improved over the previous version:

  1. Click on SAP Systems in the Menu section (or in the "Go to view" link in the SAP Systems section) -> System is listed (good) and instances showing in Stopped (SAPControl-Grey) status (much better than Green). Unknown status would be more accurate, given that Trento cannot connect to the Agent and therefore doesn't really know what's going on.
  2. Click on HANA Databases in the Menu section and expand database HDP: -> Both instances shown but without any roles assigned (good).
  3. Click on HDP to review the database details -> Both instances showing in Stopped (SAPControl-Grey) status (much better than Green). Unknown status would be more accurate, given that Trento cannot connect to the Agent and therefore doesn't really know what's going on.
  4. Click on the cluster name or ID to review the cluster details -> Details for site NBG show all resources as active.
  5. Click on Hosts in the Menu section and filter by Tag = env1. -> All hosts showing in critical status (great)
  6. Click on each host to review its status. -> Status shows Agent is not running (good).
abravosuse commented 2 years ago

This entire discussion falls under the topic of what to do when Trento Server cannot reach a Trento Agent and how that should be visualized in the console. As discussed in issue #491, this has been put aside as a potential enhancement for future versions, and therefore this issue can be closed.