vmware-archive / pcf-pipelines

PCF Pipelines
Apache License 2.0

Method to check OpsMgr running tasks may fail when old pending tasks exist in DB #86

Closed lsilvapvt closed 7 years ago

lsilvapvt commented 7 years ago

This problem occurred in two distinct PCF 1.9.2 environments of a customer that deployed pcf-pipelines (tested with v0.8, v0.11 and v0.13.2): the Apply-Change and Wait-for-Opsmgr tasks of the Upgrade-Tile pipeline both report that a task is already running in OpsMgr, even though none has been started in the OpsMgr UI. That return code prevents the pipeline from proceeding to the Apply-Changes phase of the upgrade, requiring the customer to use the OpsMgr UI to continue.

The root cause: we found that the output of the OpsMgr "installations" API contained a task a couple of months old that was still in "running" state with no finished_at date (see example below). That entry caused the tasks mentioned above to incorrectly report that a task is already running (even the apply-changes command of the om tool v0.23 fails because of it), because their methods simply check for the existence of an entry in "running" state. According to the customer, the situation seems to have been caused by a reboot of the OpsMgr VM after the corresponding running action got stuck. Apparently OpsMgr left that entry unchanged in its installs table after the reboot and never updated it to "failed" state.

{
  "user_name": "admin",
  "finished_at": null,
  "started_at": "2017-02-01T17:38:42.941Z",
  "status": "running",
  "additions": [],
  "deletions": [],
  "updates": [
    {
      "identifier": "p-rabbitmq",
      ...
    },
    ...
  ],
  "id": 17
},
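To illustrate why the existence check misfires, here is a minimal Python sketch (not the actual code of task.sh or the om tool). It contrasts a check that only looks for a "running" entry with one that also considers the age of started_at. The function names, the 24-hour threshold, and the second sample entry with status "succeeded" are assumptions made for the example only.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts):
    # Timestamps in the example entry look like "2017-02-01T17:38:42.941Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def naive_is_running(installations):
    # Mirrors the behaviour described above: any "running" entry counts.
    return any(i["status"] == "running" for i in installations)


def split_running(installations, now, max_age=timedelta(hours=24)):
    # Hypothetical refinement: treat "running" entries older than max_age as stale.
    active, stale = [], []
    for entry in installations:
        if entry["status"] != "running":
            continue
        age = now - parse_ts(entry["started_at"])
        (stale if age > max_age else active).append(entry)
    return active, stale


installations = [
    # Entry modelled on the example above.
    {"id": 17, "status": "running",
     "started_at": "2017-02-01T17:38:42.941Z", "finished_at": None},
    # Hypothetical later entry, added only for contrast.
    {"id": 18, "status": "succeeded",
     "started_at": "2017-05-01T10:00:00.000Z",
     "finished_at": "2017-05-01T11:00:00.000Z"},
]

now = datetime(2017, 5, 4, 13, 58, 39, tzinfo=timezone.utc)
active, stale = split_running(installations, now)
print(naive_is_running(installations))
print([e["id"] for e in active], [e["id"] for e in stale])
```

With this sample data the naive check reports a running task, while the age-aware check classifies entry 17 as stale. Whether an age threshold is an acceptable signal for a real deployment is a judgment call; a long-running but legitimate install would need a larger threshold.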

Potential solutions:

A) Update both the wait-for-opsmgr task.sh and the om tool to parse the recent OpsMgr events JSON file instead of just searching for an entry with "running" status; OR

B) Provide a Known-Issues readme in the pcf-pipelines package describing the issue above and the workaround below to fix those event entries in the OpsMgr DB:

1) Make a backup copy of the OpsMgr settings (add link to docs)
2) SSH to the OpsMgr VM and become root (sudo su -)
3) Switch to the postgres user (sudo su postgres)
4) Execute the command psql (no password required)
5) Connect to the DB:
   \connect tempest_production
6) Find the id of the task in "running" state:
   SELECT * FROM installs WHERE status='running';
7) Change the status of the corresponding entry:
   UPDATE installs SET status='failed', finished_at='2017-05-04T13:58:39.620Z', finished=true WHERE id='<ID-NUMBER-FROM-PREVIOUS-STEP>';
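Steps 6 and 7 of the workaround can be rehearsed safely before touching a real OpsMgr database. The sketch below uses an in-memory SQLite table as a stand-in for the PostgreSQL installs table. The table schema is an assumption limited to the columns named in this issue, and the helper name fail_stale_install is hypothetical.

```python
import sqlite3

# In-memory stand-in for the tempest_production database (assumption:
# only the columns mentioned in this issue are modelled).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE installs ("
    "id INTEGER PRIMARY KEY, status TEXT, started_at TEXT, "
    "finished_at TEXT, finished INTEGER)"
)
conn.execute(
    "INSERT INTO installs VALUES "
    "(17, 'running', '2017-02-01T17:38:42.941Z', NULL, 0)"
)


def fail_stale_install(conn, install_id, finished_at):
    # Step 7: mark one entry as failed, but only if it is still 'running'.
    cur = conn.execute(
        "UPDATE installs SET status='failed', finished_at=?, finished=1 "
        "WHERE id=? AND status='running'",
        (finished_at, install_id),
    )
    conn.commit()
    return cur.rowcount


# Step 6: find the ids of entries stuck in 'running' state.
running_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM installs WHERE status='running'")
]

changed = sum(
    fail_stale_install(conn, i, "2017-05-04T13:58:39.620Z") for i in running_ids
)
remaining = conn.execute(
    "SELECT COUNT(*) FROM installs WHERE status='running'"
).fetchone()[0]
print(running_ids, changed, remaining)
```

The extra status='running' condition in the UPDATE is a defensive choice for the sketch, so that an entry which finished in the meantime is not overwritten. SQLite stores the boolean as 0/1, whereas the original statement uses finished=true for PostgreSQL.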

cf-gitbot commented 7 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this github issue will be updated when the story is started.

krishicks commented 7 years ago

I think a combination of adding a Known Issues section to the README and updating wait-for-opsmgr to check the OpsMgr events file would be good.

ryanpei commented 7 years ago

cc @sadvani

abbyachau commented 7 years ago

Hi @lsilvapvt, I believe we will be addressing this issue here: https://github.com/pivotal-cf/pcf-pipelines/pull/177. Please let us know if you have any feedback by following this tracker story: https://www.pivotaltracker.com/story/show/150672203. I'll be closing this issue in favour of the aforementioned tracker story. I'll add a link in that tracker story to this issue so we have a record of it. Thanks.