intel / intelRSD

Intel® Rack Scale Design Reference Software
http://intel.com/IntelRSD
101 stars 55 forks source link

Service didn't be removed by ServiceRemovalTask #82

Closed rwleea closed 6 years ago

rwleea commented 6 years ago

Hi Intel,

I have suffered a strange case in my environment, I restart PSME, and my expectation is ExternalServiceAvailabilityCheckerTask would be trigger, and mark this PSME as unreachable, and after about 10s ServiceRemovalTask would cancel tasks and remove all entities which owned by this PSME.

However the result is after this PSME be marked as unreachable by ExternalServiceAvailabilityCheckerTask , the next round ServiceRemovalTask didn't remove this PSME.

there are my configs:

external-service.json : "RetainUnavailableServicesForHours" : 0 and I didn't modify TASK_DELAY_SECONDS in ServiceRemovalTaskScheduler, I use default value (10s)

And there are the logs:

  1. PSME is marked as unreachable at 20:56:02 2018-08-01 20:56:02,452 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] DEBUG c.i.p.c.s.SerialExecutorImpl - Coordinated Task (async) com.intel.podm.discovery.external.ExternalServiceAvailabilityCheckerTask$Proxy$_$$WeldSubclass@a6beb83e/8b8a43dc-8ca3-11e8-9e02-17983f95fdb3 - started 2018-08-01 20:56:02,452 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] DEBUG c.i.p.d.e.ExternalServiceAvailabilityCheckerTask - Verifying service with UUID 8b8a43dc-8ca3-11e8-9e02-17983f95fdb3 2018-08-01 20:56:02,759 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] INFO c.i.p.d.e.ExternalServiceAvailabilityCheckerTask[ServiceLifecycle] - Service ExternalService {UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3, baseUri=https://10.3.0.16:8415/redfish/v1, type=PSME} has been set to ABSENT state. Scheduling removal of service after PT0S 2018-08-01 20:56:02,956 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] DEBUG c.i.p.c.s.SerialExecutorImpl - Coordinated Task (async) com.intel.podm.discovery.external.ExternalServiceAvailabilityCheckerTask$Proxy$$$_WeldSubclass@a6beb83e/8b8a43dc-8ca3-11e8-9e02-17983f95fdb3 - ended

  2. but this PSME has not been removed by ServiceRemovalTask in 20:56:14, it has been removed at 20:56:26 instead. 2018-08-01 20:55:56,594 [EE-ManagedScheduledExecutorService-default-Thread-2] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=7e15cec4-8ca3-11e8-a44e-3b9214e8fd3e, baseUri=https://10.3.0.16:8407/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:55:47.014} is unreachable longer than PT0S - will be evicted. 2018-08-01 20:56:14,948 [EE-ManagedScheduledExecutorService-default-Thread-1] TRACE c.i.p.d.e.ServiceRemovalTask$Remover - checking whether unreachable services should be evicted 2018-08-01 20:56:14,949 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=84da8402-8ca3-11e8-9781-5391e993bb7c, baseUri=https://10.3.0.16:8411/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:56:02.128} is unreachable longer than PT0S - will be evicted. 2018-08-01 20:56:23,357 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=88350992-8ca3-11e8-8f10-b7488d9640db, baseUri=https://10.3.0.16:8412/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:55:57.903} is unreachable longer than PT0S - will be evicted. 2018-08-01 20:56:23,537 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=8ef01c22-8ca3-11e8-b36c-1b4e30a5884d, baseUri=https://10.3.0.16:8422/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:55:56.997} is unreachable longer than PT0S - will be evicted. 2018-08-01 20:56:26,608 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3, baseUri=https://10.3.0.16:8415/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:56:02.909} is unreachable longer than PT0S - will be evicted.

I am very curious about this behavior, do you know why ServiceRemovalTask work in this way?

(the log has been attached: https://drive.google.com/file/d/1ZebszPehvdtYXT4Hu-r8aS4aONG99_8_/view?usp=sharing)

RobertCMa commented 6 years ago

Hi @rwleea ,

We are checking the log and see what's the situation.

RobertCMa commented 6 years ago

Hi @rwleea ,

Firstly, The service/task of "PSME is marked as unreachable at 20:56:02 759" is independent with the ServiceRemovalTask at 20:56:14,948. The target will need to be marked as evicted and then removed. The exact removal of "https://10.3.0.16:8415/redfish/v1" starts at 20:56:26,608.

The removal task begins at checking if unreachable services should be marked:

2018-08-01 20:56:14,948 [EE-ManagedScheduledExecutorService-default-Thread-1] TRACE c.i.p.d.e.ServiceRemovalTask$Remover - checking whether unreachable services should be evicted <..skip..>

SERVICE(8b8a43dc-8ca3-11e8-9e02-17983f95fdb3) MARKED TO BE EVICTED:

2018-08-01 20:56:26,608 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3, baseUri=https://10.3.0.16:8415/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:56:02.909} is unreachable longer than PT0S - will be evicted.

SERVICE(8b8a43dc-8ca3-11e8-9e02-17983f95fdb3) REMOAVL:

2018-08-01 20:56:26,609 [EE-ManagedScheduledExecutorService-default-Thread-1] DEBUG c.i.p.d.e.ScheduledDiscoveryManager - Discovery cancelled for service 8b8a43dc-8ca3-11e8-9e02-17983f95fdb3

PodM is a server and it run hundreds of threads in parallel. Flow of marking service as absent and removing it, is not synchronous.

rwleea commented 6 years ago

Hi @RobertCMa ,

I know ExternalServiceAvailabilityCheckerTask & ServiceRemovalTask are run in parallel.

PSME is marked as unreachable at 20:56:02 by ExternalServiceAvailabilityCheckerTask , because this task finished with no exception, so I thought all changes shall be flushed to the database. But 20:56:14,948 ServiceRemovalTask didn't determine this PSME shall be removed, is it possible that ServiceRemovalTask use stale or older version external_service data to determine which PSME shall be removed ?

RobertCMa commented 6 years ago

Hi @rwleea

For the service (8b8a43dc-8ca3-11e8-9e02-17983f95fdb3), it was marked as evicted at 20:56:26,608 and removed later. Before that, it can see some activities polling the service. In the meantime, the removal task is evaluating other service, like (84da8402-8ca3-11e8-9781-5391e993bb7c). For the tasks, there is no straight time line relation between them. PodM is a server and it run hundreds of threads in parallel. Flow of marking service as absent and removing it, is not synchronous.

2018-08-01 20:56:16,197 [EE-ManagedExecutorService-SSDPExecutor-Thread-13] DEBUG org.apache.http.wire - http-outgoing-921891 << "{"@odata.context":"/redfish/v1/$metadata#ServiceRoot.ServiceRoot","@odata.id":"/redfish/v1","@odata.type":"#ServiceRoot.v1_1_1.ServiceRoot","Id":"RootService","Name":"PSME Service Root","Description":"Service Root description","RedfishVersion":"1.1.0","UUID":"8b8a43dc-8ca3-11e8-9e02-17983f95fdb3","EventService":{"@odata.id":"/redfish/v1/EventService"},"UpdateService":{"@odata.id":"/redfish/v1/UpdateService"},"Tasks":{"@odata.id":"/redfish/v1/TaskService"},"Registries":{"@odata.id":"/redfish/v1/Registries"},"Chassis":{"@odata.id":"/redfish/v1/Chassis"},"Systems":{"@odata.id":"/redfish/v1/Systems"},"Managers":{"@odata.id":"/redfish/v1/Managers"},"Fabrics":{"@odata.id":"/redfish/v1/Fabrics"},"TelemetryService":{"@odata.id":"/redfish/v1/TelemetryService"},"StorageServices":{"@odata.id":"/redfish/v1/StorageServices"},"Oem":{"Intel_RackScale":{"@odata.type":"#Intel.Oem.ServiceRoot","ApiVersion":"2.2.0","EthernetSwitches":{"@odata.id":"/redfish/v1/EthernetSwitches"}}},"Links":{}}" 2018-08-01 20:56:16,197 [EE-ManagedExecutorService-SSDPExecutor-Thread-13] DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager - Connection [id: 921891][route: {s}->https://10.3.0.16:8415][state: CN=SMCIPMI, O=SMCI, L=Taipei, ST=Taiwan, C=TW] can be kept alive indefinitely

2018-08-01 20:56:16,198 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-8] INFO c.i.p.d.e.ServiceDetectionListenerImpl - Service {URI=https://10.3.0.16:8415/redfish/v1, type=PSME, UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3} detected

2018-08-01 20:56:25,802 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-6] INFO c.i.p.d.external.DiscoveryRunner - Polling data from ExternalService {UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3, baseUri=https://10.3.0.16:8415/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:56:02.909} finished

2018-08-01 20:56:26,608 [EE-ManagedScheduledExecutorService-default-Thread-1] INFO c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=8b8a43dc-8ca3-11e8-9e02-17983f95fdb3, baseUri=https://10.3.0.16:8415/redfish/v1, type=PSME, unreachableSince=2018-08-01T20:56:02.909} is unreachable longer than PT0S - will be evicted.

RobertCMa commented 6 years ago

Hi @rwleea,

As there is no activity for a while, will conclude this issue by end of the week. Thanks.

RobertCMa commented 6 years ago

Hi @rwleea ,

Conclude this issue as no activities for a while. New issues can be raised for the investigation. Thanks.