intel / intelRSD

Intel® Rack Scale Design Reference Software
http://intel.com/IntelRSD

Pod-manager cannot detect PSME #58

Closed rwleea closed 6 years ago

rwleea commented 6 years ago

Dear Intel,

I found an issue where pod-manager cannot detect a PSME after the PSME is restarted.

steps:

  1. Kill PSME restful server & compute/chassis agent
  2. Start PSME restful server & compute/chassis agent

After performing those steps, I expect to see this PSME in external_service, but I cannot find it, so I checked the log:

Log:

Thread-1 (Discovery job)

2018-04-07 23:47:22,404 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] INFO c.i.p.d.external.DiscoveryRunner - Polling data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME} started
2018-04-07 23:47:22,821 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] WARN c.i.p.d.external.DiscoveryRunner - Connection error while getting data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME} service - performing check on this service


Thread-8 (Service removal job)

2018-04-07 23:48:01,162 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-8] INFO c.i.p.d.external.ServiceRemovalTask[ServiceLifecycle] - ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME, unreachableSince=2018-04-07T23:47:23.235} is unreachable longer than PT0S - will be evicted.


Thread-3 (Service detect job)

2018-04-07 23:48:02,077 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] DEBUG c.i.p.d.e.ServiceDetectionListenerImpl - Service Detect ..Schedule deep disvocery for 52dc9db9-784a-4857-9c7c-3fab259fb13f
2018-04-07 23:48:02,078 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] DEBUG c.i.p.c.s.SerialExecutorImpl - Registering new async task for 52dc9db9-784a-4857-9c7c-3fab259fb13f
2018-04-07 23:48:02,198 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] INFO c.i.p.d.e.event.EventSubscribeRunner - Successfully subscribed to event service for '52dc9db9-784a-4857-9c7c-3fab259fb13f'
2018-04-07 23:48:02,200 [ForkJoinPool.commonPool-worker-1] DEBUG c.i.p.c.s.SerialExecutorImpl - Requesting execution of next ASYNC operation(DiscoveryRunner(52dc9db9-784a-4857-9c7c-3fab259fb13f)) for (52dc9db9-784a-4857-9c7c-3fab259fb13f)


Thread-7 (Discovery Job)

2018-04-07 23:48:02,216 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] INFO c.i.p.d.external.DiscoveryRunner - Polling data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME, unreachableSince=2018-04-07T23:47:23.235} started
2018-04-07 23:48:02,317 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] DEBUG c.i.p.c.e.u.l.MethodInvocationLoggingInterceptor - [Discovery] RestGraphBuilderImpl.build - started
2018-04-07 23:48:02,378 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] DEBUG c.i.p.c.e.u.l.MethodInvocationLoggingInterceptor - [Discovery] RestGraphBuilderImpl.build - ended, execution time: 60.98 ms
2018-04-07 23:48:05,381 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] DEBUG c.i.p.c.e.u.l.MethodInvocationLoggingInterceptor - [Discovery] EntityGraphMapper.map - started
2018-04-07 23:48:05,396 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] ERROR c.i.p.d.external.DiscoveryRunner - Error while polling data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME, unreachableSince=2018-04-07T23:47:23.235}
javax.ejb.EJBTransactionRolledbackException: there is no service with UUID '52dc9db9-784a-4857-9c7c-3fab259fb13f'


I think the sequence is like this:

  1. Thread-1: pod-manager cannot connect to the PSME (because I killed it).

  2. Thread-8: the service removal task prepares to remove the PSME from the external_service table.

  3. Thread-3: service detection fires (because I started the PSME), but pod-manager picks up the old external_service row that Thread-8 is about to delete. You can see this in the log line "2018-04-07 23:48:02,216 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-7] INFO c.i.p.d.external.DiscoveryRunner - Polling data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME, unreachableSince=2018-04-07T23:47:23.235} started" - why would a newly detected service already have unreachableSince?

  4. Thread-7: pod-manager starts polling the PSME, Thread-8 removes the PSME from the external_service table, and pod-manager throws "there is no service with UUID '52dc9db9-784a-4857-9c7c-3fab259fb13f'" because the row in external_service was removed by Thread-8.

Here is my assumption about why pod-manager cannot persist the newly detected PSME to its database, even though the PSME keeps running:

1) The pod-manager service detection thread reads the old external_service row, and that row is then deleted by the service removal thread. As a result, the detection-triggered discovery fails and the PSME does not re-enter the external_service table.

2) The PSME is running and keeps sending alive messages. Even though it is not in the external_service table, it is still in the SSDP registry cache. When pod-manager receives an M-SEARCH response or a notification from this PSME, it treats the PSME as an already-known service (because it is in the SSDP registry cache) and never fires a service detection task again.
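The suspected interleaving can be sketched with a minimal model. The map and the method names (detect / evict / poll) below are illustrative stand-ins, not PODM's real API:

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Minimal model of the suspected race between service detection, service
// removal, and discovery. All names here are hypothetical, not PODM code.
public class EvictionRaceSketch {
    static final Map<String, String> externalService = new ConcurrentHashMap<>();

    // Thread-3: detection finds the stale row (still carrying
    // unreachableSince) instead of inserting a fresh one.
    static Optional<String> detect(String uuid) {
        return Optional.ofNullable(externalService.get(uuid));
    }

    // Thread-8: the removal task evicts the row.
    static void evict(String uuid) {
        externalService.remove(uuid);
    }

    // Thread-7: the discovery runner needs the row and fails when it is gone.
    static String poll(String uuid) {
        String baseUri = externalService.get(uuid);
        if (baseUri == null) {
            throw new IllegalStateException("there is no service with UUID '" + uuid + "'");
        }
        return baseUri;
    }

    static String run() {
        String uuid = "52dc9db9-784a-4857-9c7c-3fab259fb13f";
        externalService.put(uuid, "https://10.3.0.113:8407/redfish/v1");
        detect(uuid);   // sees the stale row
        evict(uuid);    // removal wins the race
        try {
            poll(uuid); // discovery then fails, as in the log
            return "no error";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints the same message as the exception in the log
    }
}
```

In the real system the three steps run on different executor threads, so whether the failure appears depends on timing, which would explain why the issue is intermittent.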

Could you please help to check this issue?

RobertCMa commented 6 years ago

Hi @rwleea,

Some questions will need your clarification.

1/ Please elaborate more on the procedure, e.g. the commands used to kill/start the PSME REST server & compute/chassis agents. The details will help us understand the procedure and sequence.

2/ Is the PSME never detected after your steps, or is it detected after a long while (hours or more)? Which discovery mechanism is adopted for the PSME: DHCP, SSDP, or both? It would also be helpful to share the PSME REST server configuration.

3/ Please collect the PODM health log by running "sudo pod-manager-healthcheck", which generates one tar file of logs.

rwleea commented 6 years ago

Hi Robert,

Thanks for your help. Please see my replies to your questions inline.

1/ Please elaborate more on the procedure, e.g. the commands used to kill/start the PSME REST server & compute/chassis agents. The details will help us understand the procedure and sequence.

--> Our PSME runs in an LXC container. I restart the container running the PSME using "lxc restart <container>"; after the container starts, a python script runs automatically, which executes:

/usr/local/bin/psme-rest-server /usr/local/etc/psme-config/server.json
/usr/local/bin/psme-chassis /usr/local/etc/psme-config/chassis.json
/usr/local/bin/psme-compute /usr/local/etc/psme-config/compute.json

2/ Is the PSME never detected after your steps, or is it detected after a long while (hours or more)? Which discovery mechanism is adopted for the PSME: DHCP, SSDP, or both? It would also be helpful to share the PSME REST server configuration.

--> Yes, the PSME is never detected (I already waited for hours) since it is already in the SSDP registry, so I restarted the PSME to make it send a byebye message; after that, PODM could detect the PSME.

--> We only adopt SSDP to discover the PSME.

--> PSME rest server configuration :


{
    "commands": {
        "generic": "Registration"
    },
    "event-service": {
        "delivery-retry-attempts": 3,
        "delivery-retry-interval-seconds": 60
    },
    "eventing": {
        "address": "localhost",
        "poll-interval-sec": 20,
        "port": 5504
    },
    "logger": {
        "app": {
            "color": true,
            "level": "INFO",
            "moredebug": false,
            "output": true,
            "streams": [
                {
                    "file": "/var/log/psme/psme-rest-004.log",
                    "type": "FILE"
                }
            ],
            "tagging": true,
            "timeformat": "DATE_NS"
        }
    },
    "registration": {
        "minDelay": 3,
        "port": 8304
    },
    "rest": {
        "service-root-name": "PSME Service Root"
    },
    "retention-policy": {
        "interval-sec": 600,
        "outdated-sec": 2419200
    },
    "rmm-present": false,
    "server": {
        "connectors": [
            {
                "certs-directory": "/etc/psme/certs",
                "client-cert-required": true,
                "port": 8404,
                "thread-mode": "select",
                "use-ssl": true
            },
            {
                "port": 8804,
                "redirect-port": 8404,
                "thread-mode": "select",
                "thread-pool-size": 1,
                "use-ssl": false
            }
        ],
        "network-interface-name": "ethmgmt"
    },
    "service-uuid-file": "/usr/local/etc/psme-uuid//rest-server-service-uuid-004.json",
    "ssdp-service": {
        "announce-interval-seconds": 10,
        "enabled": true,
        "ttl": 2
    },
    "subscription-config-file": "/tmp/subscriptions",
    "timeout": 100000
}

3/ Please collect the PODM health log by running "sudo pod-manager-healthcheck", which generates one tar file of logs.

--> https://www.dropbox.com/s/lg79xvpzjn3pvqz/pod_manager_health_check_2018-04-12-07-03-36-136.tar.gz?dl=0

RobertCMa commented 6 years ago

Hi @rwleea,

Unfortunately some logs have been overwritten and do not contain the occurrence time frame (e.g. the ssdp log). We are still checking to see whether any hint is left. It would be great if you could reproduce the issue again and capture the health log right away.

To clarify the container usage: are all the PSME REST servers (10.3.0.87/92/93/94/113/etc.....) virtualized from the same system? We are confused why they share the same UUID.

In the meantime, you can adjust the retention policy to "0", as described in section 3.6 of the PODM user guide, and see whether asset discovery is updated in that case.
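As a side note on the retention setting, the "PT0S" in the eviction log is an ISO-8601 duration meaning zero seconds, i.e. no grace period before an unreachable service becomes eligible for eviction. java.time parses it directly:

```java
import java.time.Duration;

public class RetentionDelayDemo {
    public static void main(String[] args) {
        // "unreachable longer than PT0S" in the log means a zero-second grace
        // period: an unreachable service can be evicted immediately.
        Duration delay = Duration.parse("PT0S");
        System.out.println(delay.isZero()); // prints "true"
    }
}
```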

rwleea commented 6 years ago

Hi Robert,

Sorry for the late reply. This case happens occasionally; I tried to reproduce it, but I could not.

I think there may be two problems:

  1. Depending on how long PODM spends cancelling related tasks, the ServiceRemoval task may take some time to remove the PSME from external_service. Meanwhile, in another thread, ServiceDetectionListener detects the same PSME and reuses the old row in the external_service table. Finally, ServiceRemoval removes that old row from the external_service table, resulting in javax.ejb.EJBTransactionRolledbackException:
public class ServiceRemovalTask implements Runnable {
//...
    public void run() {
        logger.t("checking whether unreachable services should be evicted");

        Duration howLongCanBeRetained = config.get().getServiceRemovalDelay();

        List<ExternalService> unreachableServices = services.getAllUnreachableLongerThan(howLongCanBeRetained);

        for (ExternalService service : unreachableServices) {
            logger.lifecycleInfo(
                    "{} is unreachable longer than {} - will be evicted.", service, howLongCanBeRetained
                );
            discoveryScheduler.cancel(service.getUuid());
            //ServiceDetectionListenerImpl  service detect...
            //DiscoveryRunner  start to polling
            services.remove(service);
            //DiscoveryRunner  throw transaction rollback exception due to missing external_service data
        }
    }
  2. The data in the external_service table is inconsistent with the SSDP registry. When the EJBTransactionRolledbackException occurs, PODM does not remove this PSME from the SSDP registry, but the PSME has already been removed from the external_service table by the ServiceRemoval task:
//remove PSME from SSDP registry
2018-04-07 23:47:22,821 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-1] WARN c.i.p.d.external.DiscoveryRunner - Connection error while getting data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME} service - performing check on this service

//add PSME  to SSDP registry / external_service table
2018-04-07 23:48:02,077 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] DEBUG c.i.p.d.e.ServiceDetectionListenerImpl - Service Detect ..Schedule deep disvocery for 52dc9db9-784a-4857-9c7c-3fab259fb13f

// remove PSME from external_service table (still in SSDP registry)
c.i.p.d.external.DiscoveryRunner - Error while polling data from ExternalService {UUID=52dc9db9-784a-4857-9c7c-3fab259fb13f, baseUri=https://10.3.0.113:8407/redfish/v1, type=PSME, unreachableSince=2018-04-07T23:47:23.235}
javax.ejb.EJBTransactionRolledbackException: there is no service with UUID '52dc9db9-784a-4857-9c7c-3fab259fb13f'
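One possible way to close this window, sketched here as a suggestion only (PerUuidSerializer and its method are hypothetical helpers, not PODM classes), is to serialize all lifecycle actions for a given service UUID so that removal cannot interleave between detection and the first discovery run:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of a mitigation: funnel detect / evict / poll for the same
// service UUID through one lock, so the sequence seen in the log
// (detect, then evict, then a failing poll) cannot interleave mid-operation.
public class PerUuidSerializer {
    private final Map<String, Object> locks = new ConcurrentHashMap<>();

    private Object lockFor(String uuid) {
        // One lock object per service UUID, created on first use.
        return locks.computeIfAbsent(uuid, k -> new Object());
    }

    // Every lifecycle action for a UUID runs under that UUID's lock.
    public void withService(String uuid, Runnable action) {
        synchronized (lockFor(uuid)) {
            action.run();
        }
    }

    public static void main(String[] args) {
        PerUuidSerializer serializer = new PerUuidSerializer();
        StringBuilder order = new StringBuilder();
        String uuid = "507da848-ef5c-41d6-b4e3-80dc44d0a32f";
        serializer.withService(uuid, () -> order.append("detect;"));
        serializer.withService(uuid, () -> order.append("evict;"));
        System.out.println(order); // prints "detect;evict;"
    }
}
```

Whether this fits PODM's EJB transaction model is for the maintainers to judge; the alternative of removing the service from the SSDP registry cache whenever the external_service row is evicted would also keep the two stores consistent.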

Could you please help to check those problems?

RobertCMa commented 6 years ago

Hi @rwleea,

The service detect/remove/verify flows look up the UUID in the database to identify the service entity. Given the earlier confusion of many systems using the same UUID in the environment, we cannot predict whether the database is built up correctly. As the warn/error logs below suggest, we are concerned whether the data is correct and in sync. To further isolate the syndrome, we suggest ruling out that situation and reproducing the syndrome again. Thanks.

Connection error while getting data from ExternalService
Error while polling data from ExternalService

rwleea commented 6 years ago

Hi Robert,

We also do not allow multiple PSMEs to share the same service UUID.

In our environment the PSME runs in a container, and users can recreate containers. We record each PSME service UUID together with its location in a config file. If a user recreates the container that a PSME runs in, we start the same PSME service with the same UUID after the container is created.

So you will see many PSMEs with different URIs using the same UUID: because the container was recreated, the PSME IP changed.

{UUID=adec2482-b6d7-4a6b-8bc4-7080d06b43c7, baseUri=https://10.3.0.87:8424/redfish/v1, type=PSME}
{UUID=adec2482-b6d7-4a6b-8bc4-7080d06b43c7, baseUri=https://10.3.0.117:8424/redfish/v1, type=PSME}
{UUID=adec2482-b6d7-4a6b-8bc4-7080d06b43c7, baseUri=https://10.3.0.113:8424/redfish/v1, type=PSME}
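For illustration only (a plain HashMap, not PODM's actual registry): if services are keyed by UUID alone, a container restart that changes the IP simply replaces the previous entry, which is why the same UUID legitimately appears with several base URIs over time:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: keying services by UUID alone means a restart with a
// new IP overwrites the old entry rather than creating a duplicate service.
public class UuidKeyedRegistry {
    public static void main(String[] args) {
        Map<String, String> byUuid = new HashMap<>();
        String uuid = "adec2482-b6d7-4a6b-8bc4-7080d06b43c7";

        byUuid.put(uuid, "https://10.3.0.87:8424/redfish/v1");
        byUuid.put(uuid, "https://10.3.0.113:8424/redfish/v1"); // after container recreation

        System.out.println(byUuid.size());    // prints "1"
        System.out.println(byUuid.get(uuid)); // prints the newest base URI
    }
}
```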

rwleea commented 6 years ago

Hi Robert,

This issue happened again in my environment. The user restarted the container, which is equivalent to running:

/usr/local/bin/psme-rest-server /usr/local/etc/psme-config/server.json
/usr/local/bin/psme-chassis /usr/local/etc/psme-config/chassis.json
/usr/local/bin/psme-compute /usr/local/etc/psme-config/compute.json

From the log, it looks like a concurrency issue.

2018-05-04 02:57:01,997 [EE-ManagedExecutorService-SSDPExecutor-Thread-18] DEBUG c.i.p.s.d.ssdp.SsdpRegistryObserver - SSDP resource removed: (RemoteDevice) Identity: (RemoteDeviceIdentity) UDN: uuid:507da848-ef5c-41d6-b4e3-80dc44d0a32f, Descriptor: https://10.3.0.3:8415/redfish/v1, Root: true

2018-05-04 02:57:01,998 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-5] DEBUG c.i.p.d.e.ExternalServiceAvailabilityCheckerTask - Verifying service with UUID 507da848-ef5c-41d6-b4e3-80dc44d0a32f

2018-05-04 02:57:09,312 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-5] INFO  c.i.p.d.e.ExternalServiceAvailabilityCheckerTask[ServiceLifecycle] - Service ExternalService {UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f, baseUri=https://10.3.0.3:8415/redfish/v1, type=PSME} has been set to ABSENT state. Scheduling removal of service after PT0S

2018-05-04 02:57:11,442 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-8] INFO  c.i.p.d.e.ServiceRemovalTask$Remover[ServiceLifecycle] - ExternalService {UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f, baseUri=https://10.3.0.3:8415/redfish/v1, type=PSME, unreachableSince=2018-05-04T02:57:09.424} is unreachable longer than PT0S - will be evicted. //start to cancel tasks & remove external_service

2018-05-04 02:57:11,860 [EE-ManagedExecutorService-SSDPExecutor-Thread-18] DEBUG c.i.p.s.d.ssdp.SsdpRegistryObserver - SSDP resource detected: (RemoteDevice) Identity: (RemoteDeviceIdentity) UDN: uuid:507da848-ef5c-41d6-b4e3-80dc44d0a32f, Descriptor: https://10.3.0.3:8415/redfish/v1, Root: true

2018-05-04 02:57:11,860 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] INFO  c.i.p.d.e.ServiceDetectionListenerImpl - Service {URI=https://10.3.0.3:8415/redfish/v1, type=PSME, UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f} detected //find external service (do not need to create a new one)

2018-05-04 02:57:11,954 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-2] INFO  c.i.p.d.external.DiscoveryRunner - Polling data from ExternalService {UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f, baseUri=https://10.3.0.3:8415/redfish/v1, type=PSME, unreachableSince=2018-05-04T02:57:09.424} started

2018-05-04 02:57:12,058 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-2] ERROR c.i.p.d.external.DiscoveryRunner - Error while polling data from ExternalService {UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f, baseUri=https://10.3.0.3:8415/redfish/v1, type=PSME, unreachableSince=2018-05-04T02:57:09.424}
javax.ejb.EJBTransactionRolledbackException: there is no service with UUID '507da848-ef5c-41d6-b4e3-80dc44d0a32f'//external_service has been removed by ServiceRemovalTask(TasksExecutor-Thread-8)

You can found all related log from : https://www.dropbox.com/s/mn8e87l05gj5xv8/podm_log.tar.gz?dl=0

Thanks for your help!

RobertCMa commented 6 years ago

Hi @rwleea,

Something is confusing me and I would like your help clarifying it. What RSD version is adopted in this environment (pod_manager_health_check_2018-05-04-08-26-00-455)? Please help identify the PODM and PSME version info.

From this log, the database seems to have run into issues, and it cannot be determined whether the data stays in sync.

// The last log in postgresql stops at 2018-05-04 02:54:56.648 and the database server is not seen coming back:
2018-05-04 02:54:56.648 UTC [podm] [127.0.0.1] [42883]STATEMENT: select endpoint0_.id as id225, endpoint01.event_source_context as event_so325, endpoint01.version as version425, endpoint01.description as descript525, endpoint01.global_id as global_i625, endpoint01.is_complementary as is_compl725, endpoint01.name as name825, endpoint01.status as status925, endpoint0_.endpoint_authentication_password as endpoint132, endpoint0_.endpoint_authentication_username as endpoint232, endpoint0_.composed_node_id as compose1132, endpoint0_.computer_system_id as compute1232, endpoint0_.entity_id as entity_i332, endpoint0_.fabric_id as fabric_1332, endpoint0_.host_reservation_memory_bytes as host_res432, endpoint0_.endpoint_metadata_id as endpoin1432, endpoint0_.device_id as device_i532, endpoint0_.subsystem_id as subsyste632, endpoint0_.subsystem_vendor_id as subsyste732, endpoint0_.vendor_id as vendor_i832, endpoint0_.processor_id as process1532, endpoint0_.endpoint_protocol as endpoint932, endpoint0_.storage_service_id as storage1632, endpoint0_.zone_id as zone_id1732 from endpoint endpoint0_ inner join discoverable_entity endpoint01 on endpoint0_.id=endpoint01.id inner join endpointidentifier identifier1 on endpoint0.id=identifier1.endpointid where identifier1.durable_name like ('%'||$1||'%')

// The discovery task met a problem retrieving the data from the database:
2018-05-04 02:54:56,654 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-3] ERROR c.i.p.d.external.DiscoveryRunner - Error while polling data from ExternalService {UUID=507da848-ef5c-41d6-b4e3-80dc44d0a32f, baseUri=https://10.3.0.3:8415/redfish/v1, type=PSME}