intel / intelRSD

Intel® Rack Scale Design Reference Software
http://intel.com/IntelRSD

Connection error while getting data from ExternalService #97

Open chendave opened 6 years ago

chendave commented 6 years ago

Dear developers,

At one point PODM reported that the computer systems were in the "InTest" state, and I was unable to compose nodes. I looked up the state in "pod-manager-user-guide-v2-1.pdf" and then executed the command below:

$ sudo /usr/bin/pod-manager-clean-database-on-next-startup

and then restarted the pod-manager service:

$ sudo systemctl restart pod-manager

But now I cannot get any computer systems back:

$ curl -k -u admin:admin https://10.3.0.1:8443/redfish/v1/Systems
{
  "@odata.context" : "/redfish/v1/$metadata#Systems",
  "@odata.id" : "/redfish/v1/Systems",
  "@odata.type" : "#ComputerSystemCollection.ComputerSystemCollection",
  "Name" : "Computer System Collection",
  "Description" : "Computer System Collection",
  "Members@odata.count" : 0,
  "Members" : [ ]
}

/var/log/pod-manager/pod-manager-application.log gives some hints about this abnormal behavior:

...
WARN c.i.p.d.external.DiscoveryRunner - Connection error while getting data from ExternalService {UUID=4c4c4544-434d-1001-8000-d0946609a764, baseUri=http://10.3.2.248:80/redfish/v1, type=PSME, unreachableSince=2018-10-17T01:27:45.322} service - performing check on this service
2018-10-17 02:22:41,120 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-5] DEBUG c.i.p.d.e.ExternalServiceAvailabilityCheckerTask - Verifying service with UUID 4c4c4544-434d-1001-8000-d0946609a764
2018-10-17 02:22:41,783 [EE-ManagedScheduledExecutorService-TasksExecutor-Thread-5] DEBUG c.i.p.d.e.ExternalServiceAvailabilityCheckerTask - Service ExternalService {UUID=4c4c4544-434d-1001-8000-d0946609a764, baseUri=http://10.3.2.248:80/redfish/v1, type=PSME, unreachableSince=2018-10-17T01:27:45.322} still exists
...
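When the full log is long, the ExternalService entries can be pulled out with a short script. The following is a minimal Python sketch, assuming every descriptor has the field layout seen in the excerpt above (UUID, baseUri, type, unreachableSince). `parse_external_services` is a hypothetical helper written for illustration, not something shipped with PODM.

```python
import re

# Matches the "ExternalService {...}" descriptor as it appears in the log
# excerpt above. The field order is an assumption based on that excerpt only.
SERVICE_RE = re.compile(
    r"ExternalService \{UUID=(?P<uuid>[0-9a-fA-F-]+), "
    r"baseUri=(?P<base_uri>[^,]+), "
    r"type=(?P<type>[^,]+), "
    r"unreachableSince=(?P<unreachable_since>[^}]+)\}"
)


def parse_external_services(log_text):
    """Return one dict per distinct service UUID found in the log text.

    Hypothetical helper: if a UUID appears several times, the last
    occurrence wins.
    """
    services = {}
    for match in SERVICE_RE.finditer(log_text):
        info = match.groupdict()
        services[info["uuid"]] = info
    return list(services.values())


# Sample line copied from the log excerpt in this report.
SAMPLE_LOG = (
    "WARN c.i.p.d.external.DiscoveryRunner - Connection error while getting "
    "data from ExternalService {UUID=4c4c4544-434d-1001-8000-d0946609a764, "
    "baseUri=http://10.3.2.248:80/redfish/v1, type=PSME, "
    "unreachableSince=2018-10-17T01:27:45.322} service - performing check "
    "on this service"
)

for service in parse_external_services(SAMPLE_LOG):
    print(service["uuid"], service["base_uri"], service["unreachable_since"])
```

To run it against the real file, read /var/log/pod-manager/pod-manager-application.log into a string and pass it to the function. Each baseUri it returns can then be checked by hand with curl.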

But the network and the PSME service are fine. I can connect and get the systems back when I query the PSME directly:

$ curl http://10.3.2.248:80/redfish/v1/Systems
{
  "@odata.context": "/redfish/v1/$metadata#Systems",
  "@odata.id": "/redfish/v1/Systems",
  "@odata.type": "#ComputerSystemCollection.ComputerSystemCollection",
  "Name": "Computer System Collection",
  "Members@odata.count": 5,
  "Members": [
    { "@odata.id": "/redfish/v1/Systems/Rack1-Block2-Sled2-Node1" },
    { "@odata.id": "/redfish/v1/Systems/Rack1-Block2-Sled4-Node1" },
    { "@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled1-Node1" },
    { "@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled2-Node1" },
    { "@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled3-Node1" }
  ]
}
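The mismatch between the two responses can be checked mechanically. Below is a minimal Python sketch that compares the member counts of two saved Redfish collection responses. The helper names `member_count` and `count_gap` are hypothetical, and the payloads are abbreviated copies of the curl outputs in this report. Only counts are compared, because resource IDs are not assumed to match between PODM and the PSME.

```python
import json


def member_count(collection_json):
    """Return the number of members in a Redfish collection payload.

    Uses "Members@odata.count" when present and falls back to the length
    of the "Members" array otherwise.
    """
    doc = json.loads(collection_json)
    if "Members@odata.count" in doc:
        return int(doc["Members@odata.count"])
    return len(doc.get("Members", []))


def count_gap(podm_json, psme_json):
    """Return how many systems the PSME lists that PODM does not (by count)."""
    return member_count(psme_json) - member_count(podm_json)


# Abbreviated copies of the two responses shown in this report.
PODM_RESPONSE = '{"Members@odata.count": 0, "Members": []}'
PSME_RESPONSE = json.dumps({
    "Members@odata.count": 5,
    "Members": [
        {"@odata.id": "/redfish/v1/Systems/Rack1-Block2-Sled2-Node1"},
        {"@odata.id": "/redfish/v1/Systems/Rack1-Block2-Sled4-Node1"},
        {"@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled1-Node1"},
        {"@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled2-Node1"},
        {"@odata.id": "/redfish/v1/Systems/Rack1-Block3-Sled3-Node1"},
    ],
})

print("systems missing from PODM:", count_gap(PODM_RESPONSE, PSME_RESPONSE))
```

A non-zero gap only confirms that discovery has not imported the PSME's systems. It does not say why.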

I found a similar issue here: https://github.com/intel/intelRSD/issues/58, and it looks like this is related to the service UUID. How can I purge all of that data and poll everything again? Is there any configuration item I need to update to fix the issue? What is the root cause of this issue?

Thanks a lot for any input!

pod-manager-application.log

chendave commented 6 years ago

BTW, the PODM version I am using is 2.1.