intel / SGXDataCenterAttestationPrimitives


Error: unexpected error happend during sending data to cache server. #220

Open rranjan3 opened 2 years ago

rranjan3 commented 2 years ago

I have set up PCCS in a k8s cluster on a non-SGX machine and configured an SGX device to communicate with this PCCS service. I have also installed PCKIDRetrievalTool as mentioned here.

Initially the setup is healthy and I am able to run the PCKIDRetrievalTool. But as more nodes are added over time, the PCKIDRetrievalTool starts to fail. Nodes that had succeeded earlier also start to fail, with the following log -

ubuntu@myserver40:/opt/intel/sgx-pck-id-retrieval-tool$ sudo PCKIDRetrievalTool
Intel(R) Software Guard Extensions PCK Cert ID Retrieval Tool Version 1.12.101.1
Warning: platform manifest is not available or current platform is not multi-package platform.
Error: unexpected error happend during sending data to cache server.
pckid_retrieval.csv has been generated successfully, however the data couldn't be sent to cache server!

Corresponding log seen on the PCCS is -

2022-01-06 05:27:32.536 [info]: Client Request-ID : de5abc9c21634101a833657761ffe127
2022-01-06 05:27:33.103 [info]: Request-ID is : 450d6aae91664f39ba589027383854f2
2022-01-06 05:27:33.106 [error]: Error: No cache data for this platform.
    at Proxy.getPckCertFromPCS (/opt/intel/pccs/services/logic/commonCacheLogic.js:86:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async LazyCachingMode.registerPlatforms (/opt/intel/pccs/services/caching_modes/cachingMode.js:163:7)
    at async Proxy.registerPlatforms (/opt/intel/pccs/services/platformsRegService.js:107:3)
    at async postPlatforms (/opt/intel/pccs/controllers/platformsController.js:40:5)
2022-01-06 05:27:33.111 [info]: 10.0.0.249 - - [06/Jan/2022:05:27:33 +0000] "POST /sgx/certification/v3/platforms HTTP/1.1" 404 32 "" ""

Resetting SGX on the device sometimes solves the issue, and the PCKIDRetrievalTool can be run again. I do not understand what changes over time (usually after 24-30 hours of the PCCS running) to cause this failure.

These are production machines, and with the appropriate API keys the tool succeeds initially.

EDIT: Worth mentioning: I do see a restart of the PCCS service, after which we observe this behavior. It's not very clear what causes the restart, but it is the AWS instance (the non-SGX machine mentioned above) that restarts, not the PCCS service alone.

jsun39 commented 2 years ago

What is your platform's type? ICX-SP platform? Have you done registration?

rranjan3 commented 2 years ago

Yes. this

rranjan3 commented 2 years ago

> What is your platform's type? ICX-SP platform? Have you done registration?

What registration? Do you mean getting the API key?

jsun39 commented 2 years ago

We have synced in Teams. When you have any results, please update this thread.

rranjan3 commented 2 years ago

For this instance, I redeployed PCCS with an updated password (12 characters; earlier it was 8), and this has worked for me. For background, we suspected the PCCS service because the same SGX device worked with another known-good PCCS server. I am now observing the system to see whether there is a time factor after all.

rranjan3 commented 2 years ago

I have now observed the behavior on another PCCS service. Earlier we had suspected that a network issue could be causing this, but that hypothesis seems to be ruled out: I can run the PCKIDRetrievalTool from another SGX machine against the same PCCS that returns 404 for one SGX machine.

Log for machine 1 -

2022-03-21 08:32:10.347 [info]: Client Request-ID : 148260fddb9f4b53a1132c557c47bb52
2022-03-21 08:32:11.518 [info]: Request-ID is : 9f8af4449ea64e879d9f1e8749ef83e9
2022-03-21 08:32:11.568 [info]: Request-ID is : d98dc665d92146b590f8dd64506321e8
2022-03-21 08:32:11.647 [info]: Request-ID is : ba7d47a0cc664807adb2a5abed43d3ed
2022-03-21 08:32:11.710 [info]: Request-ID is : fbc4862df75b4fa099b222409e160858
2022-03-21 08:32:11.730 [info]: 10.0.0.10 - - [21/Mar/2022:08:32:11 +0000] "POST /sgx/certification/v3/platforms HTTP/1.1" 200 21 "-" "-"

Log for machine 2 (which had been running fine but is getting 404 now) -

2022-03-21 08:32:44.517 [info]: Client Request-ID : cee8916bfc784794b582099ca5002918
2022-03-21 08:32:44.652 [info]: Request-ID is : 69651c7016304234b5686065d82f827a
2022-03-21 08:32:44.652 [error]: Error: No cache data for this platform.
    at Proxy.getPckCertFromPCS (/opt/intel/pccs/services/logic/commonCacheLogic.js:86:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async LazyCachingMode.registerPlatforms (/opt/intel/pccs/services/caching_modes/cachingMode.js:163:7)
    at async Proxy.registerPlatforms (/opt/intel/pccs/services/platformsRegService.js:107:3)
    at async postPlatforms (/opt/intel/pccs/controllers/platformsController.js:40:5)
2022-03-21 08:32:44.653 [info]: 10.0.0.10 - - [21/Mar/2022:08:32:44 +0000] "POST /sgx/certification/v3/platforms HTTP/1.1" 404 32 "-" "-"

Second attempt log for machine 1 -

2022-03-21 08:33:03.715 [info]: Client Request-ID : e211b4cf93914d81b8d0e0ad9ac9f5e4
2022-03-21 08:33:04.596 [info]: Request-ID is : e05f1256224d40fd870761a5a2262029
2022-03-21 08:33:04.642 [info]: Request-ID is : d98dc665d92146b590f8dd64506321e8
2022-03-21 08:33:04.678 [info]: 10.0.0.10 - - [21/Mar/2022:08:33:04 +0000] "POST /sgx/certification/v3/platforms HTTP/1.1" 200 21 "-" "-"

Second attempt log for machine 2 -

2022-03-21 08:33:30.031 [info]: Client Request-ID : c3b85a2555f44f61877cb71916ea8bc2
2022-03-21 08:33:30.218 [info]: Request-ID is : bac15237cb19483e83a708566095fb56
2022-03-21 08:33:30.218 [error]: Error: No cache data for this platform.
    at Proxy.getPckCertFromPCS (/opt/intel/pccs/services/logic/commonCacheLogic.js:86:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async LazyCachingMode.registerPlatforms (/opt/intel/pccs/services/caching_modes/cachingMode.js:163:7)
    at async Proxy.registerPlatforms (/opt/intel/pccs/services/platformsRegService.js:107:3)
    at async postPlatforms (/opt/intel/pccs/controllers/platformsController.js:40:5)
2022-03-21 08:33:30.219 [info]: 10.0.0.10 - - [21/Mar/2022:08:33:30 +0000] "POST /sgx/certification/v3/platforms HTTP/1.1" 404 32 "-" "-"

A different tool calling the PCCS from machine 2 also gets a 404 -

2022-03-21 08:33:52.639 [info]: Client Request-ID : 9ac26005a6e24f1ca655a369ed5cdfb3
2022-03-21 08:33:52.763 [info]: Request-ID is : a6f96046d7194d1c83a96ef5f92832ba
2022-03-21 08:33:52.764 [error]: Error: No cache data for this platform.
    at Proxy.getPckCertFromPCS (/opt/intel/pccs/services/logic/commonCacheLogic.js:86:11)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
    at async LazyCachingMode.getPckCertFromPCS (/opt/intel/pccs/services/caching_modes/cachingMode.js:126:12)
    at async Proxy.getPckCert (/opt/intel/pccs/services/pckcertService.js:115:16)
    at async getPckCert (/opt/intel/pccs/controllers/pckcertController.js:77:25)
2022-03-21 08:33:52.765 [info]: 10.0.0.10 - - [21/Mar/2022:08:33:52 +0000] "GET /sgx/certification/v3/pckcert?qeid=93BED1E824DC21EC83D3E1E65577233E&encrypted_ppid=497D243C1DEE14D8EF1418C02093664A9D1893EA80B9251E8367FF4E980971CC7CF55006256A2B5D0682BD07B6F1A5371394494ED05EAC72EA7D35EB6409EE42F1FC20E6E6A5A21674FF877E6B7AA210E299297DDA0DA5235A95788B5D7F07F362BBA5DF1BC9660B49A5B8E290B6C47F71D307E253C299576D334A825CBE6E4F9C6B3CE661E821BDD19E81A5944FB54FDEC6D5FFD4CE7B186B357E3ACD85F25F84A6C87B0418CC95ADF19BE8CB89CBBF74288D7EECEC8D3BEDC8F3A742265143EF0B969B49CC96B278AB9C8DD136DA2AE9826D1EA0F5725ED100255C686845223A5ABD7D6976F33DC1705802094329E69A6A8050B8D62ACC2606C087B9E690D6D55A5568748537EB33BA0B724805EFC2A8BD1F0D84B91965509F083900DC275C961A8717759856B672C26D65B9D34C735305147D17F27558F6DF4A2495A751C7177A82331A9E1191A1BF773020AB1961E25ABF6C4D1F2689B8C1760B0963CF94BAA539B19FE83FA207DD0C92D78B377C200FCC99BC8D1BE0B2774A7D89B33F2F&cpusvn=0505090AFFFF00000000000000000000&pcesvn=0C00&pceid=0000 HTTP/1.1" 404 32 "-" "-"
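
To compare the attempts above without reading every line, here is a minimal sketch (not part of PCCS) that tallies access-log entries by method, path, and status. It assumes the access-log line format shown in the logs above; `parse_access_line` and `tally` are hypothetical helper names. Note that it cannot tell the two machines apart by client address, since every request above is logged as coming from 10.0.0.10.

```python
import re
from collections import Counter

# Matches only the access-log lines in the format seen in this thread, e.g.
# 2022-03-21 08:32:11.730 [info]: 10.0.0.10 - - [21/Mar/2022:08:32:11 +0000] "POST /path HTTP/1.1" 200 21 "-" "-"
# Request-ID lines, [error] lines, and stack frames do not match.
ACCESS_RE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[info\]: '
    r'(?P<client>\S+) - - \[[^\]]+\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" (?P<status>\d{3}) '
)


def parse_access_line(line):
    """Return a dict for an access-log line, or None for any other line."""
    m = ACCESS_RE.match(line.strip())
    if m is None:
        return None
    # Drop the query string so long pckcert requests group under one path.
    path = m.group("target").split("?", 1)[0]
    return {
        "ts": m.group("ts"),
        "client": m.group("client"),
        "method": m.group("method"),
        "path": path,
        "status": int(m.group("status")),
    }


def tally(lines):
    """Count requests per (method, path, status)."""
    counts = Counter()
    for line in lines:
        rec = parse_access_line(line)
        if rec is not None:
            counts[(rec["method"], rec["path"], rec["status"])] += 1
    return counts
```

Running `tally(open("pccs.log"))` (the file name is a placeholder) and printing the result gives one count per endpoint and status, which makes it easier to see when the 404s start.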

I will attempt to connect machine 2 to another known-good PCCS service and update this thread.
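
For that comparison, one option is to replay the exact `pckcert` query that failed against the other PCCS. The sketch below only rebuilds the URL from a logged access line; it does not send anything. The base URL is a placeholder, and `request_from_log` and `rebuild_url` are hypothetical helper names, not part of any Intel tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def request_from_log(line):
    """Return (method, target) from the first quoted field of an access-log line."""
    start = line.index('"') + 1
    end = line.index('"', start)
    method, target, _protocol = line[start:end].split(" ")
    return method, target


def rebuild_url(target, base_url):
    """Re-attach the logged path and query parameters to another server's base URL."""
    parts = urlsplit(target)
    # Keep the parameters (qeid, encrypted_ppid, cpusvn, pcesvn, pceid) in logged order.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return f"{base_url.rstrip('/')}{parts.path}?{urlencode(params)}"
```

For example, `rebuild_url(request_from_log(line)[1], "https://other-pccs.example:8081")` yields a URL that can be requested with any HTTP client to check whether the second PCCS returns a certificate for the same platform identifiers.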

nbj12138 commented 6 months ago

Have you solved this problem? I am running into the same issue.