gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

IPC pid release failed error followed by Illegal instruction during Graphene internal execution seen with LTP tests #2504

Closed: jinengandhi-intel closed this issue 3 years ago

jinengandhi-intel commented 3 years ago

Description of the problem

For random tests, we are seeing the following errors after the test has finished, even though the return code for the test is 0:

<system-err>
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
[P19048:T2:fcntl12] error: IPC pid release failed
[P19048:T2:fcntl12] error: Illegal instruction during Graphene internal execution at 0x7fd5702dc55a (IP = +0xc55a, VMID = 19048, TID = 2)
</system-err>

This is seen in the open-source CI as well as in Intel's internal CI, but it was missed because the errors are reported in the `<system-err>` block, which is not parsed.

https://localhost:8080/job/graphene-18.04/5903/artifact/LibOS/shim/test/ltp/ltp.xml
https://localhost:8080/job/graphene-18.04/5911/artifact/LibOS/shim/test/ltp/ltp.xml

<system-out>
fcntl12 0 TINFO : Test for errno EMFILE
fcntl12 1 TPASS : block 1 PASSED
</system-out>
<system-err>
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
[P19001:T2:fcntl12] error: IPC pid release failed
[P19001:T2:fcntl12] error: Illegal instruction during Graphene internal execution at 0x7fee5665955a (IP = +0xc55a, VMID = 19001, TID = 2)
</system-err>
<properties>
<property name="returncode" value="0"/>

Steps to reproduce

Run the fcntl12, fcntl12_64, waitpid03, sendto01, or waitpid04 LTP test in Graphene native mode.
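For reference, a hedged sketch of a local reproduction; the `ltp.xml` make target is inferred from the CI artifact path above and may differ between Graphene revisions:

```sh
# Build the LTP binaries plus their Graphene manifests, run the suite in
# native (non-SGX) mode, then inspect the stderr blocks that the CI parser
# skips. Target names are assumptions based on the archived artifact path.
cd LibOS/shim/test/ltp
make
make ltp.xml
grep -A3 '<system-err>' ltp.xml
```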

dimakuv commented 3 years ago

Looks like an issue for @boryspoplawski

jinengandhi-intel commented 3 years ago

I went back and checked the nightly results in our local CI: no such failures were seen through our June 29 nightly, i.e. up to commit "[LibOS] Rework IDs management" (205cbe0123978b7b178c413a172b0be38658ef55), but they started with the June 30 nightly run, so most probably it is one of these 2 commits from Borys:

- [LibOS] Send PID alongside TID in tgkill IPC message …
- [LibOS] Do not remove IPC connection on errors …

jinengandhi-intel commented 3 years ago

In our internal CI, I also saw connect01 fail once:

```
connect01 1 TPASS : bad file descriptor successful
connect01 2 TPASS : invalid socket buffer successful
connect01 3 TPASS : invalid salen successful
connect01 4 TPASS : invalid socket successful
connect01 5 TPASS : already connected successful
connect01 6 TPASS : connection refused successful
connect01 7 TPASS : invalid address family successful
<system-err>
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
[P15230:T2:connect01] error: Sending IPC process-exit notification failed: -13
[P15230:T2:connect01] error: IPC pid release failed
[P15230:T2:connect01] error: Illegal instruction during Graphene internal execution at 0x7f87fb4c755a (IP = +0xc55a, VMID = 15230, TID = 2)
</system-err>
```

boryspoplawski commented 3 years ago

@jinengandhi-intel please check #2508; it should fix the issue.

The "Illegal instruction during Graphene internal execution" message was just due to `die_or_inf_loop`, which executes `ud2`. The general problem is that the IPC leader does not wait for all subprocesses to finish, but #2508 should fix the problem temporarily.
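For context, a minimal sketch (not Graphene's exact code) of a `die_or_inf_loop`-style helper: the x86 `ud2` instruction raises an invalid-opcode fault (#UD), which surfaces as exactly this kind of "Illegal instruction" report at the faulting IP.

```c
#include <stdnoreturn.h>

/* Sketch of a die_or_inf_loop-style fatal-error helper on x86-64: `ud2`
 * raises #UD, which is reported as an illegal instruction at the faulting
 * IP; the loop guarantees we never return even if execution resumes. */
noreturn void die_or_inf_loop(void) {
    while (1)
        __asm__ volatile("ud2");
}
```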

mkow commented 3 years ago

@jinengandhi-intel please format your comments better, the logs in the one above are impossible to read :/

jinengandhi-intel commented 3 years ago

Since the log level for the message has been changed from error to warning, I don't see it with the default settings, but I do see it when I change the log_level to trace; there are no side effects from it. As this is just a workaround, I would suggest we keep the issue open: it is a legitimate error that needs to be resolved, and I don't want this temporary fix to mask the issue.
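For reference, a minimal sketch of the manifest tweak implied above, assuming the `loader.log_level` manifest option referenced in this comment:

```toml
# Raise Graphene's log verbosity so the demoted (error -> warning) IPC
# message becomes visible again; per the comment above, the default
# level does not show it.
loader.log_level = "trace"
```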

boryspoplawski commented 3 years ago

The actual issue is tracked in #2514