llly opened this issue 3 years ago
@llly Could you try it on the latest Graphene? After the v1.2-rc
release, we had a bunch of bug fixes and small changes to the IPC (Inter-Process Communication) subsystem in Graphene.
In particular, when you move to the latest Graphene, you will notice more meaningful process IDs: instead of your current P4346 and P5015, there will be P1 for the main process, P2 for the first child process, etc.
Hopefully this bug will disappear after you update. If not, please post a similar debug log for the updated run.
@boryspoplawski Please keep an eye on this too. Also, looks like this message:
debug: ipc_change_id_owner: sending a request (524..5015)
should actually be:
debug: ipc_change_id_owner: sending a request (524 -> 5015)
(Meaning that TID 524 should get a new owner with PID 5015.)
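For reference, a minimal sketch of what the corrected logging call might look like, assuming a printf-style log_debug() helper (the helper, function, and variable names here are illustrative placeholders, not the actual Graphene code):

#include <stdio.h>

/* Illustrative stand-in for Graphene's debug logger; the real macro differs. */
#define log_debug(fmt, ...) fprintf(stderr, "debug: " fmt "\n", ##__VA_ARGS__)

/* Hypothetical helper; tid and new_owner_pid are placeholder names. */
static void report_change_id_owner(unsigned tid, unsigned new_owner_pid) {
    /* before: "... (%u..%u)"   -- reads like an ID range          */
    /* after:  "... (%u -> %u)" -- reads as "ID gets a new owner"  */
    log_debug("ipc_change_id_owner: sending a request (%u -> %u)", tid, new_owner_pid);
}

int main(void) {
    report_change_id_owner(524, 5015);  /* prints: ... (524 -> 5015) */
    return 0;
}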
@dimakuv I can reproduce this failure on the latest Graphene trunk.
This time it is caused by clone()ing a process at the very beginning of the Python program.
[P2:T21:python] trace: ---- shim_clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0x0, 0x0, 0xcf740ca10, 0x0) ...
[P2:T21:python] debug: ipc_get_new_vmid: sending a request
[P2:T21:python] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: sending a request: [33..36]
[P2:T21:python] debug: Waiting for a response to 3
[P2:shim] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: ipc_send_message: 0
[P1:shim] debug: IPC worker: received IPC message from 2: code=1 size=34 seq=3
[P1:shim] debug: ipc_get_new_vmid_callback: 3
[P1:shim] debug: Sending ipc message to 2
[P1:shim] debug: IPC worker: received IPC message from 2: code=4 size=42 seq=0
[P2:shim] debug: IPC worker: received IPC message from 1: code=0 size=38 seq=3
[P1:shim] debug: ipc_release_id_range_callback: release_id_range(33..36)
[P2:T21:python] debug: Waiting finished: 0
[P2:shim] debug: Got an IPC response from 1, seq: 3
[P2:T21:python] debug: ipc_get_new_vmid: got a response: 3
......
[P2:T21:python] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P2:T21:python] debug: complete checkpointing data
[P2:T21:python] debug: checkpoint of 178440 bytes created
......
[P3:T37:python] debug: successfully restored checkpoint at 0xfb89fe000 - 0xfb8a29908
[P3:T37:python] debug: Creating pipe: pipe.srv:3
debug: sock_getopt (fd = 86, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 87, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
[P3:shim] debug: IPC worker started
debug: sock_getopt (fd = 330, sockopt addr = 0x7f5a84f25f70) is not implemented and always returns 0
[P3:T37:python] debug: ipc_change_id_owner: sending a request (37..3)
debug: sock_getopt (fd = 88, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 263, sockopt addr = 0x7f91a51eaf70) is not implemented and always returns 0
[P3:T37:python] debug: Sending ipc message to 1
[P3:T37:python] debug: Waiting for a response to 1
[P1:shim] debug: IPC worker: received IPC message from 3: code=5 size=42 seq=1
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P1:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f7e91d (IP = +0x2891d, VMID = 1, TID = 0)
debug: DkProcessExit: Returning exit code 1
[P2:shim] debug: IPC leader disconnected
[P2:shim] debug: Unknown process (vmid: 0x1) disconnected
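For context, the pattern behind this trace is an ordinary fork() at Python startup: Graphene implements it as clone(... SIGCHLD ...), a checkpoint of the parent, a restore in the child, and IPC requests to the leader process P1 (all visible above). A minimal C sketch of the plain fork()+exec() pattern (illustrative only, not the workload's actual code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* fork() shows up in the trace above as clone(... SIGCHLD ...). */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: run another command; exit if exec fails. */
        execlp("echo", "echo", "hello from the child", (char*)NULL);
        _exit(127);
    }
    /* Parent: reap the child. */
    int status;
    waitpid(pid, &status, 0);
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}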
@llly How can we quickly reproduce it? What is the Python script you tried, and is there anything special you did in python.manifest.template?
Also, what is the exact commit hash you've run this on?
The commit hash is fb71e4376a1fa797697832ca5cbd7731dc7f8793. The same issue also happens on 1.2-rc1. On the previous commit 1b8848bdae09fa0e92e4644b7fbd4e8c6cd1718c, which is before 1.2-rc1, it works well.
@llly How can we quickly reproduce it? What is the Python script you tried, and is there anything special you did in python.manifest.template?
Actually, we use bash to start Java/Python programs. Please refer to this manifest: https://github.com/intel-analytics/analytics-zoo/blob/master/ppml/trusted-big-data-ml/python/docker-graphene/bash.manifest.template
Any comments and help will be appreciated!
That commit is 5 months old... Please try the newest master from the new repository (at the time of writing: https://github.com/gramineproject/gramine/commit/ff5a2da0dd577f47204a546465aab6797bce5d83)
Sorry, I pasted the wrong commit. Actually we are now running on fb71e4376a1fa797697832ca5cbd7731dc7f8793 and encounter this issue.
@glorysdj Unfortunately, the manifest you provided requires a very specific environment on the machine (Python 3.6, Java, Spark, something called zinc, etc.). So we won't be able to reproduce this.
Anyway, @boryspoplawski, do you have an idea what the debug log above could indicate? In particular, why this could trigger:
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
In other words, why wouldn't the main process P1 know all ancestor IDs?
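For illustration, here is a minimal sketch, not the actual Gramine code in shim_ipc_pid.c, of the kind of ID-ownership check that appears to fire: the leader keeps a map from IDs to owner processes (VMIDs), and a change-owner request for an ID it never registered ends up in BUG():

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for Graphene's BUG() macro. */
#define BUG() do { fprintf(stderr, "error: BUG() %s:%d\n", __FILE__, __LINE__); abort(); } while (0)

/* Hypothetical ownership record: which process (VMID) owns a given PID/TID. */
struct id_owner {
    uint32_t id;
    uint32_t owner_vmid;
};

static struct id_owner* find_owner(struct id_owner* table, size_t len, uint32_t id) {
    for (size_t i = 0; i < len; i++)
        if (table[i].id == id)
            return &table[i];
    return NULL;
}

static void change_id_owner(struct id_owner* table, size_t len, uint32_t id,
                            uint32_t new_owner_vmid) {
    struct id_owner* entry = find_owner(table, len, id);
    if (!entry) {
        /* Corresponds to the "ID 37 unknown!" line followed by BUG() in the log. */
        fprintf(stderr, "debug: ID %u unknown!\n", id);
        BUG();
    }
    entry->owner_vmid = new_owner_vmid;
}

int main(void) {
    /* The leader only ever learned about ID 33; a request for ID 37 aborts. */
    struct id_owner table[] = { { .id = 33, .owner_vmid = 2 } };
    change_id_owner(table, sizeof(table) / sizeof(table[0]), 37, 3);
    return 0;
}

The open question then becomes: why does the leader P1 not have an entry for ID 37 by the time the request from P3 arrives?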
@dimakuv no idea, sounds like a serious bug, but we've never seen that before...
@glorysdj @llly Could you please post a trace log of a full execution? Maybe then we could tell more (the provided snippet skips too many details).
The trace log is attached.
You can also run it with the Docker image below:
export ENCLAVE_KEY=xxx/enclave-key.pem
export LOCAL_IP=x.x.x.x
docker run -itd \
--privileged \
--net=host \
--cpuset-cpus="26-30" \
--oom-kill-disable \
--device=/dev/gsgx \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provision \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
--name=spark-local \
-e LOCAL_IP=$LOCAL_IP \
-e SGX_MEM_SIZE=32G \
intelanalytics/analytics-zoo-ppml-trusted-big-data-ml-python-graphene:0.12-k8s bash
Attach to the container:
docker exec -it spark-local bash
cd /ppml/trusted-big-data-ml/
init the sgx manifest
./init.sh
run the workload
graphene-sgx ./bash -c "/opt/jdk8/bin/java -cp \
'/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar:${SPARK_HOME}/conf/:${SPARK_HOME}/jars/*' \
-Xmx3g \
org.apache.spark.deploy.SparkSubmit \
--master 'local[4]' \
--conf spark.driver.memory=3g \
--conf spark.executor.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--conf spark.driver.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--properties-file /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
--jars /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--py-files /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-python-api.zip \
--executor-memory 3g \
--executor-cores 2 \
--driver-cores 2 \
/ppml/trusted-big-data-ml/work/examples/pyzoo/orca/learn/tf/basic_text_classification/basic_text_classification.py \
--cluster_mode local" | tee test-orca-tf-text-sgx.log
@dimakuv @boryspoplawski any insights?
@glorysdj Could you check if https://github.com/gramineproject/gramine/pull/109 fixes this issue?
yes, we will try this
@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?
We have tried with the latest Gramine, but encountered another issue when running a very simple Java program. We will try to summarize it.
I encountered a BUG() when migrating a Java program from an old Graphene version to the latest 1.2-rc. It reproduces every time. https://github.com/oscarlab/graphene/blob/f7aa86ed3ae4bbb99480d16bab98c87625b778c9/LibOS/shim/src/ipc/shim_ipc_pid.c#L150-L153 There are several fork() calls in the Java program to run another system command. The other forks succeed. I cannot root-cause it. Here is the log: