llly opened this issue 3 years ago
@llly Could you try it on the latest Graphene? After the v1.2-rc
release, we had a bunch of bug fixes and small changes to the IPC (Inter-Process Communication) subsystem in Graphene.
In particular, when you move to the latest Graphene, you will notice more meaningful process IDs: instead of your current P4346 and P5015, there will be P1 for the main process, P2 for the first child process, etc.
Hopefully this bug will disappear after you update. If not, please post a similar debug log for the updated run.
@boryspoplawski Please keep an eye on this too. Also, looks like this message:
debug: ipc_change_id_owner: sending a request (524..5015)
should actually be:
debug: ipc_change_id_owner: sending a request (524 -> 5015)
(Meaning that TID 524 should get a new owner with PID 5015.)
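For reference, a minimal sketch of what the corrected logging call might look like, assuming a printf-style log_debug() helper (the helper, function, and variable names here are illustrative placeholders, not the actual Graphene code):

#include <stdio.h>

/* Illustrative stand-in for Graphene's debug logger; the real macro differs. */
#define log_debug(fmt, ...) fprintf(stderr, "debug: " fmt "\n", ##__VA_ARGS__)

/* Hypothetical helper; tid and new_owner_pid are placeholder names. */
static void report_change_id_owner(unsigned tid, unsigned new_owner_pid) {
    /* before: "... (%u..%u)"   -- reads like an ID range          */
    /* after:  "... (%u -> %u)" -- reads as "ID gets a new owner"  */
    log_debug("ipc_change_id_owner: sending a request (%u -> %u)", tid, new_owner_pid);
}

int main(void) {
    report_change_id_owner(524, 5015);  /* prints: ... (524 -> 5015) */
    return 0;
}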
@dimakuv I can reproduce this failure on the latest Graphene trunk.
This time it is caused by clone()ing a process at the very beginning of the Python program.
[P2:T21:python] trace: ---- shim_clone(CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|[SIGCHLD], 0x0, 0x0, 0xcf740ca10, 0x0) ...
[P2:T21:python] debug: ipc_get_new_vmid: sending a request
[P2:T21:python] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: sending a request: [33..36]
[P2:T21:python] debug: Waiting for a response to 3
[P2:shim] debug: Sending ipc message to 1
[P2:shim] debug: ipc_release_id_range: ipc_send_message: 0
[P1:shim] debug: IPC worker: received IPC message from 2: code=1 size=34 seq=3
[P1:shim] debug: ipc_get_new_vmid_callback: 3
[P1:shim] debug: Sending ipc message to 2
[P1:shim] debug: IPC worker: received IPC message from 2: code=4 size=42 seq=0
[P2:shim] debug: IPC worker: received IPC message from 1: code=0 size=38 seq=3
[P1:shim] debug: ipc_release_id_range_callback: release_id_range(33..36)
[P2:T21:python] debug: Waiting finished: 0
[P2:shim] debug: Got an IPC response from 1, seq: 3
[P2:T21:python] debug: ipc_get_new_vmid: got a response: 3
......
[P2:T21:python] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P2:T21:python] debug: complete checkpointing data
[P2:T21:python] debug: checkpoint of 178440 bytes created
......
[P3:T37:python] debug: successfully restored checkpoint at 0xfb89fe000 - 0xfb8a29908
[P3:T37:python] debug: Creating pipe: pipe.srv:3
debug: sock_getopt (fd = 86, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 87, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
[P3:shim] debug: IPC worker started
debug: sock_getopt (fd = 330, sockopt addr = 0x7f5a84f25f70) is not implemented and always returns 0
[P3:T37:python] debug: ipc_change_id_owner: sending a request (37..3)
debug: sock_getopt (fd = 88, sockopt addr = 0x7ffcbf790280) is not implemented and always returns 0
debug: sock_getopt (fd = 263, sockopt addr = 0x7f91a51eaf70) is not implemented and always returns 0
[P3:T37:python] debug: Sending ipc message to 1
[P3:T37:python] debug: Waiting for a response to 1
[P1:shim] debug: IPC worker: received IPC message from 3: code=5 size=42 seq=1
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
error: Unknown or illegal instruction executed
[P1:shim] error: Illegal instruction during Graphene internal execution at 0xfe4f7e91d (IP = +0x2891d, VMID = 1, TID = 0)
debug: DkProcessExit: Returning exit code 1
[P2:shim] debug: IPC leader disconnected
[P2:shim] debug: Unknown process (vmid: 0x1) disconnected
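For context, the pattern behind this trace is an ordinary fork() at Python startup: Graphene implements it as clone(... SIGCHLD ...), a checkpoint of the parent, a restore in the child, and IPC requests to the leader process P1 (all visible above). A minimal C sketch of the plain fork()+exec() pattern (illustrative only, not the workload's actual code):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    /* fork() shows up in the trace above as clone(... SIGCHLD ...). */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0) {
        /* Child: run another command; exit if exec fails. */
        execlp("echo", "echo", "hello from the child", (char*)NULL);
        _exit(127);
    }
    /* Parent: reap the child. */
    int status;
    waitpid(pid, &status, 0);
    printf("child exited with status %d\n", WEXITSTATUS(status));
    return 0;
}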
@llly How can we quickly reproduce it? What is the Python script you tried, and is there anything special you did in python.manifest.template?
Also, what is the exact commit hash you've run this on?
The commit hash is fb71e4376a1fa797697832ca5cbd7731dc7f8793. The same issue also happens on 1.2-rc1. On the previous commit 1b8848bdae09fa0e92e4644b7fbd4e8c6cd1718c, which is before 1.2-rc1, it works well.
@llly How can we quickly reproduce it? What is the Python script you tried, and is there anything special you did in python.manifest.template?
Actually, we use bash to start Java/Python programs. Please refer to this manifest: https://github.com/intel-analytics/analytics-zoo/blob/master/ppml/trusted-big-data-ml/python/docker-graphene/bash.manifest.template
Any comments and help will be appreciated!
That commit is 5 months old... Please try the newest master from the new repository (at the time of writing: https://github.com/gramineproject/gramine/commit/ff5a2da0dd577f47204a546465aab6797bce5d83)
Sorry, I pasted the wrong commit. Actually we are now running on fb71e4376a1fa797697832ca5cbd7731dc7f8793 and encounter this issue.
@glorysdj Unfortunately, the manifest you provided requires a very specific environment on the machine (Python 3.6, Java, Spark, something called zinc, etc.). So we won't be able to reproduce this.
Anyway, @boryspoplawski, do you have an idea what the debug log above could indicate? In particular, why this could trigger:
[P1:shim] debug: ID 37 unknown!
[P1:shim] error: BUG() ../LibOS/shim/src/ipc/shim_ipc_pid.c:153
In other words, why wouldn't the main process P1 know all ancestor IDs?
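For illustration, here is a minimal sketch, not the actual Gramine code in shim_ipc_pid.c, of the kind of ID-ownership check that appears to fire: the leader keeps a map from IDs to owner processes (VMIDs), and a change-owner request for an ID it never registered ends up in BUG():

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative stand-in for Graphene's BUG() macro. */
#define BUG() do { fprintf(stderr, "error: BUG() %s:%d\n", __FILE__, __LINE__); abort(); } while (0)

/* Hypothetical ownership record: which process (VMID) owns a given PID/TID. */
struct id_owner {
    uint32_t id;
    uint32_t owner_vmid;
};

static struct id_owner* find_owner(struct id_owner* table, size_t len, uint32_t id) {
    for (size_t i = 0; i < len; i++)
        if (table[i].id == id)
            return &table[i];
    return NULL;
}

static void change_id_owner(struct id_owner* table, size_t len, uint32_t id,
                            uint32_t new_owner_vmid) {
    struct id_owner* entry = find_owner(table, len, id);
    if (!entry) {
        /* Corresponds to the "ID 37 unknown!" line followed by BUG() in the log. */
        fprintf(stderr, "debug: ID %u unknown!\n", id);
        BUG();
    }
    entry->owner_vmid = new_owner_vmid;
}

int main(void) {
    /* The leader only ever learned about ID 33; a request for ID 37 aborts. */
    struct id_owner table[] = { { .id = 33, .owner_vmid = 2 } };
    change_id_owner(table, sizeof(table) / sizeof(table[0]), 37, 3);
    return 0;
}

The open question then becomes: why does the leader P1 not have an entry for ID 37 by the time the request from P3 arrives?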
@dimakuv no idea, sounds like a serious bug, but we've never seen that before...
@glorysdj @llly Could you please post a trace log of a full execution? Maybe then we could tell more (the provided snippet skips too many details).
The trace log is attached.
You can also run it with the Docker image below:
export ENCLAVE_KEY=xxx/enclave-key.pem
export LOCAL_IP=x.x.x.x
docker run -itd \
--privileged \
--net=host \
--cpuset-cpus="26-30" \
--oom-kill-disable \
--device=/dev/gsgx \
--device=/dev/sgx/enclave \
--device=/dev/sgx/provision \
-v /var/run/aesmd/aesm.socket:/var/run/aesmd/aesm.socket \
-v $ENCLAVE_KEY:/graphene/Pal/src/host/Linux-SGX/signer/enclave-key.pem \
--name=spark-local \
-e LOCAL_IP=$LOCAL_IP \
-e SGX_MEM_SIZE=32G \
intelanalytics/analytics-zoo-ppml-trusted-big-data-ml-python-graphene:0.12-k8s bash
Attach to the container:
docker exec -it spark-local bash
cd /ppml/trusted-big-data-ml/
init the sgx manifest
./init.sh
run the workload
graphene-sgx ./bash -c "/opt/jdk8/bin/java -cp \
'/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar:${SPARK_HOME}/conf/:${SPARK_HOME}/jars/*' \
-Xmx3g \
org.apache.spark.deploy.SparkSubmit \
--master 'local[4]' \
--conf spark.driver.memory=3g \
--conf spark.executor.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--conf spark.driver.extraClassPath=/ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--properties-file /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/conf/spark-analytics-zoo.conf \
--jars /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-jar-with-dependencies.jar \
--py-files /ppml/trusted-big-data-ml/work/analytics-zoo-0.12.0-SNAPSHOT/lib/analytics-zoo-bigdl_0.13.0-spark_3.1.2-0.12.0-SNAPSHOT-python-api.zip \
--executor-memory 3g \
--executor-cores 2 \
--driver-cores 2 \
/ppml/trusted-big-data-ml/work/examples/pyzoo/orca/learn/tf/basic_text_classification/basic_text_classification.py \
--cluster_mode local" | tee test-orca-tf-text-sgx.log
@dimakuv @boryspoplawski any insights?
@glorysdj Could you check if https://github.com/gramineproject/gramine/pull/109 fixes this issue?
yes, we will try this
@glorysdj Could you check if gramineproject/gramine#109 fixes this issue?
We have tried with the latest Gramine, but encountered another issue when running a very simple Java program. We will try to summarize it.
I encountered a BUG() when migrating a Java program from an old Graphene version to the latest 1.2-rc. It reproduces every time. https://github.com/oscarlab/graphene/blob/f7aa86ed3ae4bbb99480d16bab98c87625b778c9/LibOS/shim/src/ipc/shim_ipc_pid.c#L150-L153 There are several fork() calls in the Java program to run another system command. The other forks succeed. I cannot root-cause it. Here is the log: