Open ila opened 5 months ago
> The files are in the manifest as allowed files

Could you try running Gramine with `loader.log_level` set to `trace` and paste the relevant lines (when trying to access this file, and a few before that)?
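For reference, the log level is set in the application's Gramine manifest; a minimal fragment could look like this (the `loader.log_file` line is optional and only useful because trace output is very verbose):

```toml
loader.log_level = "trace"   # one of: none, error, warning, debug, trace, all
# optionally redirect the verbose output to a file:
loader.log_file = "gramine.log"
```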
The very same steps worked on my previous machine (Ubuntu 22.10), now I am using Ubuntu 23.10
I guess it's some change in glibc which causes this...
Hi, I uploaded the whole log: https://raw.githubusercontent.com/cwida/DuckDB-SGX/main/benchmarks/strace_new.txt
I tried grep'ing the relevant syscalls (`getuid`, `setuid`) but nothing interesting showed up (at least to my ignorant self), so I am not even sure what to look for. If I search for the file name, I see roughly 200 calls looking like:
```
access("orders.parquet", F_OK) = 0x0
openat(AT_FDCWD, "orders.parquet", O_RDONLY, 0000) = 0x4
```
Other `access` calls return -2, is that a problem? Sorry for lacking the knowledge to debug this further. Do you think it could be glibc? I can look into downgrading it and see if anything changes (although I am not sure I can reinstall the OS, so I would rather not). Thanks for the help :)
Update: could this be an issue with Python? If I port the application itself (a CLI executable, rather than the corresponding Python module), it works with `gramine-direct`.
For reference:
```shell
gramine-direct duckdb < myfile.sql        # works, no problem
gramine-direct benchmark mybenchmark.py   # I/O issue
```
Here `mybenchmark.py` reads and calls `myfile.sql`, and `myfile.sql` contains a `select * from read_parquet('orders.parquet')`. We are just using Python as a wrapper for helper functions, rather than having to call an excessive number of forks with bash. If I do the same (as in: calling my CLI executable and giving it a query reading from a file) with `gramine-sgx`, the output is `error: Host injected malicious signal 2`. I managed to fix it by increasing the enclave size. Could this be related to my Python issue?
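For anyone hitting the same thing: the enclave size is raised in the manifest; a sketch (the value must be a power of two and large enough for the working set):

```toml
sgx.enclave_size = "4G"  # must be a power of two
```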
```
#PF/#GP information in EXINFO in MISC region of SSA supported: true
#CP information in EXINFO in MISC region of SSA supported: false
```
Another update: I managed to reduce this to a minimal dysfunctional Python example (it still gives me an I/O error). I also tried the setting that allows files even if they aren't trusted or allowed explicitly. Still breaks. This is my `benchmark.py`:
```python
#!/usr/bin/python3
import os
import duckdb
import timeit
import sys

def main():
    print("Starting benchmark...")
    con = duckdb.connect()
    con.sql("select 1").fetchall()  # works
    con.sql("select * from parquet_scan('orders.parquet');").fetchall()  # breaks
    print("ok")
    con.close()

if __name__ == "__main__":
    main()
```
Edit, another interesting behaviour:
```python
def main():
    print("Starting benchmark...")
    con = duckdb.connect("test.duckdb", read_only=False)  # works
    con.sql("select 1").fetchall()  # works
    con.sql("select * from parquet_scan('orders.parquet') limit 10000").fetchall()  # works
    con.sql("create table x as select * from parquet_scan('orders.parquet') limit 10000").fetchall()  # breaks
    con.sql("select * from x limit 10000;").fetchall()  # breaks
    print("ok")
    con.close()

if __name__ == "__main__":
    main()
```
New update: it seems like Gramine breaks with queries without a LIMIT? Example:
```python
def main():
    con = duckdb.connect(database="tpch.duckdb", read_only=False)
    # lineitem is a view defined with SELECT * FROM lineitem.parquet
    con.sql("select * from lineitem limit 100;").fetchall()  # works
    con.sql("select * from lineitem;").fetchall()  # breaks (I/O error)

if __name__ == "__main__":
    main()
```
This is beginning to stop making sense :)
Could it be related to stack/heap size? Memory allocated to Gramine? EPC size?
Another edit: I can confirm that this I/O problem only happens when materializing large input. I have no idea how the output size could be related to I/O permissions, especially through `gramine-direct`. Maybe I need to change something in the BIOS?
> Other `access` calls return -2, is that a problem?

Probably not, they are just checking what stuff is present, like `/etc/ld.so.preload` and cgroups (which probably are not required for the DB to work).
> I uploaded the whole log: https://raw.githubusercontent.com/cwida/DuckDB-SGX/main/benchmarks/strace_new.txt

Hmm, the log is from a different script than the original post? The exceptions you pasted originally were about a different file (`duckdb-parquet-encrypted/hits_encrypted.parquet`). What was the exact error in the run you pasted the logs from?
> Another edit: I can indeed confirm that this I/O problem only happens when materializing large-size input. I have no idea how the output size could be related to I/O permissions, especially through `gramine-direct`. Maybe I need to change something in the BIOS?

Are you using EDMM? If not, you may be exceeding enclave size limits (configurable in the manifest). I don't see any I/O errors in the strace log, but it's quite long; it would be easier if you managed to find the place where the app is failing (e.g. by adding some markers to the app right before and after the failure).
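A minimal sketch of such markers (here `risky_step` is a hypothetical placeholder for whatever statement fails under Gramine; `flush=True` makes sure the marker reaches the log before a potential crash):

```python
def risky_step():
    # hypothetical stand-in for the failing statement, e.g. in this thread:
    # con.sql("select * from parquet_scan('orders.parquet')").fetchall()
    return sum(range(10))

print("MARKER: before risky_step", flush=True)  # flushed so it survives a crash right after
result = risky_step()
print("MARKER: after risky_step", flush=True)
```

In the Gramine trace log, the interesting syscalls should then show up between the two markers.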
@ila Late to the party, but was this issue resolved or still open?
If it is still open, what would be the latest-working way to reproduce it? This would be interesting to debug.
Yes, sorry for the delay; this issue was actually in my backlog and I was going to take a deeper look at it these days. I just checked and the issue persists. I do suspect it might be caused by a file which is larger than the enclave. I tried replicating this with many scripts/files (which is why the logs I posted differ slightly from one another), and the error is always the same: an I/O exception because of permissions. However, if the enclave is full and the file needs to be re-read from disk, that shouldn't produce a permission error. My application is also not allowed to spill to disk.
If you want to reproduce, clone https://github.com/cwida/ClickBench-SGX/tree/main and follow the "DuckDB (Parquet, encrypted)" section of the README; you will probably need to edit the manifest file a bit, since some paths are still hardcoded. You need Ubuntu 23.10 and the latest stable Gramine. I can also provide access to the machine if needed.
I am also happy to debug further or look deeper into `strace` logs, but I'd need some instructions to know exactly what we are looking for.
I might be facing the same problem here. I apologize if it's actually something different, but also for me things work on one host, while the same image crashes on another one with `(pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2`.

But from the start: I created a Docker image that starts a jar which outputs "Hello World", as simple as that. Using `gsc build` and `gsc sign-image`, I converted the image to a GSC image. Running the GSC image on one host, an Intel NUC, works as expected.
However, when I copy the image to an Azure VM (via `docker save` and `docker load`), I receive the following message:

```
(pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2
```
Note that when I use the Python helloworld example from the documentation, the whole process works as expected on both hosts. The sizes of the Python and the Java GSC images are similar, but judging by the time it takes to load the images, my guess is that Java consumes much more resources at runtime.
I attached the logs from both the working Intel NUC and the Azure VM from which I receive the error: broken.log, working.log. Both run Ubuntu 22.02. Please let me know if there are any other details I can compare. From my understanding, this is where it goes wrong.

NUC:

```
(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/microcode/version", O_RDONLY|0x80000, 0000) = -2
(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY|0x80000, 0000) = 0x4
```

Azure VM:

```
(pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2
(pal_process.c:248:_PalProcessExit) debug: PalProcessExit: Returning exit code 1
```
So I started the GSC image on both hosts with `-it --entrypoint /bin/bash` to debug further, and it turns out that while I could read `/sys/devices/system/cpu/cpu0/microcode/version` from the Docker container on the NUC, the Docker container in the Azure VM said:

```
cat: /sys/devices/system/cpu/cpu0/microcode/version: No such file or directory
```

What I still have to understand is why this file is accessible from the GSC image on one host but not on the other, when launched with the same configuration.
Edit: well, the obvious answer: the file `/sys/devices/system/cpu/cpu0/microcode/version` simply does not exist on the Azure VM, which seems to crash the GSC container.
Have you tried adding `sgx.use_exinfo = true` or `sgx.insecure__allow_memfaults_without_exinfo = true`? The former is used when EDMM is enabled, the latter when EDMM is disabled. See https://gramine.readthedocs.io/en/stable/manifest-syntax.html#sgx-exinfo for explanations.
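Both settings go into the manifest; a fragment (pick one depending on whether EDMM is enabled):

```toml
# with EDMM enabled:
sgx.use_exinfo = true

# with EDMM disabled (note the "insecure" prefix):
# sgx.insecure__allow_memfaults_without_exinfo = true
```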
`sgx.use_exinfo = true` did indeed somewhat fix the issue. In the trace, I still see

```
(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/microcode/version", O_RDONLY|0x80000, 0000) = -2
```

but execution continues afterwards. Thank you!
@ila: Are you still seeing this problem? Did you try `sgx.use_exinfo = true`?
Description of the problem
I am testing my benchmarks on a new machine and suddenly Gramine won't work as expected. The workflow I am running is the following:
The very same steps worked on my previous machine (Ubuntu 22.10); now I am using Ubuntu 23.10 (with updated Gramine, so maybe it's an issue of the latest release). I tried Gramine 1.6 but the behaviour is the same.
- The files are in the manifest as allowed files
- The script fails with `gramine-direct` and `gramine-sgx`, however it succeeds without issues outside Gramine
- Permissions on the files are 777, owned by the same user calling Gramine, same group
- Also tried with `sudo`, which gives the same error
- Using Gramine built from source (Gramine 1.6.2post~UNRELEASED, e3d3aa39c11054dfa3ce1ad2ce715ec24d730f1a), but I also tried with prepackaged Gramine
- CPU is Intel Xeon Platinum 8570
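For context, the "allowed files" entries mentioned above typically look like this in the manifest (a sketch; the paths are placeholders for the actual data files):

```toml
sgx.allowed_files = [
  "file:orders.parquet",
  "file:myfile.sql",
]
```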
Logs just before crashing report:
Could this be the issue? But I don't see how this would be related to reading from a file (or related to my application suddenly breaking on a new system)
Steps to reproduce
Not sure you want to reproduce this since it takes some time, but follow the "duckdb parquet encrypted" section here: https://github.com/cwida/ClickBench-SGX/tree/main. The very same error also appears when executing this: https://github.com/cwida/DuckDB-SGX/tree/main/benchmarks
Also I am not sure this is even reproducible, as it only happened when I switched my machine to Ubuntu 23 and new hardware (but if you work at Intel you can probably get access to the thing)
Expected results
Gramine able to read files
Actual results
Gramine commit hash
e3d3aa39c11054dfa3ce1ad2ce715ec24d730f1a