gramineproject / gramine

A library OS for Linux multi-process applications, with Intel SGX support
GNU Lesser General Public License v3.0
576 stars 190 forks source link

I/O Error: Permission denied #1822

Open ila opened 5 months ago

ila commented 5 months ago

Description of the problem

I am testing my benchmarks on a new machine and suddenly Gramine won't work as expected. The workflow I am running is the following:

  1. The benchmarking suite is written in Python
  2. Gramine calls Python3 with the correspondent executable
  3. We open a connection to our database and query reading from a database file
  4. However, this process fails (see below). All queries not reading from a file are successful, so it might really be a permission error. I also tried the following: 1) verifying there is no lock on the database or on the file, 2) changing Gramine loader user, 3) changing path of the file.

The very same steps worked on my previous machine (Ubuntu 22.10), now I am using Ubuntu 23.10 (with updated Gramine, so maybe it's a issue of the latest release). I tried Gramine 1.6 but the behaviour is the same.

The files are in the manifest as allowed files

The script fails with gramine-direct and gramine-sgx, however succeeds without issues outside Gramine

Permissions on the file are 777 for all files, owned by the same user calling Gramine, same group

Also tried with sudo and gives the same error

Using Gramine built from source (Gramine 1.6.2post~UNRELEASED (e3d3aa39c11054dfa3ce1ad2ce715ec24d730f1a) but I also tried with prepackaged Gramine

CPU is INTEL XEON PLATINUM 8570

Logs just before crashing report:

[P1:T1:python3.11] warning: Unsupported system call prctl

Could this be the issue? But I don't see how this would be related to reading from a file (or related to my application suddenly breaking on a new system)

Steps to reproduce

Not sure you want to reproduce this since it takes some time, but follow the "duckdb parquet encrypted" section here https://github.com/cwida/ClickBench-SGX/tree/main The very same error also appears executing this: https://github.com/cwida/DuckDB-SGX/tree/main/benchmarks

Also I am not sure this is even reproducible, as it only happened when I switched my machine to Ubuntu 23 and new hardware (but if you work at Intel you can probably get access to the thing)

Expected results

Gramine able to read files

Actual results

Traceback (most recent call last):
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 41, in <module>
    main()
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 36, in main
    run_query(query.strip(), output_log_path)
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 18, in run_query
    results = con.sql(query).fetchall()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
duckdb.duckdb.IOException: IO Error: Cannot open file "duckdb-parquet-encrypted/hits_encrypted.parquet": Permission denied
Error in sys.excepthook:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/apport_python_hook.py", line 66, in apport_excepthook
    import apport.report
  File "/usr/lib/python3/dist-packages/apport/__init__.py", line 7, in <module>
    from apport.packaging_impl import impl as packaging
  File "/usr/lib/python3/dist-packages/apport/packaging_impl/__init__.py", line 33, in <module>
    impl = load_packaging_implementation()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/apport/packaging_impl/__init__.py", line 27, in load_packaging_implementation
    module = importlib.import_module(
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/apport/packaging_impl/apt_dpkg.py", line 47, in <module>
    import apt
  File "/usr/lib/python3/dist-packages/apt/__init__.py", line 37, in <module>
    apt_pkg.init_system()
apt_pkg.Error: E:Error reading the CPU table

Original exception was:
Traceback (most recent call last):
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 41, in <module>
    main()
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 36, in main
    run_query(query.strip(), output_log_path)
  File "//duckdb-parquet-encrypted/benchmark_duckdb.py", line 18, in run_query
    results = con.sql(query).fetchall()
              ^^^^^^^^^^^^^^^^^^^^^^^^^
duckdb.duckdb.IOException: IO Error: Cannot open file "duckdb-parquet-encrypted/hits_encrypted.parquet": Permission denied

Gramine commit hash

e3d3aa39c11054dfa3ce1ad2ce715ec24d730f1a

mkow commented 5 months ago

The files are in the manifest as allowed files

Could you try running Gramine with loader.log_level set to trace and paste the relevant lines? (when trying to access this file and a few before that)

The very same steps worked on my previous machine (Ubuntu 22.10), now I am using Ubuntu 23.10

I guess it's some change in glibc which causes this...

ila commented 5 months ago

Hi, I uploaded the whole log: https://raw.githubusercontent.com/cwida/DuckDB-SGX/main/benchmarks/strace_new.txt

I tried grep'ing the relevant syscalls getuid, setuid but nothing interesting showed up (at least to my ignorant self) so I am not even sure what to look for. If I search for the file name, I see 200 ish calls looking like: access("orders.parquet", F_OK) = 0x0 openat(AT_FDCWD, "orders.parquet", O_RDONLY, 0000) = 0x4 Other access calls return -2, is that a problem? Sorry for lacking the knowledge to debug this further. Do you think it could be glibc? I can look into downgrading it and see if anyhing changes (although I am not sure I can reinstall the OS, so I would rather not). Thanks for the help :)

Update - this could be an issue with Python? If I port the application itself (a CLI executable, rather than the corresponding Python module), it works with gramine-direct.

For reference:

gramine-direct duckdb < myfile.sql # works no problem
gramine-direct benchmark mybenchmark.py # I/O issue`

Where mybenchmark.py reads and calls myfile.sql, and myfile.sql contains a select * from read_parquet('orders.parquet)

We are just using Python as a wrapper for helper functions rather than having to call an excessive number of forks with bash. If I do the same (as in: calling my CLI executable and giving it a query reading from a file) with gramine-sgx, the output is: error: Host injected malicious signal 2. I managed to fix it by increasing the enclave size. Could this be related to my Python issue?

#PF/#GP information in EXINFO in MISC region of SSA supported: true
#CP information in EXINFO in MISC region of SSA supported: false

Another update: I managed to edit this to a minimal dysfunctional Python example (still gives me I/O error). I also tried the setting to allow files even if they aren't trusted or allowed explicitly. Still breaks. This is my benchmark.py:

#!/usr/bin/python3

import os
import duckdb
import timeit
import sys

def main():

    print("Starting benchmark...")

    con = duckdb.connect()
    con.sql("select 1").fetchall()  # works
    con.sql("select * from parquet_scan('orders.parquet');").fetchall()  # breaks
    print("ok")
    con.close()

if __name__ == "__main__":
    main()

Edit, another interesting behaviour:

def main():

    print("Starting benchmark...")

    con = duckdb.connect("test.duckdb", read_only=False) # works
    con.sql("select 1").fetchall() # works
    con.sql("select * from parquet_scan('orders.parquet') limit 10000").fetchall()  # works
    con.sql("create table x as select * from parquet_scan('orders.parquet') limit 10000").fetchall()  # breaks
    con.sql("select * from x limit 10000;").fetchall()  # breaks
    print("ok")
    con.close()

if __name__ == "__main__":
    main()
ila commented 5 months ago

New update: it seems like Gramine breaks with queries without a LIMIT? Example:

def main():
    con = duckdb.connect(database="tpch.duckdb", read_only=False)
    # lineitem is a view defined with SELECT * FROM lineitem.parquet
    con.sql("select * from lineitem limit 100;").fetchall() # works
    con.sql("select * from lineitem;").fetchall() # breaks (I/O error)

if __name__ == "__main__":
    main()

This is beginning to stop making sense :)

Could it be related to stack/heap size? Memory allocated to Gramine? EPC size?

Another edit: I can indeed confirm that this I/O problem only happens when materializing large-size input. I have no idea how the output size could be related to I/O permissions, especially through gramine-direct. Maybe I need to change something in the BIOS?

mkow commented 5 months ago

Other access calls return -2, is that a problem?

Probably not, they are just checking what stuff is present, like /etc/ld.so.preload and cgroups (which probably are not required for the DB to work).

I uploaded the whole log: https://raw.githubusercontent.com/cwida/DuckDB-SGX/main/benchmarks/strace_new.txt

Hmm, the log is from a different script than the original post? The exceptions you pasted originally were about a different file (duckdb-parquet-encrypted/hits_encrypted.parquet). What was the exact error in the run you pasted the logs from?

Another edit: I can indeed confirm that this I/O problem only happens when materializing large-size input. I have no idea how the output size could be related to I/O permissions, especially through gramine-direct. Maybe I need to change something in the BIOS?

Are you using EDMM? If not, you may be exceeding enclave size limits (configurable in the manifest). I don't see any I/O errors in the strace log, but it's quite long, it would be easier if you managed to find the place when the app is failing (e.g. by adding some markers to the app right before and after the failure)

dimakuv commented 4 months ago

@ila Late to the party, but was this issue resolved or still open?

If it is still open, what would be the latest-working way to reproduce it? This would be interesting to debug.

ila commented 4 months ago

Yes sorry for the delay, this issue was actually in my backlog and I was going to take a deeper look at it these days. I just checked and the issue persists, I do suspect it might be caused by a file which is larger than the enclave. I tried replicating this with many scripts/files (which is why the logs I posted are slightly different from one another) and the error is still the same, I/O exception because of permissions. However, if the enclave is full and the file needs to be re-read from disk, it shouldn't give a permission error. My application is also not allowed to spill on disk.

If you want to reproduce, clone this https://github.com/cwida/ClickBench-SGX/tree/main and follow the "DuckDB (Parquet, encrypted)" section of the README -- you will probably need to edit the manifest file a bit since some paths are still hardcoded. You need Ubuntu 23.10 and Gramine latest stable. I can also provide access to the machine if needed.

I am also happy to debug further or look deeper into strace logs, but I'd need some instructions to know exactly what we are looking for.

morbitzer commented 4 months ago

I might be facing the same problem here. I apologize if it's actually something different, but also for me things work on one host but the same image crashes on another one with (pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2.

But from the start. I created a docker image that starts a jar which outputs „Hello World“, as simple as that. Using gsc build and gsc sign-image, I converted the image to a gsc-image. Running the gsc-image on one host, an Intel NUC, works as expected. However, when I copy the image to an Azure VM (via docker save and docker load), I receive the following message: (pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2 Note that when I use the python helloworld example from the documentation, the whole process works as expected on both hosts. The size of the python and the java gsc-image are similar, but my guess is that judging by the time it takes to load the images, Java consumes much more resources at runtime.

I attached the logs from both the working Intel NUC and the azure VM from which I receive the error. broken.log working.log Both run Ubuntu 22.02. Please let me know if there any other details I can compare. From my understanding, that it’s were it goes wrong: NUC:

(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/microcode/version", O_RDONLY|0x80000, 0000) = -2
(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY|0x80000, 0000) = 0x4

Azure VM:

(pal_exception.c:303:_PalExceptionHandler) error: Host injected malicious signal 2
(pal_process.c:248:_PalProcessExit) debug: PalProcessExit: Returning exit code 1

So I started the gsc-image on both hosts with -it --entrypoint /bin/bash to debug further, and it turns out that while I could read /sys/devices/system/cpu/cpu0/microcode/version from the docker-container on the NUC, the docker-container in the Azure VM said: cat: /sys/devices/system/cpu/cpu0/microcode/version: No such file or directory

What I still have to understand is why this file is accessible from the gsc-image on one host, but not on the other, when launched with the same configuration. Edit: Well, the obvious answer: The file /sys/devices/system/cpu/cpu0/microcode/version simply does not exist on the azure VM, which seems to crash the gsc-container.

dimakuv commented 4 months ago

Have you tried adding sgx.use_exinfo = true or sgx.insecure__allow_memfaults_without_exinfo = true. The former is used when EDMM is enabled, the latter is used when EDMM is disabled.

See https://gramine.readthedocs.io/en/stable/manifest-syntax.html#sgx-exinfo for explanations.

morbitzer commented 4 months ago

sgx.use_exinfo = true did indeed somewhat fix the issue. In the trace, I still see

(libos_parser.c:1658:buf_write_all) [P1:T2:java] trace: ---- openat(AT_FDCWD, "/sys/devices/system/cpu/cpu0/microcode/version", O_RDONLY|0x80000, 0000) = -2

but execution continues afterwards. Thank you!

mkow commented 3 months ago

@ila: Are you still seeing this problem? Did you try sgx.use_exinfo = true?