@sampleyang Why are you using exitless? PyTorch does not benefit much from exitless, so don't use RPC threads and see if you get better results. I also see a lot of sgx_ocall_sched_yield calls in the ocall_outer log. I thought we looked into this problem last year, where we made sched_yield a no-op, but I also see that the https://github.com/gramineproject/gramine/pull/213/ PR from Borys was not merged. @boryspoplawski @mkow any thoughts?
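For context, exitless mode is what the RPC-thread manifest option enables; a minimal sketch, assuming the option name used by recent Gramine versions (check your version's manifest docs):

```toml
# Exitless: 8 untrusted RPC threads service ocalls so enclave threads avoid
# expensive enclave exits; omit the line (or set it to 0) to disable exitless.
sgx.insecure__rpc_thread_num = 8
```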
@monavij Making `sched_yield` a no-op gave only a very minor performance improvement on one specific workload and heavily slowed down most of the others.
@monavij I just tried almost all the methods in this doc: https://gramine.readthedocs.io/en/stable/performance.html; exitless is just one of them. My table shows the performance without exitless, which seems to be the best performance I get in gramine-sgx for my application.
For the start cost, this may be related to #683: currently Gramine does not support SGX2/EDMM and initializes all enclave memory at once during startup, which can take a long time for a big enclave (64G).
For the run cost, I am not sure whether it is related to #853. The profiling file (call_inner.txt) shows that `ocall_futex` costs the most time. PyTorch uses OpenMP for parallel computing, so I wonder whether performance degrades in Gramine once the number of threads reaches a certain level (>= 16 threads).
Did you use the Gramine-patched OpenMP library? See the comment in our PyTorch example: https://github.com/gramineproject/examples/blob/master/pytorch/pytorch.manifest.template
@sampleyang My suspicion is also that you're using a "vanilla" OpenMP library, which issues raw syscalls. In Gramine, we do have a workaround for this, though it's not amazing: https://github.com/gramineproject/examples/blob/553394fcee0f6f878bea19fadb0de6548d824f1a/pytorch/pytorch.manifest.template#L62-L70
But also, if I remember correctly, there are distributions of PyTorch that come with an "improved" OpenMP library, like Intel's OpenMP Runtime Library. See https://www.intel.com/content/www/us/en/developer/articles/technical/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html (search for `libiomp5`). These PyTorch distributions show much better performance under SGX.
@dimakuv @monavij
Thanks for your help. I will try both suggestions, but I have some questions. For the first suggestion (the Gramine-patched OpenMP runtime library): where can I get the patched OpenMP, and where do I need to execute `make -C LibOS gcc`? I think `LibOS` was a directory before Gramine v1.3 and is now `libos`, and there is no `gcc` directory under either `libos` or `LibOS`, so maybe the document is not up to date. Do you have a detailed document I can follow?
@sampleyang You are correct about the first suggestion. This is a "bug" in the comment. I fixed it in this PR: https://github.com/gramineproject/examples/pull/45
Basically, you need to build Gramine something like this:
cd gramine/
meson setup build/ --buildtype=release -Ddirect=enabled -Dsgx=enabled -Dlibgomp=enabled
ninja -C build/
sudo ninja -C build/ install
Note the added `-Dlibgomp=enabled` flag! This is what builds the patched OpenMP library.
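Once installed, the patched library is pulled in via LD_PRELOAD in the manifest; a minimal sketch with an illustrative install path (the pytorch.manifest.template linked above has the exact lines):

```toml
# Preload Gramine's patched libgomp instead of the vanilla one; adjust the
# path to wherever your Gramine build installed its runtime glibc directory.
loader.env.LD_PRELOAD = "/usr/lib/x86_64-linux-gnu/gramine/runtime/glibc/libgomp.so.1"
```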
@dimakuv Thanks. With the first suggestion (the Gramine-patched OpenMP library), my application's runtime performance improved, but there is still about 47% degradation versus native. The data is as follows:
Pytorch 1.8.1 | Non-Gramine | Gramine-SGX | Gramine-SGX with patched OpenMP |
---|---|---|---|
Start Cost (seconds) | <1s | 63.22s | 62.61s |
Run Cost (seconds) | 295.79s | 497.71s (↓68.26%) | 434.72s (↓46.97%) |
Total Cost (seconds) | 296.23s | 572.78s (↓93.36%) | 509.50s (↓71.99%) |
For the second suggestion (the Intel-optimized PyTorch), I also gave it a try but ran into some problems. My installation steps were as follows:
python3 -m pip install intel_extension_for_pytorch==1.11.0
python3 -m pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable
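(As a quick sanity check that the packages installed correctly; both modules expose `__version__`:)

```python
import torch
import intel_extension_for_pytorch as ipex

# IPEX releases are paired with matching PyTorch releases, so the two
# versions printed here should line up (e.g. 1.11.x with 1.11.x).
print(torch.__version__, ipex.__version__)
```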
After installing, I just followed the doc you provided. Because there is only one NUMA node on my machine, I skipped the OpenMP section.
First, I could not find the `libiomp*.so` library. I noticed that the doc was published in 2019, so maybe it is not the latest. I set the environment variables following the doc and tested the app without gramine-sgx; the performance did not improve, it actually decreased:
export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
export LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/torch/lib/libgomp.so.1  # this is the PyTorch OpenMP lib
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="0-15"
export KMP_AFFINITY=granularity=fine,proclist=[0-15],explicit
Second, I found that intel_extension_for_pytorch and oneccl_bind_pt did not install any optimized OpenMP library. From the examples, it seems an application needs to use the SDK for model training and then inference, so I am not sure whether I need to modify my application and model to fit the Intel-optimized PyTorch. I also found a model zoo on GitHub (https://github.com/IntelAI/models/tree/pytorch-r1.10-models) that seems to cover optimizations for these models. Maybe I would need to modify a lot of things, from training to inference, to test this.
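For reference, the kind of change IPEX expects for inference is small; a minimal sketch, assuming the `ipex.optimize()` API of intel_extension_for_pytorch 1.11 and a stand-in model:

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex

# Stand-in model; in this thread's case it would be the translation model.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
model.eval()

# ipex.optimize() returns the module with IPEX's CPU optimizations applied;
# the surrounding inference code stays ordinary PyTorch.
model = ipex.optimize(model)

with torch.no_grad():
    out = model(torch.randn(8, 128))
```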
@sampleyang Thanks for the great summary! Really appreciate your detailed posts!
It's a pity you weren't able to use the "optimized Intel OpenMP library". To be honest, I last tested it a couple years ago, so maybe something changed there? Maybe others know more about how it works these days, and whether it gives significant performance boost: @svenkata9 @aneessahib @anjalirai-intel @jkr0103 .
I haven't used the OMP library recently. Can @sampleyang try the steps listed here? https://github.com/gramineproject/examples/blob/master/openvino/README.md#performance-considerations
We have seen a recent perf boost with OMP libs. @jkr0103, can you detail the steps here for @sampleyang?
@sampleyang Could you try the optimized libraries I shared over email and report the results? If that doesn't help, could you share the sample so we can reproduce the issue locally, if possible?
@jkr0103 Thanks.
I have run my application with the library (`libiomp5.so`) you provided. I also noticed the configuration `sys.brk.max_size=4G`, so I made a comparison against the library (`libgomp.so`) @dimakuv provided. The data is as follows:
Pytorch 1.8.1 | Non-Gramine | Gramine-SGX | Gramine-SGX With Patched OpenMP | Gramine-SGX With Patched OpenMP and brk.max_size=4G | Gramine-SGX With libiomp5.so | Gramine-SGX With libiomp5.so and brk.max_size=4G |
---|---|---|---|---|---|---|
Start Cost (seconds) | <1s | 63.22s | 62.61s | 62.34s | 62.23s | 63.47s |
Run Cost (seconds) | 295.79s | 497.71s (↓68.26%) | 434.72s (↓46.97%) | 369.80s (↓25.02%) | 365.76s (↓23.66%) | 363.13s (↓22.77%) |
Total Cost (seconds) | 296.23s | 572.78s (↓93.36%) | 509.50s (↓71.99%) | 444.60s (↓50.09%) | 440.38s (↓48.66%) | 439.01s (↓48.20%) |
Column 5 suggests that `brk.max_size = 4G` can further improve performance together with the Gramine-patched OpenMP library @dimakuv provided. Column 6 (libiomp5.so) also improves performance effectively, but the further improvement from adding `brk.max_size = 4G` is not obvious there (column 7). Maybe the 22%~25% Gramine runtime overhead on my application has reached a limit for now? The manifest line behind those columns is sketched below.
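For reference, the brk size is a standard Gramine manifest option; the line used in the experiment above would be:

```toml
# Enlarge the brk (program break) area, giving brk-based allocations more
# room before glibc's malloc falls back to mmap. "4G" matches the table above.
sys.brk.max_size = "4G"
```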
Other configurations mentioned in the email:

- `libos.check_invalid_pointers = false` caused an exception during startup:
[P1:T1:python3] debug: Allocated stack at 0x47bffd000 (size = 0x800000)
[P1:T1:python3] debug: loading "file://usr/bin/python3"
[P1:T1:python3] debug: find_interp: searching for interpreter: /lib/ld-linux-x86-64.so.2
[P1:T1:python3] debug: loading "file:/opt/ccp/gramine/lib/x86_64-linux-gnu/gramine/runtime/glibc/ld-linux-x86-64.so.2"
[P1:T1:python3] debug: execve: start execution
[P1:T1:python3] warning: Not supported flag (0x3001) passed to arch_prctl
[P1:T1:python3] debug: glibc register library /root/mts-ccp/lib_cpu_optimization/libiomp5.so loaded at 0x4aba00000
[P1:T1:python3] debug: glibc register library /root/mts-ccp/lib_cpu_optimization/libtcmalloc.so loaded at 0x4ab400000
[P1:T1:python3] debug: glibc register library /lib/libc.so.6 loaded at 0x4ab806000
[P1:T1:python3] debug: glibc register library /lib/libpthread.so.0 loaded at 0x4abec9000
[P1:T1:python3] debug: glibc register library /lib/libdl.so.2 loaded at 0x4abec2000
[P1:T1:python3] debug: glibc register library /lib/libutil.so.1 loaded at 0x4abebd000
[P1:T1:python3] debug: glibc register library /lib/libm.so.6 loaded at 0x4ab322000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libexpat.so.1 loaded at 0x4abe8f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libz.so.1 loaded at 0x4abe73000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 loaded at 0x4abe58000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libstdc++.so.6 loaded at 0x4ab140000
[P1:T1:python3] warning: Unsupported system call rseq
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: Unsupported system call faccessat2
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac5d000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libffi.so.7 loaded at 0x4abe06000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_queue.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac57000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_opcode.cpython-38-x86_64-linux-gnu.so loaded at 0x4ab801000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_bz2.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac4f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libbz2.so.1.0 loaded at 0x4aabad000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_lzma.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac43000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/liblzma.so.5 loaded at 0x4aab84000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_global_deps.so loaded at 0x4aa400000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libgomp-a34b3233.so.1 loaded at 0x4aa000000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/_C.cpython-38-x86_64-linux-gnu.so loaded at 0x4a9c00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so loaded at 0x4a8a00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libshm.so loaded at 0x4a8600000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch.so loaded at 0x4a8200000
[P1:T1:python3] debug: glibc register library /lib/librt.so.1 loaded at 0x4aaafb000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so loaded at 0x4a7e00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so loaded at 0x496600000
[P1:T1:python3] error: Internal memory fault at 0x00000000 (0xffce66b9a, VMID = 1, TID = 1)
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1
- `sgx.preheat_enclave` doubles the enclave load time to 130s.
- `/sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor` does not exist on my machine.

The overhead (runtime cost) of ~25% sounds reasonable for a heavily-multithreaded workload.
I noticed that you compared Gramine-SGX with `OMP_NUM_THREADS = "8"` against native with `OMP_NUM_THREADS` not set. In other words, you compared an 8-thread run of Gramine-SGX against a 32-thread run of native (32 threads because PyTorch spawns as many threads as there are CPU cores available, unless there is an explicit `OMP_NUM_THREADS` limit).
@sampleyang So another interesting experiment you could do is to try different `OMP_NUM_THREADS` values, both in native (by prepending the command line with `OMP_NUM_THREADS=8 python3 ...`) and in Gramine-SGX (by adding the line `loader.env.OMP_NUM_THREADS = "8"` in the manifest file), as sketched below.
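A minimal sketch of the two matched invocations (the script name is hypothetical):

```sh
# Native run, capped at 8 OpenMP threads to match the Gramine-SGX run
OMP_NUM_THREADS=8 python3 translate.py
```

```toml
# Gramine-SGX manifest: the same 8-thread cap
loader.env.OMP_NUM_THREADS = "8"
```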
@dimakuv
I have tried this configuration before; `OMP_NUM_THREADS = "8"` or `OMP_NUM_THREADS = "16"` does not improve performance effectively on my application, and actually reduces it. On my machine the native default for PyTorch is 16 threads. There is test data for `OMP_NUM_THREADS = "8"` in my first report.

As for the libraries (libiomp5.so and libtcmalloc.so) provided by @jkr0103, I received them via email. So they are not a release version? How can I get an official release version? If they are not released, maybe my best choice at present is the Gramine-patched OpenMP library, since the two libraries have almost the same impact on performance. Do you have any suggestions for this, @dimakuv?
> OMP_NUM_THREADS = "8" or OMP_NUM_THREADS = "16" cannot improve performance effectively on my application, and it will reduce the performance.

Yes, I know this. But my point was: in your table, you compared a native run (with 16 threads, according to your comment) against a Gramine-SGX run (with 8 threads). This is misleading; you can't compare two different configurations.

Anyway, this was just a remark. I don't see any quick ways to decrease your 25% overhead even further. You need to analyze performance bottlenecks (via Gramine's `perf` tooling, sketched below) and maybe debug a bit.
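A sketch of that profiling flow, assuming the SGX profiling options documented for recent Gramine builds made with `--buildtype=debugoptimized` (option names may differ across versions):

```toml
# Manifest: sample the main process on asynchronous enclave exits (AEXs),
# collecting stack traces so hotspots can be attributed to functions.
sgx.profile.enable = "main"
sgx.profile.mode = "aex"
sgx.profile.with_stack = true
```

After the run, Gramine emits a perf-compatible profile (named like sgx-perf.data) that can be inspected with `perf report -i sgx-perf.data`.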
Maybe my description above was not clear enough. PyTorch by default uses cpus/2 as the thread number, so on my machine it is 16 threads natively without any configuration. In my first report table, the Non-Gramine column means native with 16 threads, and the Gramine-SGX column also means 16 threads inside Gramine-SGX (because the default is 16); `OMP_NUM_THREADS = "8"` means 8 threads after setting it. I also tested `OMP_NUM_THREADS = "16"` but did not report it.
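(For reference, the effective intra-op thread count can be confirmed with standard PyTorch API:)

```python
import torch

# Prints the intra-op thread pool size; honors OMP_NUM_THREADS when set.
print(torch.get_num_threads())
```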
The ~25% runtime overhead is currently acceptable. Thanks for your help.
Hi, @jkr0103
I'm curious about the PyTorch performance in the SGX enclave. Currently, I'm using Gramine v1.3.1 with PyTorch 1.13.1.
I've also conducted the following experiment:
I tried to use the Intel OpenMP library to boost the performance of PyTorch training, which heavily utilizes OpenMP. However, when I added `libiomp5.so` to `sgx.allowed_files` and `loader.env.LD_PRELOAD` and ran my application, I encountered the following error:
The environment variable settings:
export OMP_NUM_THREADS=32
export MKL_NUM_THREADS=32
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="1-32"
export KMP_AFFINITY=granularity=fine,proclist=[1-32],explicit
root@tdx-master:/ppml# gramine-sgx bash
Gramine is starting. Parsing TOML manifest file, this may take some time...
-----------------------------------------------------------------------------------------------------------------------
Gramine detected the following insecure configurations:
- loader.insecure__use_host_env = true (forwarding environment vars from untrusted host to the app)
- sgx.file_check_policy = allow_all_but_log (all files are passed through from untrusted host without verification)
- sgx.allowed_files = [ ... ] (some files are passed through from untrusted host without verification)
Gramine will continue application execution, but this configuration must not be used in production!
-----------------------------------------------------------------------------------------------------------------------
bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
OMP: Error #179: Function Can't open SHM2 failed:
OMP: System error #2: No such file or directory
Here is the log information at log level all:
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/dev/shm/__KMP_REGISTERED_LIB_1_0", O_RDWR|O_CREAT|O_EXCL|0xa0000, 0666) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C.UTF-8/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C.UTF-8/LC_MESSAGES/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C/LC_MESSAGES/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- sched_yield() = 0x0
[P1:T1:python3] trace: ---- write(2, 0x8de717090, 0x62) ...
OMP: Error #179: Function Can't open SHM2 failed:
Could you please help me understand whether this is a configuration error in my manifest file or whether we need a special version of `libiomp5.so` for Gramine?
I tried to profile my application, but encountered the following error after building Gramine with:
meson setup build/ --buildtype=debugoptimized -Dsgx=enabled -Dsgx_driver=dcap1.10 -Dlibgomp=enabled
root@tdx-master:/ppml# gramine-sgx bash
Gramine is starting. Parsing TOML manifest file, this may take some time...
error: sgx_profile_report_elf([vdso]): realpath failed
error: Initializing enclave failed: -1
error: load_enclave() failed with error -1
> [P1:T1:python3] trace: ---- openat(AT_FDCWD, "/dev/shm/__KMP_REGISTERED_LIB_1_0", O_RDWR|O_CREAT|O_EXCL|0xa0000, 0666) = -2
@gc-fu This is the problem in `libiomp5.so`. Gramine doesn't support shared memory (`/dev/shm/`, or SHM2), so the Intel OpenMP library fails.
But this is unexpected behavior to me; I don't remember the Intel OpenMP library requiring shared memory.

Well, it looks like OpenMP indeed got support for shared memory, and I don't see how it can be disabled. Some links:
Looks like you'll have to continue experimenting with this... I can't help here.
Thanks for your reply. I will try to see if I can find a workaround for using Intel OpenMP.
Description of the problem
I am trying to move a machine translation inference application based on PyTorch to gramine-sgx, in order to protect the translation process. After getting it to run in gramine-sgx successfully, I tested the performance, but the measured performance differs too much between non-Gramine and gramine-sgx. I changed the configurations according to the official documentation, but it seems to have no effect.
Steps to reproduce
No response
Expected results
No response
Actual results
No response
Gramine commit hash
1.3.1