Poor performance of pytorch in machine translation

sampleyang commented 1 year ago

Description of the problem

I try to move a machine translation inference application which base on pytorch to gramine-sgx, for protect the translation process. After run in gramine-sgx successfully, i test the performance. But the tested performance data differ too much between non-gramine and gramine-sgx. I changed the configurations according to the official document, but it seems take no effective.

Environment: 32C, 256G(128G normal memory + 128G encrypted memory)

Manifest:

#loader.env.OMP_NUM_THREADS = "8"
#sys.stack.size = "4M"
#loader.pal_internal_mem_size = "256M"
#libos.check_invalid_pointers = false
#sgx.preheat_enclave = true
sgx.nonpie_binary = true
gx.enclave_size = "64G"
sgx.thread_num = 128
sgx.remote_attestation = "dcap"
#sgx.require_avx = true
#sgx.require_avx512 = true
#sgx.insecure__rpc_thread_num = 8

Test Data(in container) Start Cost: Statistics of enclave load time RunCost: Statistics of application run time in enclave

	non-gramine	gramine-sgx	insecure__rpc_thread_num = 8	stack.size = "4M"pal_internal_mem_size = "256M"	check_invalid_pointers = falsepreheat_enclave = true	require_avx = truerequire_avx512 = true	OMP_NUM_THREADS = "8"
Start Cost(s)	<1s	64s	63s	63s	142s	63s	69s
Run Cost(s)	294s	495s	1002s	496s	450s	497s	737s
Total Cost(s)	295s	559s	1065s	559s	592s	560s	806s

# numactl --hardware
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 126294 MB
node 0 free: 124084 MB
node distances:
node   0 
  0:  10

Profiling It seems too many ocalls in gramine-sgx, especially ocall_futex. perf-stats.txt ocall_inner.txt ocall_outer.txt

Steps to reproduce

No response

Expected results

No response

Actual results

No response

Gramine commit hash

1.3.1

monavij commented 1 year ago

@sampleyang Why are you using exitless? Pytorch does not benefit much from using exitless, so don't use RPC threads and see if you get better results. I also see a lot of sgx_ocall_sched_yield calls in the ocall_outer log. I thought we did look into this problem last year where we make sched_yield a no-op. But i also see that the https://github.com/gramineproject/gramine/pull/213/ PR from Borys was not merged. @boryspoplawski @mkow any thoughts?

boryspoplawski commented 1 year ago

@monavij Making sched_yield a no-op gave like very minor performance improvements to one specific workload and slowed down super heavily most of the workloads

sampleyang commented 1 year ago

@sampleyang Why are you using exitless? Pytorch does not benefit much from using exitless, so don't use RPC threads and see if you get better results. I also see a lot of sgx_ocall_sched_yield calls in the ocall_outer log. I thought we did look into this problem last year where we make sched_yield a no-op. But i also see that the #213 PR from Borys was not merged. @boryspoplawski @mkow any thoughts?

@monavij I just tried almost all the methods in this doc "https://gramine.readthedocs.io/en/stable/performance.html", eixtless is just one of them. My table contains the performance without exitless, which seems is the best performance in gramine-sgx for my application.

For the start cost, maybe related with #683, currently gramine does not support sgx2 and edmm, and will initialize all enclave memory at once during startup, for the big enclave memory(64G) maybe cost a lot of time.

For the run cost, i am not sure if it is related with #853. From the profiling file(call_inner.txt) can see 'ocall_futex' cost the most time. PyTorch will use openMP for parallel computing, whether the performance will be affected when the number of threads(>= 16threads) reaches a certain level in gramine.

monavij commented 1 year ago

Did you use the Gramine patched OpenMP library? See a comment in our pytorch example - https://github.com/gramineproject/examples/blob/master/pytorch/pytorch.manifest.template

dimakuv commented 1 year ago

@sampleyang My suspicion is also that you're using a "vanilla" OpenMP library, which issues raw syscalls. In Gramine, we do have a workaround for this, though it's not amazing: https://github.com/gramineproject/examples/blob/553394fcee0f6f878bea19fadb0de6548d824f1a/pytorch/pytorch.manifest.template#L62-L70

But also, if I remember correctly, there are distributions of PyTorch that come with "improved" OpenMP library, like the Intel's OpenMP Runtime Library. See https://www.intel.com/content/www/us/en/developer/articles/technical/how-to-get-better-performance-on-pytorchcaffe2-with-intel-acceleration.html (search for libiomp5). These PyTorch distributions show much better performance under SGX.

sampleyang commented 1 year ago

@sampleyang My suspicion is also that you're using a "vanilla" OpenMP library, which issues raw syscalls. In Gramine, we do have a workaround for this, though it's not amazing: https://github.com/gramineproject/examples/blob/553394fcee0f6f878bea19fadb0de6548d824f1a/pytorch/pytorch.manifest.template#L62-L70

@dimakuv @monavij Thanks for your help. I can have a try about this two suggestions. And there are some questions with use . For 1st gramine patched OpenMP runtime library, where can i get the patched OpenMP and where i need to execute make -C LibOS gcc? I think LibOS is a directory before gramine v1.3, and now is libos, also there is no gcc directory under libos or LibOS, maybe the document is not the latest. Do you have a detail document i can follow.

dimakuv commented 1 year ago

@sampleyang You are correct about the first suggestion. This is a "bug" in the comment. I fixed it in this PR: https://github.com/gramineproject/examples/pull/45

Basically, you need to build Gramine something like this:

cd gramine/
meson setup build/ --buildtype=release -Ddirect=enabled -Dsgx=enabled -Dlibgomp=enabled
ninja -C build/
sudo ninja -C build/ install

Note the added -Dlibgomp=enabled flag! This is what builds the patched OpenMP library.

sampleyang commented 1 year ago

@sampleyang You are correct about the first suggestion. This is a "bug" in the comment. I fixed it in this PR: gramineproject/examples#45

Basically, you need to build Gramine something like this:
cd gramine/
meson setup build/ --buildtype=release -Ddirect=enabled -Dsgx=enabled -Dlibgomp=enabled
ninja -C build/
sudo ninja -C build/ install
Note the added -Dlibgomp=enabled flag! This is what builds the patched OpenMP library.

@dimakuv Thanks. For the 1st suggestion with Gramine patched OpenMP library, the performance for my application runtime improved, and also have about 47% performance degradation. The data as follow:

Pytorch 1.8.1	Non-Gramine	Gramine-SGX	Gramine-SGX with patched OpenMP
Start Cost(seconds)	<1s	63.22s	62.61s
Run Cost(seconds)	295.79s	497.71s（↓68.26%）	434.72s（↓46.97%）
Tost Cost(seconds)	296.23s	572.78s（↓93.36%）	509.50s（↓71.99%）

For the 2nd suggestion with intel improved PyTorch, i also have a try, but find some problem. My ops as follow for installation:

python3 -m pip install intel_extension_for_pytorch==1.11.0
python3 -m pip install oneccl_bind_pt==1.11.0 -f https://software.intel.com/ipex-whl-stable

After install I just follow the doc you applied. Because there is only 1 numa node on my machine, so i just jump the OpenMp section. First, I didn't find library libiomp*.so. I noticed that the doc was published at 2019, maybe it is not the latest. And I just set the env follow the doc and test the app without gramine-sgx, the performance didn't improve, but it decreased.

export OMP_NUM_THREADS=16
export MKL_NUM_THREADS=16
export LD_PRELOAD=/usr/local/lib/python3.8/dist-packages/torch/lib/libgomp.so.1  // this is the pytorch openmp lib
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="0-15"
export KMP_AFFINITY=granularity=fine,proclist=[0-15],explicit

Second, i find that the intel_extension_for_pytorch and oneccl_bind_pt didn't install any optimized OpenMP library. From the examples, it seems application need to use the sdk for model training and then inference. So I am not sure whether i need modify my application and model to fit intel improved PyTorch. I also find a models zoo on GitHub(https://github.com/IntelAI/models/tree/pytorch-r1.10-models) which seems there is an optimization coverage for the models. Maybe i need to modify a lot of things from training to inference for test this.

dimakuv commented 1 year ago

@sampleyang Thanks for the great summary! Really appreciate your detailed posts!

It's a pity you weren't able to use the "optimized Intel OpenMP library". To be honest, I last tested it a couple years ago, so maybe something changed there? Maybe others know more about how it works these days, and whether it gives significant performance boost: @svenkata9 @aneessahib @anjalirai-intel @jkr0103 .

svenkata9 commented 1 year ago

I haven't recently used OMP library. Can @sampleyang try the stuff listed here? https://github.com/gramineproject/examples/blob/master/openvino/README.md#performance-considerations

aneessahib commented 1 year ago

We have seen recent perf boost with OMP libs. @jkr0103 - can you detail the steps here for @sampleyang

jkr0103 commented 1 year ago

@sampleyang Could you try the optimized libraries I shared over email and report the results. If that doesn't help, could you share the sample for us to reproduce the issue locally if possible?

sampleyang commented 1 year ago

@jkr0103 Thanks. I have run my application with the library(libiomp5.so) you applied. And also notice that one configuration sys.brk.max_size=4G, so just make a comparison with the library(libgomp.so) @dimakuv applied. The data as follow:

Pytorch 1.8.1	Non-Gramine	Gramine-SGX	Gramine-SGX With Patched OpenMP	Gramine-SGX With Patched OpenMP and brk.max_size=4G	Gramine-SGX With libiomp5.so	Gramine-SGX With libiomp5.so and brk.max_size=4G
Start Cost(seconds)	<1s	63.22s	62.61s	62.34s	62.23s	63.47s
Run Cost(seconds)	295.79s	497.71s（↓68.26%）	434.72s（↓46.97%）	369.80s（↓25.02%）	365.76s（23.66↓%）	363.13s（↓22.77%）
Tost Cost(seconds)	296.23s	572.78s（↓93.36%）	509.50s（↓71.99%）	444.60s（↓50.09%）	440.38s（48.66↓%）	439.01s（↓48.20%）

it(Column 5) seems brk.max_size = 4G can further improve performance with the library gramine patched OpenMP @dimakuv applied.
The library libiomp5.so you applied(Column 6) also can improve performance effectively. But the performance improvement is not obvious after add brk.max_size = 4G(Column 7). Maybe the overhead 22%~25% for gramine runtime on my application has reached a limit for now?

Other configurations mentioned in the email：

libos.check_invalid_pointers = false caused a exception during startup.

[P1:T1:python3] debug: Allocated stack at 0x47bffd000 (size = 0x800000)
[P1:T1:python3] debug: loading "file://usr/bin/python3"
[P1:T1:python3] debug: find_interp: searching for interpreter: /lib/ld-linux-x86-64.so.2
[P1:T1:python3] debug: loading "file:/opt/ccp/gramine/lib/x86_64-linux-gnu/gramine/runtime/glibc/ld-linux-x86-64.so.2"
[P1:T1:python3] debug: execve: start execution
[P1:T1:python3] warning: Not supported flag (0x3001) passed to arch_prctl
[P1:T1:python3] debug: glibc register library /root/mts-ccp/lib_cpu_optimization/libiomp5.so loaded at 0x4aba00000
[P1:T1:python3] debug: glibc register library /root/mts-ccp/lib_cpu_optimization/libtcmalloc.so loaded at 0x4ab400000
[P1:T1:python3] debug: glibc register library /lib/libc.so.6 loaded at 0x4ab806000
[P1:T1:python3] debug: glibc register library /lib/libpthread.so.0 loaded at 0x4abec9000
[P1:T1:python3] debug: glibc register library /lib/libdl.so.2 loaded at 0x4abec2000
[P1:T1:python3] debug: glibc register library /lib/libutil.so.1 loaded at 0x4abebd000
[P1:T1:python3] debug: glibc register library /lib/libm.so.6 loaded at 0x4ab322000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libexpat.so.1 loaded at 0x4abe8f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libz.so.1 loaded at 0x4abe73000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgcc_s.so.1 loaded at 0x4abe58000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libstdc++.so.6 loaded at 0x4ab140000
[P1:T1:python3] warning: Unsupported system call rseq
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: Unsupported system call faccessat2
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_ctypes.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac5d000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libffi.so.7 loaded at 0x4abe06000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_queue.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac57000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_opcode.cpython-38-x86_64-linux-gnu.so loaded at 0x4ab801000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_bz2.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac4f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libbz2.so.1.0 loaded at 0x4aabad000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_lzma.cpython-38-x86_64-linux-gnu.so loaded at 0x4aac43000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/liblzma.so.5 loaded at 0x4aab84000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_global_deps.so loaded at 0x4aa400000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libgomp-a34b3233.so.1 loaded at 0x4aa000000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/_C.cpython-38-x86_64-linux-gnu.so loaded at 0x4a9c00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_python.so loaded at 0x4a8a00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libshm.so loaded at 0x4a8600000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch.so loaded at 0x4a8200000
[P1:T1:python3] debug: glibc register library /lib/librt.so.1 loaded at 0x4aaafb000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libc10.so loaded at 0x4a7e00000
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/torch/lib/libtorch_cpu.so loaded at 0x496600000
[P1:T1:python3] error: Internal memory fault at 0x00000000 (0xffce66b9a, VMID = 1, TID = 1)
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1

sgx.preheat_enclave this will double the enclave load time to 130s.
/sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor my machine doesn't have.

dimakuv commented 1 year ago

The overhead (runtime cost) of ~25% sounds reasonable for a heavily-multithreaded workload.

I noticed that you compared Gramine-SGX with OMP_NUM_THREADS = "8" against the native with OMP_NUM_THREADS not set. In other words, you compared an 8-threads run of Gramine-SGX against a 32-threads run of native (32 threads because PyTorch spawns as many threads as there are CPU cores available, unless there is an explicit limit OMP_NUM_THREADS).

@sampleyang So another interesting experiment you could do is to try different OMP_NUM_THREADS, both in native (by prepending the command line with OMP_NUM_THREADS=8 python3 ...) and in Gramine-SGX (by adding the line loader.env.OMP_NUM_THREADS = "8" in the manifest file).

sampleyang commented 1 year ago

The overhead (runtime cost) of ~25% sounds reasonable for a heavily-multithreaded workload.

I noticed that you compared Gramine-SGX with OMP_NUM_THREADS = "8" against the native with OMP_NUM_THREADS not set. In other words, you compared an 8-threads run of Gramine-SGX against a 32-threads run of native (32 threads because PyTorch spawns as many threads as there are CPU cores available, unless there is an explicit limit OMP_NUM_THREADS).

@sampleyang So another interesting experiment you could do is to try different OMP_NUM_THREADS, both in native (by prepending the command line with OMP_NUM_THREADS=8 python3 ...) and in Gramine-SGX (by adding the line loader.env.OMP_NUM_THREADS = "8" in the manifest file).

@dimakuv I have tried this configuration before，OMP_NUM_THREADS = "8" or OMP_NUM_THREADS = "16" cannot improve performance effectively on my application, and it will reduce the performance. On my machine the native will be 16-threads for pytorch. There is a test data for OMP_NUM_THREADS = "8" in my first report.

For the library(libiomp5.so and libtcmalloc.so) applied by @jkr0103, I saw that sent to me via email. So the library is not a release version? or how i can get a official release version？if it is not release, maybe my best choice is gramine patched OpenMP library at present. These two libraries have almost the same impact on performance. Do you have any suggestions for this @dimakuv ?

dimakuv commented 1 year ago

OMP_NUM_THREADS = "8" or OMP_NUM_THREADS = "16" cannot improve performance effectively on my application, and it will reduce the performance.

Yes, I know this. But my point was -- in your table, you compared native (with 16 threads, according to your comment) run with a Gramine-SGX (with 8 threads) run. This is misleading, you can't compare two different configurations.

Anyway, this was just a remark. I don't see any quick ways to decrease your 25% overhead even further. You need to analyze performance bottlenecks (via Gramine's perf tooling) and maybe debug a bit.

sampleyang commented 1 year ago

OMP_NUM_THREADS = "8" or OMP_NUM_THREADS = "16" cannot improve performance effectively on my application, and it will reduce the performance.

Yes, I know this. But my point was -- in your table, you compared native (with 16 threads, according to your comment) run with a Gramine-SGX (with 8 threads) run. This is misleading, you can't compare two different configurations.

Anyway, this was just a remark. I don't see any quick ways to decrease your 25% overhead even further. You need to analyze performance bottlenecks (via Gramine's perf tooling) and maybe debug a bit.

Maybe my description is not clear enough above. The pytorch default use cpus/2 for the thread number . So on my machine without any configuration it is 16-threads for native default. And in my 1st report table, Non-Gramine Column means with native 16-threads, Gramine-SGX Column also means with native 16-thread in Gramine-SGX(Because default is 16), and OMP_NUM_THREADS = "8" means with 8-threads after setted, and i also test OMP_NUM_THREADS = "16" but didn't report.

The overhead (runtime cost) of ~25% for runtime is currently acceptable. Thanks for your help.

gc-fu commented 1 year ago

Hi, @jkr0103

I'm curious about the PyTorch performance in the SGX enclave. Currently, I'm using Gramine v1.3.1 with PyTorch 1.13.1.

I've also conducted the following experiment:

I tried to use the Intel OpenMP library to boost the performance of PyTorch training, which heavily utilizes OpenMP. However, when I added libiomp5.so to sgx.allowed_files and loader.env.LD_PRELOAD, and ran my application, I encountered the following error:

The environment variable setting:

export OMP_NUM_THREADS=32
export MKL_NUM_THREADS=32
export OMP_SCHEDULE=STATIC
export OMP_PROC_BIND=CLOSE
export GOMP_CPU_AFFINITY="1-32"
export KMP_AFFINITY=granularity=fine,proclist=[1-32],explicit

root@tdx-master:/ppml# gramine-sgx bash
Gramine is starting. Parsing TOML manifest file, this may take some time...
-----------------------------------------------------------------------------------------------------------------------
Gramine detected the following insecure configurations:

  - loader.insecure__use_host_env = true       (forwarding environment vars from untrusted host to the app)
  - sgx.file_check_policy = allow_all_but_log  (all files are passed through from untrusted host without verification)
  - sgx.allowed_files = [ ... ]                (some files are passed through from untrusted host without verification)

Gramine will continue application execution, but this configuration must not be used in production!
-----------------------------------------------------------------------------------------------------------------------

bash: warning: setlocale: LC_ALL: cannot change locale (C.UTF-8)
OMP: Error #179: Function Can't open SHM2 failed:
OMP: System error #2: No such file or directory

Here is the all-level log information:

[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/dev/shm/__KMP_REGISTERED_LIB_1_0", O_RDWR|O_CREAT|O_EXCL|0xa0000, 0666) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C.UTF-8/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C.UTF-8/LC_MESSAGES/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/share/locale/C/LC_MESSAGES/libiomp5.cat", O_RDONLY, 0000) = -2
[P1:T1:python3] trace: ---- sched_yield() = 0x0
[P1:T1:python3] trace: ---- write(2, 0x8de717090, 0x62) ...
OMP: Error #179: Function Can't open SHM2 failed:

Could you please help me understand whether this is a configuration error in my manifest file or whether we need a special version of libiomp5.so for Gramine?

I tried to profile my application, but encounter the following error after building gramine with commands

meson setup build/ --buildtype=debugoptimized -Dsgx=enabled -Dsgx_driver=dcap1.10 -Dlibgomp=enabled

root@tdx-master:/ppml# gramine-sgx bash
Gramine is starting. Parsing TOML manifest file, this may take some time...
error: sgx_profile_report_elf([vdso]): realpath failed
error: Initializing enclave failed: -1
error: load_enclave() failed with error -1

dimakuv commented 1 year ago

[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/dev/shm/__KMP_REGISTERED_LIB_1_0", O_RDWR|O_CREAT|O_EXCL|0xa0000, 0666) = -2

@gc-fu This is the problem in libiomp5.so. Gramine doesn't support shared memory (/dev/shm/, or SHM2). So the Intel OpenMP library fails.

But this is unexpected behavior to me. I don't remember that Intel OpenMP library requires shared memory.

Well, looks like OpenMP indeed got this support for shared memory, and I don't see how it can be disabled. Some links:

https://www.intel.com/content/www/us/en/docs/cpp-compiler/developer-guide-reference/2021-8/supported-environment-variables.html (no envvar to disable shared memory found...)
https://isrc.iscas.ac.cn/gitlab/mirrors/github.com/llvm_llvm-project/-/commit/d6a0957467e86d5a87964d45fae18733e212c86f

Looks like you'll have to continue experimenting with this... I can't help here.

gc-fu commented 1 year ago

Thanks for your reply, I will try to see if I can find some workaround for using intel-OpenMP

gramineproject / gramine