I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?
I ran cat /proc/<pid>/status | grep Threads and found it's using 49 threads without Graphene (see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7388 root 20 0 23.572g 993216 133064 R 2933 0.4 2:43.90 python3
) and 51 threads with graphene-sgx (see:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
62493 root 20 0 32.299g 202800 34620 S 4720 0.1 177:41.91 loader
). In the manifest, sgx.thread_num is 512; setting it to 128 causes a "Please try to increase sgx.thread_num in the manifest" error.
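For reference, here is a minimal sketch (an editor's addition, not from the thread) of the check described above: it parses the Threads: field from /proc/<pid>/status, assuming a Linux host and that <pid> is the python3 or loader process shown in the top output.

import sys

def thread_count(pid):
    # Read /proc/<pid>/status and return the value of its "Threads:" field.
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    raise RuntimeError("Threads: field not found for pid %d" % pid)

if __name__ == "__main__":
    print(thread_count(int(sys.argv[1])))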
There are 2 physical CPUs, each has 24 cores, and there are 96 logical CPUs.
Here's a snippet of cat /proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 0
siblings : 48
core id : 0
cpu cores : 24
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4800.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
...
processor : 95
vendor_id : GenuineIntel
cpu family : 6
model : 106
model name : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping : 6
microcode : 0xd0002a0
cpu MHz : 2400.000
cache size : 36864 KB
physical id : 1
siblings : 48
core id : 23
cpu cores : 24
apicid : 111
initial apicid : 111
fpu : yes
fpu_exception : yes
cpuid level : 27
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips : 4802.23
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 57 bits virtual
power management:
Also, /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor is set to "performance".
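A quick way to double-check that (a hedged sketch, not part of the original report) is to read every core's scaling_governor file and flag any core that is not in performance mode:

import glob

# Print any CPU whose scaling governor is not set to "performance".
for path in sorted(glob.glob("/sys/devices/system/cpu/cpu*/cpufreq/scaling_governor")):
    with open(path) as f:
        governor = f.read().strip()
    if governor != "performance":
        print(path, "->", governor)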
What's your enclave size? There is an initial loading time for graphene-sgx depending on the enclave size. Also, do you see the same problem with graphene-direct?
The enclave size is 32G; 16G causes a "Cannot allocate memory" error. Yes, the larger the enclave size, the longer the loading time, but a 32G enclave takes about 1 hour to load. Is that normal?
Besides, graphene-direct works fine with the example; it runs as fast as without Graphene.
It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible:
Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.
So an 18x performance drop is very common? I've enabled preheat, used tcmalloc, and enabled performance mode; are there any other methods to optimize?
Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?
90 seconds is just one training step as below:
for batch_idx, (data, target) in enumerate(train_loader):
    logger.info("start step")
    start_time = time.time()
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = client_model(data)
    loss = soft_dice_loss(output, target)
    loss.backward()
    optimizer.step()
    end_time = time.time()
    time_spend = end_time - start_time
    logger.info("end step, time spend: %.3f s" % time_spend)
Each loop iteration takes about 90 seconds in SGX, but only 5 seconds without SGX. And from the moment graphene-sgx starts until the training starts, it takes 1 hour; is that also possible?
Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX. @Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?
Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.
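For illustration only, here is a sketch of the same idea done from inside the training script instead of via numactl (an editor's addition; the CPU numbering below is an assumption based on the /proc/cpuinfo above, so verify the real mapping with lscpu -e first, and note that affinity calls made inside the enclave may not be honored by Graphene, in which case numactl on the host side remains the reliable option):

import os

# Assumed topology: logical CPUs 0-23 are the cores of socket 0 and 48-71 are
# their hyperthread siblings; adjust to what lscpu actually reports.
SOCKET0_CPUS = set(range(0, 24)) | set(range(48, 72))
os.sched_setaffinity(0, SOCKET0_CPUS)  # 0 means "the calling process"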
Do you mean to run with the command "numactl --cpunodebind=0 --membind=0 graphene-sgx python scripts/train.py"? I've tried it, and the performance of each step improves by about 10 seconds; see the logs:
2021-08-31 09:46:23,137 INFO __main__: Start training
2021-08-31 09:46:24,584 INFO __main__: start step
2021-08-31 09:46:55,982 INFO __main__: end step, time spend: 31.396 s
2021-08-31 09:46:56,234 INFO __main__: start step
2021-08-31 09:47:56,709 INFO __main__: end step, time spend: 60.472 s
2021-08-31 09:47:56,931 INFO __main__: start step
2021-08-31 09:49:12,775 INFO __main__: end step, time spend: 75.841 s
2021-08-31 09:49:13,015 INFO __main__: start step
2021-08-31 09:50:40,776 INFO __main__: end step, time spend: 87.758 s
2021-08-31 09:50:41,036 INFO __main__: start step
2021-08-31 09:52:04,008 INFO __main__: end step, time spend: 82.969 s
2021-08-31 09:52:04,240 INFO __main__: start step
2021-08-31 09:53:28,593 INFO __main__: end step, time spend: 84.350 s
2021-08-31 09:53:28,842 INFO __main__: start step
2021-08-31 09:54:53,361 INFO __main__: end step, time spend: 84.517 s
2021-08-31 09:54:53,609 INFO __main__: start step
2021-08-31 09:56:20,133 INFO __main__: end step, time spend: 86.521 s
...
It's still a huge performance drop compared to not using SGX. And after Graphene starts, it still takes about 1 hour before training starts.
As far as I know, DataLoader only starts multiple processes when num_workers > 0. But specifying multiple workers causes a "Function not implemented" error in Graphene, so I've set it to 0, see:
training_set = dataloader.KvasirDataset(DATA_PATH, collaborator_count=1, collaborator_num=0, is_validation=False)
train_loader = DataLoader(training_set, num_workers=0, batch_size=6, shuffle=True)
which means it only starts one process. And the data-loading iteration itself (the for batch_idx, (data, target) in enumerate(train_loader) part) runs fast, as shown in the logs.
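One way to confirm that (a small diagnostic sketch, not from the thread, reusing the train_loader defined above) is to iterate over the loader without the forward/backward pass and time it on its own:

import time

# Time pure data loading, with no model computation, to isolate the loader cost.
t0 = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    pass  # only fetch and decode batches
print("pure data loading took %.3f s" % (time.time() - t0))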
@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file and attach it here, so we can take a closer look at what Graphene tries to do and where the performance bottleneck may come from.
We tried your workload at our end using the same manifest files and see only about 20% overhead. This is probably a setup issue at your end; we will send you the details separately.
Thanks Anees! I've seen that. Jitender found that the PRMRR size of my machine was too small; after I increased it to the maximum of 64GB and added some optimizations, including openmpi and libiomp5.so, the performance with SGX is even a little bit better than without SGX! See:
error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
2021-09-01 16:18:36,486 INFO __main__: Start training
2021-09-01 16:18:37,478 INFO __main__: start step
2021-09-01 16:18:50,832 INFO __main__: end step, time spend: 13.353 s
2021-09-01 16:18:50,928 INFO __main__: start step
2021-09-01 16:18:55,652 INFO __main__: end step, time spend: 4.724 s
2021-09-01 16:18:55,770 INFO __main__: start step
2021-09-01 16:18:58,442 INFO __main__: end step, time spend: 2.672 s
2021-09-01 16:18:58,530 INFO __main__: start step
2021-09-01 16:19:01,091 INFO __main__: end step, time spend: 2.561 s
2021-09-01 16:19:01,179 INFO __main__: start step
2021-09-01 16:19:03,866 INFO __main__: end step, time spend: 2.687 s
2021-09-01 16:19:03,956 INFO __main__: start step
2021-09-01 16:19:06,010 INFO __main__: end step, time spend: 2.053 s
2021-09-01 16:19:06,100 INFO __main__: start step
2021-09-01 16:19:08,168 INFO __main__: end step, time spend: 2.068 s
...
However, it still takes about 1 hour before the training starts, i.e. between the log line "error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!" and the log line "/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: ...". On your machine, it didn't take that much time.
I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?
I've run with trace log level and got the log. The whole log file is too large, about 1.7G, so I split off a 100M chunk and compressed it here: pytorch-log.zip
I've tried disabling preheat and found no effect.
@Kanakoo You can try a couple of things - please refer to PR 2530. Before running the workload:
Linux systems have a CPU frequency scaling governor that helps the system scale the CPU frequency to achieve the best performance or to save power, based on the requirement. To achieve the best performance, please set the CPU frequency scaling governor to performance mode:
for ((i=0; i<$(nproc); i++)); do echo 'performance' > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done
Add sgx.preheat_enclave = 1 to the manifest template.
To preload tcmalloc, add to the manifest template:
loader.env.LD_PRELOAD = "/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
sgx.trusted_files.libtcmalloc = "file:/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4"
sgx.trusted_files.libunwind = "file:/usr/lib/x86_64-linux-gnu/libunwind.so.8"
Or, to preload mimalloc instead:
loader.env.LD_PRELOAD = "/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
sgx.trusted_files.libmimalloc = "file:/usr/local/lib/mimalloc-1.7/libmimalloc.so.1.7"
(A quick way to sanity-check that the preloaded allocator is actually picked up is sketched below.)
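As a possible sanity check for the LD_PRELOAD step (an editor's sketch; it assumes Graphene's /proc emulation exposes /proc/self/maps, which may not hold in every version), one can scan the process's memory maps for the allocator library from inside the Graphene-wrapped Python process:

# Run inside the Graphene-wrapped Python process.
with open("/proc/self/maps") as maps:
    loaded = any(("tcmalloc" in line) or ("mimalloc" in line) for line in maps)
print("alternate allocator mapped:", loaded)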
I've already applied these settings but found no effect.
Besides, I have discovered which part causes the extremely slow loading: it is from skimage import io.
I wrote a simple example to test:
import logging
logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(name)s: %(message)s')
logger = logging.getLogger(__name__)
logger.info("start import skimage.io")
from skimage import io
logger.info("finish import, start imread")
image = io.imread('/tcf/image.jpg')
logger.info("finish imread")
assert(image.shape[2] == 3)
and found that when running with graphene-sgx on bare metal, the import took about 40 minutes, see:
2021-09-16 09:27:24,769 INFO __main__: start import skimage.io
2021-09-16 10:07:15,932 INFO __main__: finish import, start imread
2021-09-16 10:07:15,999 INFO __main__: finish imread
I was wondering if there was a problem with my Python environment, so I packaged Graphene into an image and ran it in a container. However, in the container, when running with graphene-sgx, the import took about 10 minutes, see:
graphene-worker | 2021-09-16 10:34:55,266 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:44:50,683 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:44:50,711 INFO __main__: finish imread
Also, in the container, when running with graphene-direct, the import took about 30 seconds, see:
graphene-worker | 2021-09-16 10:32:38,944 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:33:06,027 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:33:06,054 INFO __main__: finish imread
In the container, when running without Graphene, i.e. directly with python3, the import took about 1 second, see:
graphene-worker | 2021-09-16 10:32:04,014 INFO __main__: start import skimage.io
graphene-worker | 2021-09-16 10:32:04,997 INFO __main__: finish import, start imread
graphene-worker | 2021-09-16 10:32:05,021 INFO __main__: finish imread
There's also a huge performance drop between running with and without SGX in the container.
To reproduce it in a container, here is the related file: graphene_docker.zip
Unzip it to /root/graphene_docker, then:
cd /root/graphene_docker/pytorch-fl
python3.6 -m venv venv
source venv/bin/activate
python3.6 -m pip install --upgrade pip
python3.6 -m pip install scikit-image
deactivate
cd /root/graphene_docker
docker-compose -f docker-compose-graphene.yaml build
To run with graphene-sgx: ./graphene-worker.sh start sgx
To run with graphene-direct: ./graphene-worker.sh start direct
To run without graphene: ./graphene-worker.sh start nographene
To stop the container: ./graphene-worker.sh stop
Based on my brief analysis with strace, this module import seems to be heavy on I/O operations (stat/read/fstat), hence the performance impact with Graphene-SGX.
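To narrow this down from inside Python as well (an editor's sketch, not from the thread; it works on the Python 3.6 venv used here), the import itself can be profiled and sorted by cumulative time, which should surface the ctypes and subprocess calls:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
from skimage import io  # the slow import under investigation
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative").print_stats(20)  # top 20 cumulative offenders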
@Kanakoo @svenkata9 Actually, what are the sizes of the *.manifest.template, *.manifest, and *.manifest.sgx files for your application? I recently learned that people sometimes have manifest.sgx files which are ~70MB in size (which obviously makes Graphene slow).
@dimakuv when running on bare metal, they are:
8.0K python.manifest
13M python.manifest.sgx
8.0K python.manifest.template
When running in the container, I remade the example, and they are:
8.0K python.manifest
1.3M python.manifest.sgx
8.0K python.manifest.template
Is 13M too large?
No, this is a normal value. And 1.3M in the container is a very good value.
So the manifest-file size is not the reason for the performance slowdown. I guess import skimage.io is just slow by itself because it performs a lot of file I/O (as @svenkata9 debugged). Unfortunately, there is probably not much that can be done here.
We could look into these performance bottlenecks, but this would be very low priority... We prioritize functionality issues in Graphene over performance for now.
Finally, I found out why "from skimage import io" is so slow in SGX: it indirectly calls ctypes to scan for dynamic libraries, which forks many processes. A workaround is to add the following code before from skimage import io to disable the scanning:
# ctypes workaround for graphene
import ctypes.util

def fl(name):
    raise PermissionError("No permission")

ctypes.util.find_library = fl
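A slightly more defensive variant of the same workaround (an editor's sketch, untested in this exact setup) stubs out find_library only around the skimage import and restores it afterwards, in case other code in the process legitimately needs it later:

import ctypes.util

_orig_find_library = ctypes.util.find_library

def _disabled_find_library(name):
    # Avoid the ldconfig/gcc subprocess scan, which forks extra processes under Graphene.
    raise PermissionError("ctypes.util.find_library disabled during skimage import")

ctypes.util.find_library = _disabled_find_library
from skimage import io  # import while the scan is disabled
ctypes.util.find_library = _orig_find_library  # restore the original behavior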
Seems this is resolved, closing.
Description of the problem
When running a UNet PyTorch training (on CPU) with graphene-sgx, there's a huge performance drop (about 18x) compared to running with graphene-direct or without Graphene.
Steps to reproduce
The issue was tested on the latest version v1.2-rc1 and on ICX. To reproduce it:
Here is the example code: pytorch-fl.zip. Unzip pytorch-fl.zip to graphene/Examples/pytorch-fl.
Install dependencies:
Run with graphene-sgx: install Graphene v1.2-rc1 and build Graphene with
make -j8 ISGX_DRIVER_PATH="$(your_driver_path)" SGX=1
The output logs:
Run with graphene-direct: build Graphene with
make -j8
The output logs:
Run outside Graphene:
The output logs:
Expected results
After graphene is started, the training should start soon.
The performance of running with graphene-sgx is expected to be no more than 20% lower than that without SGX.
Actual results
So what is the reason for the bad performance, and how can it be optimized?