gramineproject / graphene

Graphene / Graphene-SGX - a library OS for Linux multi-process applications, with Intel SGX support
https://grapheneproject.io
GNU Lesser General Public License v3.0

Huge performance drop when running pytorch training with graphene-sgx #2665

Closed. Kanakoo closed this issue 3 years ago.

Kanakoo commented 3 years ago

Description of the problem

When running a UNet PyTorch training (CPU only) with graphene-sgx, there is a huge performance drop (about 18x) compared to running with graphene-direct or without Graphene.

Steps to reproduce

The issue was reproduced on the latest version, v1.2-rc1, on an ICX machine. To reproduce it:

  1. Here is the example code: pytorch-fl.zip. Unzip it to graphene/Examples/pytorch-fl.

  2. Install dependencies:

    cd graphene/Examples/pytorch-fl
    python3.6 -m venv venv
    source venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install torchvision torch pillow scikit-image tqdm
    deactivate
  3. Run with graphene-sgx: install Graphene v1.2-rc1 and build it with make -j8 ISGX_DRIVER_PATH="$(your_driver_path)" SGX=1

    cd graphene/Examples/pytorch-fl
    make PYTHONPATH=/usr PYTHONVERSION=python3.6 SGX=1
    graphene-sgx python scripts/train.py

    The output logs:

    2021-08-26 10:57:58,518 INFO __main__: Start training
    2021-08-26 10:57:59,751 INFO __main__: start step
    2021-08-26 10:58:42,974 INFO __main__: end step, time spend: 43.220 s
    2021-08-26 10:58:43,234 INFO __main__: start step
    2021-08-26 10:59:47,293 INFO __main__: end step, time spend: 64.056 s
    2021-08-26 10:59:47,602 INFO __main__: start step
    2021-08-26 11:01:19,001 INFO __main__: end step, time spend: 91.396 s
    2021-08-26 11:01:19,253 INFO __main__: start step
    2021-08-26 11:02:51,807 INFO __main__: end step, time spend: 92.550 s
    2021-08-26 11:02:52,122 INFO __main__: start step
    2021-08-26 11:04:28,740 INFO __main__: end step, time spend: 96.614 s
    2021-08-26 11:04:28,995 INFO __main__: start step
    2021-08-26 11:06:06,286 INFO __main__: end step, time spend: 97.288 s
    2021-08-26 11:06:06,587 INFO __main__: start step
    2021-08-26 11:07:44,802 INFO __main__: end step, time spend: 98.212 s
    2021-08-26 11:07:45,051 INFO __main__: start step
    2021-08-26 11:09:23,343 INFO __main__: end step, time spend: 98.289 s
    2021-08-26 11:09:23,667 INFO __main__: start step
    2021-08-26 11:11:01,365 INFO __main__: end step, time spend: 97.695 s
    ...
  4. Run with graphene-direct: build Graphene with make -j8

    cd graphene/Examples/pytorch-fl
    make PYTHONPATH=/usr PYTHONVERSION=python3.6
    graphene-direct python scripts/train.py

    The output logs:

    2021-08-26 09:41:24,987 INFO __main__: Start training
    2021-08-26 09:41:25,192 INFO __main__: start step
    2021-08-26 09:41:29,154 INFO __main__: end step, time spend: 3.962 s
    2021-08-26 09:41:29,232 INFO __main__: start step
    2021-08-26 09:41:33,277 INFO __main__: end step, time spend: 4.046 s
    2021-08-26 09:41:33,352 INFO __main__: start step
    2021-08-26 09:41:37,429 INFO __main__: end step, time spend: 4.077 s
    2021-08-26 09:41:37,502 INFO __main__: start step
    2021-08-26 09:41:41,681 INFO __main__: end step, time spend: 4.179 s
    2021-08-26 09:41:41,756 INFO __main__: start step
    2021-08-26 09:41:45,904 INFO __main__: end step, time spend: 4.148 s
    2021-08-26 09:41:46,050 INFO __main__: start step
    2021-08-26 09:41:50,057 INFO __main__: end step, time spend: 4.007 s
    2021-08-26 09:41:50,137 INFO __main__: start step
    2021-08-26 09:41:53,964 INFO __main__: end step, time spend: 3.828 s
    2021-08-26 09:41:54,039 INFO __main__: start step
    2021-08-26 09:41:58,187 INFO __main__: end step, time spend: 4.148 s
    2021-08-26 09:41:58,301 INFO __main__: start step
    2021-08-26 09:42:02,346 INFO __main__: end step, time spend: 4.045 s
    2021-08-26 09:42:02,442 INFO __main__: start step
    2021-08-26 09:42:06,527 INFO __main__: end step, time spend: 4.084 s
    ...
  5. Run outside Graphene:

    cd graphene/Examples/pytorch-fl
    ln -s $(pwd)/scripts /tcf
    source venv/bin/activate
    python3 scripts/train.py
    deactivate

    The output logs:

    2021-08-26 17:45:25,516 INFO __main__: Start training
    2021-08-26 17:45:25,639 INFO __main__: start step
    2021-08-26 17:45:29,396 INFO __main__: end step, time spend: 3.757 s
    2021-08-26 17:45:29,471 INFO __main__: start step
    2021-08-26 17:45:33,713 INFO __main__: end step, time spend: 4.241 s
    2021-08-26 17:45:33,824 INFO __main__: start step
    2021-08-26 17:45:37,613 INFO __main__: end step, time spend: 3.788 s
    2021-08-26 17:45:37,702 INFO __main__: start step
    2021-08-26 17:45:41,628 INFO __main__: end step, time spend: 3.926 s
    2021-08-26 17:45:41,721 INFO __main__: start step
    2021-08-26 17:45:45,751 INFO __main__: end step, time spend: 4.030 s
    2021-08-26 17:45:45,858 INFO __main__: start step
    2021-08-26 17:45:50,380 INFO __main__: end step, time spend: 4.522 s
    2021-08-26 17:45:50,475 INFO __main__: start step
    2021-08-26 17:45:54,668 INFO __main__: end step, time spend: 4.193 s
    2021-08-26 17:45:54,775 INFO __main__: start step
    2021-08-26 17:45:59,003 INFO __main__: end step, time spend: 4.228 s
    2021-08-26 17:45:59,104 INFO __main__: start step
    2021-08-26 17:46:03,826 INFO __main__: end step, time spend: 4.722 s
    2021-08-26 17:46:03,922 INFO __main__: start step
    2021-08-26 17:46:08,465 INFO __main__: end step, time spend: 4.543 s
    2021-08-26 17:46:08,566 INFO __main__: start step
    2021-08-26 17:46:14,038 INFO __main__: end step, time spend: 5.472 s
    2021-08-26 17:46:14,142 INFO __main__: start step
    2021-08-26 17:46:18,560 INFO __main__: end step, time spend: 4.417 s
    ...

Expected results

  1. After Graphene is started, the training should start soon.

  2. The performance of running with graphene-sgx is expected to be within 20% of that without SGX.

Actual results

  1. When running with graphene-sgx, after Graphene starts, it takes quite a long time (about 1 hour) before the training starts.
  2. Each training step takes about 5 seconds when running without SGX, but about 90 seconds when running with SGX: a drop of nearly 18x.

So what is the reason for the poor performance, and how can it be optimized?

mkow commented 3 years ago

I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?

Kanakoo commented 3 years ago

I'm not familiar with PyTorch, but could you check how many threads it's using with and without Graphene? Also, what's your CPU configuration?

I ran cat /proc/pid/status | grep Threads and found it's using 49 threads without Graphene (see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
7388 root      20   0 23.572g 993216 133064 R  2933  0.4   2:43.90 python3

) and 51 threads with graphene-sgx (see:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
62493 root      20   0 32.299g 202800  34620 S  4720  0.1 177:41.91 loader

). In the manifest, sgx.thread_num is 512; setting it to 128 causes a "Please try to increase sgx.thread_num in the manifest" error.
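
For reference, a minimal sketch (using the public torch API; the value 24 is only an example, not the setting used here) to print and, if needed, cap the number of intra-op threads PyTorch uses, from inside train.py:

import torch

# How many intra-op threads PyTorch will use for CPU ops.
print("torch intra-op threads:", torch.get_num_threads())

# Optionally cap it, e.g. to the 24 physical cores of one socket.
torch.set_num_threads(24)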

There are 2 physical CPUs, each has 24 cores, and there are 96 logical CPUs. Here's a snippet of cat /proc/cpuinfo:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping        : 6
microcode       : 0xd0002a0
cpu MHz         : 2400.000
cache size      : 36864 KB
physical id     : 0
siblings        : 48
core id         : 0
cpu cores       : 24
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 27
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips        : 4800.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 57 bits virtual
power management:
...
processor       : 95
vendor_id       : GenuineIntel
cpu family      : 6
model           : 106
model name      : Intel(R) Xeon(R) Gold 6336Y CPU @ 2.40GHz
stepping        : 6
microcode       : 0xd0002a0
cpu MHz         : 2400.000
cache size      : 36864 KB
physical id     : 1
siblings        : 48
core id         : 23
cpu cores       : 24
apicid          : 111
initial apicid  : 111
fpu             : yes
fpu_exception   : yes
cpuid level     : 27
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs pml ept_mode_based_exec tsc_scaling
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs
bogomips        : 4802.23
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 57 bits virtual
power management:

Also, /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor is set to "performance".

aneessahib commented 3 years ago

What's your enclave size? There is an initial loading time for graphene-sgx depending on the enclave size. Also, do you see the same problem with graphene-direct?

Kanakoo commented 3 years ago

What's your enclave size? There is an initial loading time for graphene-sgx depending on the enclave size. Also, do you see the same problem with graphene-direct?

The enclave size is 32G; 16G causes a "Cannot allocate memory" error. Yes, the larger the enclave size, the longer the loading time, but a 32G enclave takes about 1 hour to load; is that normal?

Besides, graphene-direct works fine with the example; it runs as fast as without Graphene.

aneessahib commented 3 years ago

It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible.

Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.

Kanakoo commented 3 years ago

It could be a cumulative addition depending on how the test is written. But I was specifically referring to the statement below, which is quite possible.

Each training step takes about 5 seconds when running without sgx, but takes about 90 seconds when running with sgx. There is a drop of nearly 18 times.

So an 18x performance drop is common? I've enabled preheat, used tcmalloc, and also enabled performance mode; are there any other ways to optimize?

aneessahib commented 3 years ago

Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?

Kanakoo commented 3 years ago

Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?

The 90 seconds is just one training step, as below:

for batch_idx, (data, target) in enumerate(train_loader):
    logger.info("start step")
    start_time = time.time()
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = client_model(data)
    loss = soft_dice_loss(output, target)
    loss.backward()
    optimizer.step()
    end_time = time.time()
    time_spend = end_time - start_time
    logger.info("end step, time spend: %.3f s" % time_spend)

Each loop iteration takes about 90 seconds in SGX, but only 5 seconds without SGX. Also, from the moment graphene-sgx starts until the training starts, it takes 1 hour; is that also possible?
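
One way to narrow down where that hour goes would be to log a timestamp around each top-level import in train.py; a rough sketch (the module names below are placeholders):

import time

def stamp(msg):
    # flush so the line shows up immediately in Graphene's output
    print("%s %s" % (time.strftime("%Y-%m-%d %H:%M:%S"), msg), flush=True)

stamp("importing torch")
import torch
stamp("importing torchvision")
import torchvision
stamp("imports done")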

mkow commented 3 years ago

Taking 90 seconds to start big enclaves is possible. Was this properly measured to be 90 seconds?

Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX. @Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?

mkow commented 3 years ago

Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.

Kanakoo commented 3 years ago

Also, one optimization you can do is to ensure that all the threads end up on the same physical CPU (with numactl). But I don't remember the exact invocation.

Do you mean running the command "numactl --cpunodebind=0 --membind=0 graphene-sgx python scripts/train.py"? I've tried it, and the performance of each step improves by about 10 seconds; see the logs:

2021-08-31 09:46:23,137 INFO __main__: Start training
2021-08-31 09:46:24,584 INFO __main__: start step
2021-08-31 09:46:55,982 INFO __main__: end step, time spend: 31.396 s
2021-08-31 09:46:56,234 INFO __main__: start step
2021-08-31 09:47:56,709 INFO __main__: end step, time spend: 60.472 s
2021-08-31 09:47:56,931 INFO __main__: start step
2021-08-31 09:49:12,775 INFO __main__: end step, time spend: 75.841 s
2021-08-31 09:49:13,015 INFO __main__: start step
2021-08-31 09:50:40,776 INFO __main__: end step, time spend: 87.758 s
2021-08-31 09:50:41,036 INFO __main__: start step
2021-08-31 09:52:04,008 INFO __main__: end step, time spend: 82.969 s
2021-08-31 09:52:04,240 INFO __main__: start step
2021-08-31 09:53:28,593 INFO __main__: end step, time spend: 84.350 s
2021-08-31 09:53:28,842 INFO __main__: start step
2021-08-31 09:54:53,361 INFO __main__: end step, time spend: 84.517 s
2021-08-31 09:54:53,609 INFO __main__: start step
2021-08-31 09:56:20,133 INFO __main__: end step, time spend: 86.521 s
...

It's still a huge performance drop compared to not using sgx. And after graphene starts, it still takes about 1 hour to start training.

Kanakoo commented 3 years ago

Wait, is each step spawning a new process? If so, then 90s with this size is likely, but it sounds pretty bad to start so many processes, even without SGX. @Kanakoo: The snippet you pasted doesn't look like it does, but maybe something underneath these library calls is actually spawning processes?

I only know that DataLoader starts multiple worker processes when num_workers > 0. But specifying multiple workers causes a "Function Not Implemented" error in Graphene, so I've set it to 0, see:

training_set = dataloader.KvasirDataset(DATA_PATH, collaborator_count=1, collaborator_num=0, is_validation=False)
train_loader = DataLoader(training_set, num_workers=0, batch_size=6, shuffle=True)

which means it only starts one process. And the loop for batch_idx, (data, target) in enumerate(train_loader): itself runs fast, as shown in the logs.
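
To double-check that, a quick sketch (reusing train_loader from the snippet above) that iterates the loader without any model computation, so batch loading and decoding are timed on their own:

import time

start = time.time()
for batch_idx, (data, target) in enumerate(train_loader):
    pass  # fetch and decode batches only, no forward/backward pass
print("data loading only: %.3f s" % (time.time() - start))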

dimakuv commented 3 years ago

@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file, and attach this file here. So we can take a closer look at what Graphene tries to do, and where the performance bottleneck may come from.

aneessahib commented 3 years ago

We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end; I will send you the details separately.

Kanakoo commented 3 years ago

We tried your workload at our end using the same manifest files and only see about 20% overhead. This is probably a setup issue at your end; I will send you the details separately.

Thanks Anees! I've seen that. Jitender found that the PRMRR size of my machine was too small; after I increased it to the maximum of 64GB and added some optimizations, including openmpi and libiomp5.so, the performance with SGX is even a little bit better than without SGX! See:

error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production!
/venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
2021-09-01 16:18:36,486 INFO __main__: Start training
2021-09-01 16:18:37,478 INFO __main__: start step
2021-09-01 16:18:50,832 INFO __main__: end step, time spend: 13.353 s
2021-09-01 16:18:50,928 INFO __main__: start step
2021-09-01 16:18:55,652 INFO __main__: end step, time spend: 4.724 s
2021-09-01 16:18:55,770 INFO __main__: start step
2021-09-01 16:18:58,442 INFO __main__: end step, time spend: 2.672 s
2021-09-01 16:18:58,530 INFO __main__: start step
2021-09-01 16:19:01,091 INFO __main__: end step, time spend: 2.561 s
2021-09-01 16:19:01,179 INFO __main__: start step
2021-09-01 16:19:03,866 INFO __main__: end step, time spend: 2.687 s
2021-09-01 16:19:03,956 INFO __main__: start step
2021-09-01 16:19:06,010 INFO __main__: end step, time spend: 2.053 s
2021-09-01 16:19:06,100 INFO __main__: start step
2021-09-01 16:19:08,168 INFO __main__: end step, time spend: 2.068 s
...

However, it still takes about 1 hour before the training starts, i.e. between the log line error: Using insecure argv source. Graphene will continue application execution, but this configuration must not be used in production! and the log line /venv/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: ... On your machine, it didn't take that long.
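
(For reference, the OpenMP-related tuning mentioned above is typically driven by environment variables that have to be set before torch is imported; a rough sketch with example values, not the exact settings used here:)

import os

# Example values only: limit the OpenMP worker pool and pin threads so they
# stay on one socket's cores (libiomp5 reads KMP_AFFINITY).
os.environ.setdefault("OMP_NUM_THREADS", "24")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

import torch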

mkow commented 3 years ago

I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?

Kanakoo commented 3 years ago

@Kanakoo Could you run graphene-sgx with trace log level (set loader.log_level = "trace" in your manifest file)? This will produce a lot of debug information. Redirect all the resulting output to some file, and attach this file here. So we can take a closer look at what Graphene tries to do, and where the performance bottleneck may come from.

I've run with trace level and got the log. The whole log file is too large (about 1.7G), so I split off 100M and compressed it here: pytorch-log.zip

Kanakoo commented 3 years ago

I guess this is how much time preheat takes for such a huge enclave? Can you try disabling it?

I've tried disabling preheat and found no effect.

svenkata9 commented 3 years ago

@Kanakoo You can try a couple of things. Please refer to PR 2530. Before running the workload:

Pre-System setting

Linux systems have a CPU frequency scaling governor that scales the CPU frequency to achieve the best performance or to save power, depending on the requirement. To achieve the best performance, please set the CPU frequency scaling governor to performance mode.

for ((i=0; i<$(nproc); i++)); do echo 'performance' > /sys/devices/system/cpu/cpu$i/cpufreq/scaling_governor; done

Performance considerations

Kanakoo commented 3 years ago

@Kanakoo You can try a couple of things. Please refer to PR 2530.

I've already applied these settings but found no effect.

Besides, I have discovered which part causes the extremely slow loading: from skimage import io. I wrote a simple example to test:

import logging
logging.basicConfig(level=logging.INFO,format='%(asctime)s %(levelname)s %(name)s: %(message)s')
logger = logging.getLogger(__name__)
logger.info("start import skimage.io")

from skimage import io
logger.info("finish import, start imread")
image = io.imread('/tcf/image.jpg')
logger.info("finish imread")
assert(image.shape[2] == 3)

and found that when running with graphene-sgx on bare metal, it took about 40 minutes to import (see the timestamps below):

2021-09-16 09:27:24,769 INFO __main__: start import skimage.io
2021-09-16 10:07:15,932 INFO __main__: finish import, start imread
2021-09-16 10:07:15,999 INFO __main__: finish imread

I was wondering if there was a problem with my Python environment, so I packaged Graphene into an image and ran it in a container. However, in the container, when running with graphene-sgx, it took about 10 minutes to import, see:

graphene-worker    | 2021-09-16 10:34:55,266 INFO __main__: start import skimage.io
graphene-worker    | 2021-09-16 10:44:50,683 INFO __main__: finish import, start imread
graphene-worker    | 2021-09-16 10:44:50,711 INFO __main__: finish imread

Also, in the container, when running with graphene-direct, it took about 30 seconds to import, see:

graphene-worker    | 2021-09-16 10:32:38,944 INFO __main__: start import skimage.io
graphene-worker    | 2021-09-16 10:33:06,027 INFO __main__: finish import, start imread
graphene-worker    | 2021-09-16 10:33:06,054 INFO __main__: finish imread

In the container, when running without Graphene, i.e. with plain python3, it took about 1 second to import, see:

graphene-worker    | 2021-09-16 10:32:04,014 INFO __main__: start import skimage.io
graphene-worker    | 2021-09-16 10:32:04,997 INFO __main__: finish import, start imread
graphene-worker    | 2021-09-16 10:32:05,021 INFO __main__: finish imread

There's also a huge performance drop between running with and without SGX in the container.

To reproduce it in a container, here is the related file: graphene_docker.zip

Unzip it to /root/graphene_docker, then:

    cd /root/graphene_docker/pytorch-fl
    python3.6 -m venv venv
    source venv/bin/activate
    python3.6 -m pip install --upgrade pip
    python3.6 -m pip install scikit-image
    deactivate
    cd /root/graphene_docker
    docker-compose -f docker-compose-graphene.yaml build

To run with graphene-sgx: ./graphene-worker.sh start sgx
To run with graphene-direct: ./graphene-worker.sh start direct
To run without graphene: ./graphene-worker.sh start nographene
To stop the container: ./graphene-worker.sh stop

svenkata9 commented 3 years ago

Based on my quick analysis with strace, this module import seems to be heavy on I/O operations (stat/read/fstat), hence the perf impact with Graphene-SGX.

dimakuv commented 3 years ago

@Kanakoo @svenkata9 Actually, what are the sizes of the *.manifest.template, *.manifest, and *.manifest.sgx files for your application? I recently learned that people sometimes have manifest.sgx files which are ~70MB in size (which obviously makes Graphene slow).

Kanakoo commented 3 years ago

@dimakuv when running on bare metal, they are:

8.0K    python.manifest
13M     python.manifest.sgx
8.0K    python.manifest.template

when running in the container, I remade the example, and they are:

8.0K    python.manifest
1.3M    python.manifest.sgx
8.0K    python.manifest.template

Is 13M too large?

dimakuv commented 3 years ago

Is 13M too large?

No, this is a normal value. And 1.3M in the container is a very good value.

So yes, the manifest-file size is not the reason for the performance slowdown. I guess it's simply that import skimage.io is slow by itself because it performs a lot of file I/O (as @svenkata9 debugged). Unfortunately, there is probably not much that can be done here.

We could look into these performance bottlenecks, but this would be very low priority... We prioritize functionality issues in Graphene over performance for now.

qqingli commented 3 years ago

Finally I found why from skimage import io is so slow in SGX: it indirectly calls ctypes to scan dynamic libraries and forks many processes. A workaround is to add this code before from skimage import io to disable the scanning.

# ctypes workaround for graphene
import ctypes.util
def fl(name):
    raise PermissionError("No permission")
ctypes.util.find_library = fl
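
Applied to the earlier skimage timing script, the workaround goes at the very top, before the import that triggers the scanning; a sketch combining the snippets above:

# Disable ctypes' dynamic-library scanning, which spawns helper processes
# (e.g. ldconfig) and is extremely slow under Graphene-SGX.
import ctypes.util

def fl(name):
    raise PermissionError("No permission")

ctypes.util.find_library = fl

from skimage import io  # now imports without scanning for shared libraries

image = io.imread('/tcf/image.jpg')
assert image.shape[2] == 3
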
mkow commented 3 years ago

Seems this is resolved, closing.