securefederatedai / openfl

An Open Framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0

Need to set total number of threads in OpenFL to prevent slowdown in SGX #790

Open CasellaJr opened 1 year ago

CasellaJr commented 1 year ago

Hi community!

I am running secure FL experiments with OpenFL, Gramine and SGX. Problem: training time increases (a lot!) round after round, from about 3 minutes in the first round to more than 30 minutes after 100 rounds...

Solution: set the number of threads used during training (this prevents the slowdown) and use Gramine built with a patched libgomp (better performance in general). This solution has been discussed in depth in this Gramine issue: link

Now, in order to do secure FL experiments on SGX, I am following your manual for running OpenFL with Docker and Gramine. The Docker process respects the number of threads that I set; if I ask for 40 threads:

ps -T -p 1875202 | wc
     35     175    1432

(yes, I know it says 35 and not 40, I do not know why). However, when I check the number of threads of the openfl process:

user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
     89     445    3646
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    140     700    5737
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    147     735    6024
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    107     535    4384
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    118     590    4835
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    142     710    5819
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    109     545    4466
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    130     650    5327
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    147     735    6024
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    147     735    6024
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    146     730    5983
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    122     610    4999
user1@atsnode9:~/bruno/mnist/3colmnist$ ps -T -p 857291 | wc
    111     555    4548

You can see that it changes very frequently (every second)... Now, I have done the following: inside Dockerfile.gramine I have set these environment variables:

ENV OMP_NUM_THREADS=40
ENV OPENBLAS_NUM_THREADS=40
ENV MKL_NUM_THREADS=40
ENV VECLIB_MAXIMUM_THREADS=40
ENV NUMEXPR_NUM_THREADS=40

I have also put this inside the files containing the training and validation functions, such as runner_pt.py, pt_cnn.py and PyTorchMNISTInMemory.py:

os.environ["OMP_NUM_THREADS"] = "40" # export OMP_NUM_THREADS=40
os.environ["OPENBLAS_NUM_THREADS"] = "40" # export OPENBLAS_NUM_THREADS=40
os.environ["MKL_NUM_THREADS"] = "40" # export MKL_NUM_THREADS=40
os.environ["VECLIB_MAXIMUM_THREADS"] = "40" # export VECLIB_MAXIMUM_THREADS=40
os.environ["NUMEXPR_NUM_THREADS"] = "40" # export NUMEXPR_NUM_THREADS=40

However, the number of threads continues to change, as I showed you before. What am I missing? Is there any other Python script where I can set the number of threads?

psfoley commented 1 year ago

Hey @CasellaJr! Thanks a lot for raising this issue. I've taken some time to review the thread you have going with the Gramine team, and I think there are a few things we can try to get PyTorch to respect the thread limit and improve performance round over round. First, the Docker environment variables you've set are not being passed to Gramine. This is by design. To set the number of threads in the enclave, you'll need to modify the Gramine manifest. So instead of setting:

ENV OMP_NUM_THREADS=40
ENV OPENBLAS_NUM_THREADS=40
ENV MKL_NUM_THREADS=40
ENV VECLIB_MAXIMUM_THREADS=40
ENV NUMEXPR_NUM_THREADS=40

in the Dockerfile, you should add the following to the gramine manifest template:

loader.env.OMP_NUM_THREADS = "40"
loader.env.OPENBLAS_NUM_THREADS = "40"
loader.env.MKL_NUM_THREADS = "40"
loader.env.VECLIB_MAXIMUM_THREADS = "40"
loader.env.NUMEXPR_NUM_THREADS = "40"
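As a sanity check (a minimal sketch, assuming you can run an arbitrary Python script inside the graminized container), you can confirm that the values from the manifest actually reach the enclave:

import os

# Print the caps as seen from inside the enclave; the values should match the
# loader.env.* entries in the Gramine manifest.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    print(var, "=", os.environ.get(var, "<unset>"))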

CasellaJr commented 1 year ago

Yesss, you are right! Thank you! I am fixing this, then I will run the experiment and tomorrow I will update you with the results. I will edit this post 😄

CasellaJr commented 1 year ago

Hello @psfoley, I changed the Gramine manifest as you suggested, so in theory everything should be OK (apart from Gramine, which still needs to be built with the patched libgomp). However, the slowdown is still present. I attach here 4 txt files containing the logs of the aggregator and the 3 collaborators. As you can see, the problem in my opinion is the collaborator. Indeed, the "Waiting for tasks" lines of collaborators 1 and 3 increase round after round, because they are waiting for collaborator2... aggregator.txt collaborator1.txt collaborator2.txt collaborator3.txt

There are 4 machines (one for each entity), and they are exactly the same...

These are the specs of the 4 machines:

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         52 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  160
  On-line CPU(s) list:   0-159
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  40
    Socket(s):           2
    Stepping:            6
    CPU max MHz:         3400.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4600.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   3.8 MiB (80 instances)
  L1i:                   2.5 MiB (80 instances)
  L2:                    100 MiB (80 instances)
  L3:                    120 MiB (2 instances)
NUMA:
  NUMA node(s):          2
  NUMA node0 CPU(s):     0-39,80-119
  NUMA node1 CPU(s):     40-79,120-159
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

CasellaJr commented 1 year ago

Hello! I finally completed an experiment in roughly 3 hours. Before the discussion with you and the Gramine team, this experiment took about 25 hours... because the last rounds were around 30 minutes per round 🤣

(attached image "tempi": per-round training times)

Now, as you can see, it seems OK. There is still a slowdown; maybe when I use Gramine with the patched libgomp I will fix this as well. However, I think I have solved the main problem. The results from the previous days were not good because, I think, one of the four machines was not working well, causing a heavy slowdown.

Thank you again.

CasellaJr commented 1 year ago

Hello, I am still doing FL experiments with OpenFL-PyTorch-Gramine-SGX. As you can see in the previous image, there is still a slowdown during training that, in my opinion, is not attributable to Gramine. Here are my observations:

  1. Simple collaborator-aggregator experiment with OpenFL: I found that also in this case I have a slowdown, from 1 s/it to 2 s/it. This behavior is totally unpredictable: sometimes it happens after a few rounds in just one collaborator, then, after another few rounds, it returns to 1 s/it, and so on. I am sure I am the only user on those machines, so the resources are allocated only to me. I am doing FL experiments with 4 different machines: 1 for the aggregator and 3 for the collaborators, dataset MNIST. The slowdowns appear randomly among the 3 collaborators. What I found is that if I set here the environment variables we were discussing in the previous posts, the slowdown is mitigated: if training takes around 60 seconds per round in the first rounds, it increases to 65-66 seconds in later rounds. So, basically, OpenFL (or PyTorch) also suffers from the thread issue, but there is still some other problem that causes the slowdown.

  2. This slowdown is amplified when using Gramine/SGX. Indeed, even if the threads are fixed inside the Docker image and in every Python script, the slowdown is still there. In this case it can be really problematic, because I think Gramine amplifies this behavior a lot: from 1.5 s/it, it sometimes reaches 3-4 s/it.

I know this is extremely complicated because the pattern is unpredictable and not reproducible. However, do you have any hints on what the problem could be? I think an end-to-end debugging of OpenFL may be necessary to spot the cause of this behavior. Confidential computing can create more problems than it solves: it protects data, but the underlying technology can have heavy limitations. I think that in order to achieve wider adoption of confidential computing, these problems have to be eliminated... I know that is not an easy task...
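One way to narrow this down (a minimal sketch, assuming /proc is readable inside the container; timed_round and its arguments are hypothetical names, not OpenFL API) is to wrap the per-round training call and log its duration together with the OS-level thread count, so slow rounds can be correlated with thread-pool growth:

import time

def os_thread_count():
    # Total OS-level threads of this process (includes OpenMP/BLAS pools),
    # i.e. the same number that `ps -T` reports.
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("Threads:"):
                return int(line.split()[1])
    return -1

def timed_round(round_num, train_fn, *args, **kwargs):
    # Hypothetical wrapper around the collaborator's per-round training call.
    start = time.perf_counter()
    result = train_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"round {round_num}: {elapsed:.1f} s, {os_thread_count()} OS threads")
    return result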

teoparvanov commented 1 week ago

Hi @CasellaJr! If this scenario is still relevant, please consider re-testing with the revamped graminized containers for OpenFL. In addition to the usability improvements (e.g. SGX-ready by default), they also include an updated version of Gramine.