Open CasellaJr opened 1 year ago
Hey @CasellaJr! Thanks a lot for raising this issue. I've taken some time to review the thread you have going with the Gramine team, and I think there are a few things we can try to get PyTorch to respect the thread limit and improve performance round over round. First, the docker environment variables you've set are not being passed to Gramine. This is by design. To set the number of threads in the enclave, you'll need to modify the gramine manifest. So instead of setting:
ENV OMP_NUM_THREADS=40
ENV OPENBLAS_NUM_THREADS=40
ENV MKL_NUM_THREADS=40
ENV VECLIB_MAXIMUM_THREADS=40
ENV NUMEXPR_NUM_THREADS=40
in the Dockerfile, you should add the following to the gramine manifest template:
loader.env.OMP_NUM_THREADS = "40"
loader.env.OPENBLAS_NUM_THREADS = "40"
loader.env.MKL_NUM_THREADS = "40"
loader.env.VECLIB_MAXIMUM_THREADS = "40"
loader.env.NUMEXPR_NUM_THREADS = "40"
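To confirm that these values actually reach the process inside the enclave, a quick check along these lines could be added at collaborator startup (a minimal sketch; it only prints what the environment and PyTorch report):
import os
import torch

# Print the thread-related variables as seen from inside the enclave.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    print(f"{var}={os.environ.get(var)}")

# PyTorch's own view of its intra-op thread pool.
print(f"torch.get_num_threads()={torch.get_num_threads()}")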
Yesss, you are right! Thank you! I am fixing this, then I will run the experiment and tomorrow I will update you with the results. I will edit this post 😄
Hello @psfoley
I changed the gramine manifest as you suggested. So, in theory, everything should be OK (apart from Gramine needing to be built with the patched libgomp).
However, the slowdown is still present. I attach here 4 txt files containing the logs of the aggregator and the 3 collaborators. As you can see, the problem in my opinion is collaborator2. Indeed, the "Waiting for tasks" lines
of collaborators 1 and 3 increase round after round, because they are waiting for collaborator2...
aggregator.txt
collaborator1.txt
collaborator2.txt
collaborator3.txt
There are 4 machines (one for each entity), and they are exactly the same...
These are the specifics of the 4 machines:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 160
On-line CPU(s) list: 0-159
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
CPU family: 6
Model: 106
Thread(s) per core: 2
Core(s) per socket: 40
Socket(s): 2
Stepping: 6
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4600.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust sgx bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid sgx_lc fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 3.8 MiB (80 instances)
L1i: 2.5 MiB (80 instances)
L2: 100 MiB (80 instances)
L3: 120 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-39,80-119
NUMA node1 CPU(s): 40-79,120-159
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Srbds: Not affected
Tsx async abort: Not affected
Hello! I finally completed an experiment in more or less 3 hours. Before the discussion with you and the Gramine team, this experiment took about 25 hours... because the last rounds were around 30 minutes per round 🤣
Now, as you can see, it seems OK. There is still a slowdown; maybe when I use Gramine with the patched libgomp I will fix this too. However, I think the main problem is solved. The previous days' results were not good because, I think, one of the four machines was not working well, causing a heavy slowdown.
Thank you again.
Hello, I am still doing FL experiments with OpenFL-PyTorch-Gramine-SGX. As you can see in the previous image, there is still a slowdown during training that, in my opinion, is not attributable to Gramine. Here are my observations:
Simple collaborator-aggregator experiment with OpenFL: I found that also in this case I have a slowdown, from 1 s/it to 2 s/it. This behavior is totally unpredictable: sometimes it happens after a few rounds in just one collaborator, then, after another few rounds, it returns to 1 s/it, and so on. I am sure I am the only user of those machines, so the resources are allocated only to me. I am doing FL experiments with 4 different machines: 1 for the aggregator and 3 for the collaborators, dataset MNIST. The slowdowns are random among the 3 collaborators. What I found is that if I fix here the environment variables we were discussing in the previous posts (see the sketch at the end of this post), then the slowdown is mitigated: if for the first rounds training takes around 60 seconds per round, it then increases to 65-66 seconds round after round. So, basically, OpenFL (or PyTorch) also suffers from the thread issue, but there is still some other problem that causes a slowdown.
This slowdown is amplified when using Gramine/SGX. Indeed, even if the threads are fixed inside the Docker image and in every Python script, the slowdown is still there. In this case it can be really problematic, because I think Gramine amplifies this behavior a lot: from 1.5 s/it, it sometimes reaches 3-4 s/it.
I know this is extremely complicated because the pattern is unpredictable and not reproducible. However, do you have some hints on what the problem can be? I think an end-to-end debugging of OpenFL may be necessary to spot the cause of this behavior. Confidential computing can create more problems than it solves, because it protects data but the underlying technology can have heavy limitations. I think that in order to have a major adoption of confidential computing, these problems have to be eliminated... I know that is not an easy task...
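For reference, this is roughly what I mean by fixing the threads inside the Python code (a minimal sketch; the value 40 and placing it at the top of the collaborator entry point are assumptions, any code that runs before torch is imported should work):
import os

# The OpenMP / BLAS runtimes read these variables when they initialize their
# thread pools, so they must be set before torch (or numpy) is imported.
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS",
            "VECLIB_MAXIMUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "40"

import torch

# Cap PyTorch's own intra-op thread pool explicitly as well.
torch.set_num_threads(40)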
Hi @CasellaJr! If this scenario is still relevant, please consider re-testing with the revamped graminized containers for OpenFL. In addition to the usability improvements (e.g. SGX-ready by default), they also include an updated version of Gramine.
Hi community!
I am running secure FL experiments with OpenFL, Gramine and SGX. Problem: training time increases (a lot!) round after round. I am talking about going from roughly 3 minutes for the first round to more than 30 minutes per round after 100 rounds...
Solution: set the number of threads used during training (this prevents the slowdown) and use Gramine built with the patched libgomp (better performance in general). This solution has been discussed in depth in this Gramine issue: link
Now, in order to do secure FL experiments on SGX, I am following your manual to run OpenFL with Docker and Gramine. The Docker process respects the number of threads that I decided. If I set 40 threads:
(yes, I know, I know, it shows 35 and not 40; I do not know why). However, when I check the number of threads of the openfl process:
you can see that it changes very, very frequently (every second)... Now, I have done this: inside the Dockerfile.gramine I have set the thread-related environment variables (OMP_NUM_THREADS, OPENBLAS_NUM_THREADS, MKL_NUM_THREADS, VECLIB_MAXIMUM_THREADS, NUMEXPR_NUM_THREADS, all set to 40).
I have also put this inside the files containing the training and validation functions, such as
runner_pt.py
and pt_cnn.py
and PyTorchMNISTInMemory.py
However, the number of threads continues to change as I showed you before. What am I missing? Is there any other Python script where I can set the number of threads?
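For anyone who wants to reproduce the observation above, this is roughly how I watch the thread count of the openfl process over time (a minimal sketch for Linux; the PID below is a placeholder for the collaborator's process id):
import time

def thread_count(pid):
    # Read the "Threads:" field from /proc/<pid>/status.
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("Threads:"):
                return int(line.split()[1])

pid = 12345  # placeholder: the collaborator's PID (e.g. found with `pgrep -f openfl`)
while True:
    print(time.strftime("%H:%M:%S"), thread_count(pid))
    time.sleep(1)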