mreineck / ducc

Fork of https://gitlab.mpcdf.mpg.de/mtr/ducc to simplify external contributions

sht.experimental seems to not be able to calculate the number of available hardware cores #23

Closed · zatkins2 closed this 7 months ago

zatkins2 commented 8 months ago

Hi Martin,

Apologies, I am long overdue on a response to my other question about lmax!

This is a bit odd, but the ducc.sht.experimental functions can't seem to infer the number of available hardware cores when I pass nthreads=0: the computation runs on just 1 core. The ducc.fft functions, on the other hand, handle nthreads=0 correctly.

I am using ducc 0.32 with a wheel built on my machine. The node has the following hardware:

zatkins@della8:~$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  1
Core(s) per socket:  16
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Gold 6246R CPU @ 3.40GHz
Stepping:            7
CPU MHz:             4100.000
CPU max MHz:         4100.0000
CPU min MHz:         1200.0000
BogoMIPS:            6800.00
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
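
For concreteness, this is roughly the comparison I'm making. A minimal sketch only; I'm quoting the synthesis_2d keywords from memory, so they may need small adjustments:

```python
import numpy as np
import ducc0

lmax = 2047
nalm = (lmax + 1) * (lmax + 2) // 2
alm = np.zeros((1, nalm), dtype=np.complex128)

# nthreads=0 is documented to mean "use all available hardware threads",
# but for me this runs on a single core:
m = ducc0.sht.experimental.synthesis_2d(
    alm=alm, lmax=lmax, spin=0, geometry="GL",
    ntheta=lmax + 1, nphi=2 * lmax + 1, nthreads=0)

# ... whereas the same setting in ducc0.fft saturates all 32 cores:
a = np.random.rand(4096, 4096)
fa = ducc0.fft.r2c(a, axes=(0, 1), nthreads=0)
```
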
mreineck commented 8 months ago

Hi Zach,

I just tested this by switching nthreads=1 to nthreads=0 in python/demos/sht_demo.py, and I see a clear increase in speed. Do you observe the problem in a specific function?

Ah, and is this happening in code that also uses OpenMP? If so, there might be some subtle interaction with OpenMP's thread pinning ...
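
In case it helps, here is a stripped-down version of the timing check I did; this is just the idea, not the actual demo code:

```python
import time
import numpy as np
import ducc0

lmax = 2047
nalm = (lmax + 1) * (lmax + 2) // 2
rng = np.random.default_rng(42)
alm = (rng.standard_normal((1, nalm))
       + 1j * rng.standard_normal((1, nalm))).astype(np.complex128)

# With working thread-count detection, nthreads=0 should be
# noticeably faster than nthreads=1 on a multi-core machine.
for nthreads in (1, 0):
    t0 = time.perf_counter()
    ducc0.sht.experimental.synthesis_2d(
        alm=alm, lmax=lmax, spin=0, geometry="GL",
        ntheta=lmax + 1, nphi=2 * lmax + 1, nthreads=nthreads)
    print(f"nthreads={nthreads}: {time.perf_counter() - t0:.3f} s")
```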

zatkins2 commented 8 months ago

I was using sht.experimental.adjoint_synthesis, but I imagine it shares the thread-count logic with the rest of the module. Passing nthreads=x for any x > 1 did use the requested number of threads, which is why I suspect the issue is in determining ducc0_max_threads (though I can't imagine why that would fail here but work for fft; in that case I was using fft.r2c).

Hm, I suppose I have only tested in code or a shell that had previously imported pixell.curvedsky, which does indeed import its own compiled modules that use OpenMP (e.g. cmisc here). But the same goes for the fft.r2c call.

I should add that I am testing without having set the OMP_NUM_THREADS or DUCC0_NUM_THREADS environment variables, though from looking at ducc0_max_threads, I think that should still be fine when nthreads=0.
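
For reference, my rough mental model of the lookup; this is a guess, not ducc's actual C++ logic:

```python
import os

def guessed_ducc0_max_threads():
    # My guess at the fallback behaviour -- NOT ducc's actual code.
    # If DUCC0_NUM_THREADS is unset, the hardware core count should win.
    env = os.environ.get("DUCC0_NUM_THREADS")
    if env is not None:
        return int(env)
    return os.cpu_count()  # 32 on the node above

# nthreads=0 should then resolve to guessed_ducc0_max_threads(),
# i.e. 32 here, even with neither environment variable set.
print(guessed_ducc0_max_threads())
```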

mreineck commented 8 months ago

Ah, I think I see the problem ... I might have been too clever for my own good in a place or two! Should be easy to fix; I'll let you know when I have an update.

mreineck commented 8 months ago

Should now be fixed on HEAD - thanks for reporting this!

zatkins2 commented 8 months ago

Nice!

mreineck commented 8 months ago

BTW, the functions in ducc0.sht.experimental are now also available directly in ducc0.sht, since I think they have matured sufficiently by now.

zatkins2 commented 8 months ago

Ah right, ok -- does that mean we should expect only sht functions (not sht.experimental functions) to be supported going forward?

The main way folks in ACT/SO interface with ducc is through pixell.curvedsky, which wraps the low-level ducc functions. Currently the wrappers point to sht.experimental, so I guess we would need to update that and add a ducc>=0.33 dependency.
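
Something along these lines would presumably let pixell support both layouts during the transition (a hypothetical sketch, not actual pixell code):

```python
# Prefer the promoted location (ducc0 >= 0.33) and fall back to
# sht.experimental for older releases.
try:
    from ducc0.sht import adjoint_synthesis
except ImportError:
    from ducc0.sht.experimental import adjoint_synthesis
```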

mreineck commented 8 months ago

There's no need to hurry ... I expect to keep the functions around in experimental for at least another year.

mreineck commented 7 months ago

I released version 0.33 yesterday. Could you please verify that this version uses the correct number of threads?

zatkins2 commented 7 months ago

It works as expected!