Closed herumi closed 4 years ago
I just tested the latest version and noticed that you rewrote many implementations. Some operations in the latest version are slightly (about 5%) slower than in the original version, especially EcT::dbl and EcT::add.
Why? Maybe I missed something, for example some compiler options?
Thank you for testing. Could you please tell me your OS, CPU, compiler, compile options, and which old version you compared against?
Thanks for your reply. OS: Ubuntu18 Compiler: gcc7.3 CPU: 2* (Xeon 2.4G, 12core) The test code uses just 1 thread and only calls EcT::dbl and EcT::add.
I did not set any special compile options (only -DMCL_DONT_USE_OPENSSL). The old version is:
commit 17d2b0c18dc501c7fc7c9ff311572e08a8699ded
Author: MITSUNARI Shigeo herumi@nifty.com
Date: Mon Jun 24 05:24:26 2019 +0900
[java] jni for BN254(old)
I wrote a simple test program, which looks like this:
bool TestEccDbl(int64_t n) {
  Tick tick(__FN__);
  G1 g = G1Rand();
  for (int64_t i = 0; i < n; ++i) {
    G1::dbl(g, g);
  }
  return g != G1Zero(); // use the result; otherwise the compiler might optimize away the whole loop
}
bool TestEccAdd(int64_t n) {
  Tick tick(__FN__);
  G1 g = G1Rand();
  G1 h = G1Rand();
  for (int64_t i = 0; i < n; ++i) {
    g = g + h;
  }
  return g != G1Zero();
}
bool Test(int64_t n) {
  return TestEccDbl(n) && TestEccAdd(n);
}
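The Tick helper and the __FN__ macro used above are not shown in this thread. As a rough sketch only, a scope timer of that kind could look like the following. The class layout, the output format, and the assumption that __FN__ behaves like __func__ are all guesses for illustration, and SpinWork is a dummy workload that stands in for the G1 loops so the sketch does not depend on mcl.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical stand-in for the Tick helper: prints a banner on entry and
// the elapsed wall-clock time in milliseconds when it goes out of scope.
class Tick {
 public:
  explicit Tick(std::string name, std::ostream& os = std::cout)
      : name_(std::move(name)), os_(os),
        start_(std::chrono::steady_clock::now()) {
    os_ << "==> " << name_ << " <==\n";
  }
  Tick(const Tick&) = delete;
  Tick& operator=(const Tick&) = delete;
  ~Tick() {
    const auto elapsed = std::chrono::steady_clock::now() - start_;
    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
    os_ << name_ << " tick: " << ms << " ms\n";
  }

 private:
  std::string name_;
  std::ostream& os_;
  std::chrono::steady_clock::time_point start_;
};

// Dummy workload standing in for the G1::dbl / G1::add loops.
std::uint64_t SpinWork(std::uint64_t n, std::ostream& os = std::cout) {
  assert(n > 0);
  Tick tick(__func__, os);  // assumption: __FN__ expands to something like __func__
  std::uint64_t acc = 1;
  for (std::uint64_t i = 0; i < n; ++i) {
    acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
  }
  return acc;  // returned so the loop has an observable result
}

// Runs the workload once and returns everything Tick printed.
std::string RunDemo() {
  std::ostringstream out;
  const std::uint64_t result = SpinWork(1000000, out);
  out << "result: " << result << "\n";
  return out.str();
}
```

Using steady_clock keeps the measurement monotonic, and returning the accumulated value plays the same role as the g != G1Zero() check in the test code.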
The results of the test (n = 10,000,000) are:
For the old version:
==> bool TestEccDbl <==
bool TestEccDbl tick: 3142 ms
==> bool TestEccAdd <==
bool TestEccAdd tick: 4225 ms
For the latest version:
==> bool TestEccDbl <==
bool TestEccDbl tick: 3552 ms
==> bool TestEccAdd <==
bool TestEccAdd tick: 4319 ms
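To put these timings in relative terms, the small sketch below computes the percentage slowdown and the average time per operation from the reported millisecond totals. The function names are made up for this illustration; only the four timings and n come from the report. With these inputs, dbl comes out roughly 13% slower and add roughly 2% slower.

```cpp
#include <cassert>
#include <cstdio>

// Relative slowdown of `after_ms` compared with `before_ms`, in percent.
double SlowdownPercent(double before_ms, double after_ms) {
  assert(before_ms > 0.0);
  return (after_ms - before_ms) / before_ms * 100.0;
}

// Average cost of a single operation in nanoseconds.
double NsPerOp(double total_ms, long long iterations) {
  assert(iterations > 0);
  return total_ms * 1e6 / static_cast<double>(iterations);
}

// Prints a summary for the timings reported above (n = 10,000,000).
void PrintSummary() {
  const long long n = 10000000;
  std::printf("dbl: %.2f%% slower (%.1f ns/op -> %.1f ns/op)\n",
              SlowdownPercent(3142.0, 3552.0), NsPerOp(3142.0, n),
              NsPerOp(3552.0, n));
  std::printf("add: %.2f%% slower (%.1f ns/op -> %.1f ns/op)\n",
              SlowdownPercent(4225.0, 4319.0), NsPerOp(4225.0, n),
              NsPerOp(4319.0, n));
}
```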
Compile flags:
CXXFLAGS := \
-g3 \
-fPIC \
-std=c++17 \
-Wall \
-Wextra \
-gdwarf-2 \
-gstrict-dwarf \
-Wno-parentheses \
-Wdeprecated-declarations \
-fmerge-all-constants \
-march=native \
-mtune=native \
Thank you for the precise report. I have one more question. Which curve parameter do you use?
Oh, forgive me, I forgot the most important information. The curve is bn254.
I compared master HEAD (e0f7f5d) with 17d2b0c using the following code, but the results were the same. Could you try this test?
My test environment:
OS: Ubuntu 18.04.4 LTS
CPU: Intel Core i7-7700 CPU @ 3.60GHz
Compiler: gcc 7.4.0
cd mcl
make clean && make lib/libmcl.a MCL_USE_OPENSSL=0
g++ -Ofast t.cpp -DNDEBUG -DMCL_DONT_USE_OPENSSL lib/libmcl.a -I ./include/ -lgmp -lgmpxx && repeat 5 ./a.out
master HEAD
add 1.073Kclk
dbl 577.64 clk
add 1.077Kclk
dbl 576.83 clk
add 1.079Kclk
dbl 577.76 clk
add 1.076Kclk
dbl 576.95 clk
add 1.075Kclk
dbl 575.63 clk
17d2b0c
add 1.064Kclk
dbl 574.55 clk
add 1.074Kclk
dbl 578.40 clk
add 1.075Kclk
dbl 578.15 clk
add 1.075Kclk
dbl 578.76 clk
add 1.079Kclk
dbl 580.67 clk
t.cpp
#include <mcl/bn256.hpp>
#include <cybozu/benchmark.hpp>
using namespace mcl::bn;
int main()
try
{
const int N = 10000000;
initPairing();
G1 P, Q;
hashAndMapToG1(P, "abc", 3);
P += P;
G1::dbl(Q, P);
CYBOZU_BENCH_C("add", N, G1::add, P, Q, P);
CYBOZU_BENCH_C("dbl", N, G1::dbl, P, P);
} catch (std::exception& e) {
printf("err %s\n", e.what());
return 1;
}
That is really weird. Here are my results.
17d2b0c
add 1.388Kclk
dbl 750.99 clk
add 1.386Kclk
dbl 747.80 clk
add 1.393Kclk
dbl 755.13 clk
add 1.396Kclk
dbl 755.55 clk
add 1.355Kclk
dbl 739.97 clk
HEAD
add 1.688Kclk
dbl 845.51 clk
add 1.675Kclk
dbl 838.87 clk
add 1.662Kclk
dbl 830.21 clk
add 1.653Kclk
dbl 840.71 clk
add 1.684Kclk
dbl 845.45 clk
Should I use MCL_USE_LLVM or MCL_USE_XBYAK? What is the difference between them? And what happens if neither of them is defined?
I tried another PC and got the same result. Xeon Platinum 8280 CPU 2.70GHz + gcc 9.9.1 + Ubuntu 19.10
HEAD
add 804.65 clk
dbl 429.93 clk
add 800.06 clk
dbl 426.03 clk
add 804.36 clk
dbl 429.83 clk
add 801.26 clk
dbl 430.40 clk
add 807.46 clk
dbl 429.84 clk
17d2b0c
add 825.81 clk
dbl 437.32 clk
add 811.44 clk
dbl 428.79 clk
add 811.53 clk
dbl 429.48 clk
add 811.97 clk
dbl 431.40 clk
add 812.47 clk
dbl 429.57 clk
The difference in add/dbl between HEAD and 17d2b0c in include/mcl/ec.hpp is small, so I don't think it affects speed.
git diff 17d2b0c include/mcl/ec.hpp
I noticed that your scores are much larger than the ones above. Did you change any system parameters? (e.g., is SELinux enabled? What does sudo cpupower frequency-info report?)
MCL_USE_XBYAK is automatically defined in op.hpp. MCL_USE_LLVM is used only when libmcl.a is built.
On the x64 architecture, mcl uses functions written with Xbyak if possible. If that is not possible (i.e., JIT cannot be used), mcl uses functions generated by LLVM.
I found something.
I did not enable -DMCL_USE_LLVM when I compiled mcl. (I wrote my own makefile that compiles the mcl files together with some other cpp files into one binary, and I missed -DMCL_USE_LLVM, so the files in the asm folder were not linked.)
In that situation, my guess is that mcl tries to use Xbyak, which fails for some reason (system security policy?), and since no asm is linked, it falls back to the ordinary cpp implementations (I guess there are some differences in this part).
This would also explain why my scores are much larger than yours.
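The fallback order discussed here can be sketched as a small decision function. This is only an illustration of the described behaviour, not mcl's actual code: the enum, the struct, and the function names are hypothetical, and the final fallback to portable C++ reflects the guess above rather than anything confirmed in the thread.

```cpp
#include <cassert>
#include <string>

// Hypothetical labels for the three code paths discussed in this thread.
enum class Backend { kXbyakJit, kLlvmAsm, kPortableCpp };

// Hypothetical description of a build and its run-time environment.
struct BuildInfo {
  bool jit_available;    // Xbyak can generate and execute code at run time
  bool llvm_asm_linked;  // LLVM-generated asm was linked (built with MCL_USE_LLVM)
};

// Prefer the JIT path, then the LLVM-generated functions, and otherwise
// fall back to plain C++ (assumed, see the lead-in).
Backend SelectBackend(const BuildInfo& info) {
  if (info.jit_available) return Backend::kXbyakJit;
  if (info.llvm_asm_linked) return Backend::kLlvmAsm;
  return Backend::kPortableCpp;
}

// Human-readable name for logging which path was chosen.
std::string BackendName(Backend b) {
  switch (b) {
    case Backend::kXbyakJit:
      return "Xbyak JIT";
    case Backend::kLlvmAsm:
      return "LLVM asm";
    case Backend::kPortableCpp:
      return "portable C++";
  }
  assert(false && "unhandled backend");
  return "unknown";
}
```

Under this model, a build without MCL_USE_LLVM on a machine where JIT is blocked ends up on the slowest path, which is consistent with the larger clock counts reported above.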
Could you please tell me how I can find out which code is actually used at run time: Xbyak, LLVM, or C code?
mcl::fp::isEnableJIT() returns true if JIT is available. cf. https://github.com/herumi/mcl/blob/master/test/bn_test.cpp#L429
Changing #if 0 to #if 1 at https://github.com/herumi/mcl/blob/master/src/fp.cpp#L414 shows what is used.
Could you give me some extra information?
cat /proc/cpuinfo
sudo cpupower frequency-info
sudo perf stat ./a.out ; binary of the above t.cpp
add 823.16 clk
dbl 443.83 clk
Performance counter stats for './a.out':
3970.269595 task-clock (msec) # 1.000 CPUs utilized
3 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
151 page-faults # 0.038 K/sec
18176581787 cycles # 4.578 GHz
47289392507 instructions # 2.60 insn per cycle
1841608367 branches # 463.850 M/sec
49214 branch-misses # 0.00% of all branches
3.970464078 seconds time elapsed
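The derived columns in the perf stat output follow directly from the raw counters: the effective clock is cycles divided by task-clock time, and instructions per cycle is instructions divided by cycles. The sketch below reproduces both from the numbers above; the struct and function names are invented for this example.

```cpp
#include <cassert>
#include <cstdio>

// Raw counters copied from one perf stat run.
struct PerfSample {
  double task_clock_ms;
  double cycles;
  double instructions;
};

// Effective clock in GHz: cycles per nanosecond of task-clock time.
double EffectiveGHz(const PerfSample& s) {
  assert(s.task_clock_ms > 0.0);
  return s.cycles / (s.task_clock_ms * 1e6);
}

// Average number of instructions retired per cycle.
double InsnPerCycle(const PerfSample& s) {
  assert(s.cycles > 0.0);
  return s.instructions / s.cycles;
}

// Prints the derived metrics for the run shown above.
void PrintDerived() {
  const PerfSample s{3970.269595, 18176581787.0, 47289392507.0};
  std::printf("%.3f GHz, %.2f insn per cycle\n", EffectiveGHz(s),
              InsnPerCycle(s));
}
```

Comparing these two values between machines helps separate a frequency difference from a difference in the executed code path.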
cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 1864.658
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 0
cpu cores : 6
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 908.111
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 1
cpu cores : 6
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 840.091
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 2
cpu cores : 6
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 1558.815
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 3
cpu cores : 6
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 3148.036
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 4
cpu cores : 6
apicid : 8
initial apicid : 8
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping : 10
microcode : 0xca
cpu MHz : 1918.025
cache size : 12288 KB
physical id : 0
siblings : 6
core id : 5
cpu cores : 6
apicid : 10
initial apicid : 10
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips : 7392.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
sudo cpupower frequency-info
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: Cannot determine or is not supported.
hardware limits: 800 MHz - 4.70 GHz
available cpufreq governors: performance powersave
current policy: frequency should be within 800 MHz and 4.70 GHz.
The governor "powersave" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.87 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes
sudo perf stat ./a.out
add 959.22 clk
dbl 514.75 clk
Performance counter stats for './a.out':
3990.398013 task-clock (msec) # 1.000 CPUs utilized
7 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
147 page-faults # 0.037 K/sec
18,301,483,155 cycles # 4.586 GHz
47,260,035,463 instructions # 2.58 insn per cycle
1,811,699,201 branches # 454.015 M/sec
50,901 branch-misses # 0.00% of all branches
3.990916122 seconds time elapsed
add 959.22 clk dbl 514.75 clk
These values seem to be different from those reported at https://github.com/herumi/mcl/issues/31#issuecomment-589753307
Has something changed?
I changed the test computer because the original one is a server on which I cannot use sudo.
Let me do some more digging on this new computer; I may have made some mistakes.
I measured add/dbl with mcl::BN_SNARK1 for both master and 17d2b0c and got the results below:
repeat 5 ./a.out
add 1.054Kclk
dbl 565.28 clk
add 1.045Kclk
dbl 563.78 clk
add 1.054Kclk
dbl 565.86 clk
add 1.047Kclk
dbl 562.32 clk
add 1.054Kclk
dbl 565.57 clk
May I close this issue?
https://github.com/herumi/mcl/issues/30#issuecomment-423969417