herumi / mcl

a portable and fast pairing-based cryptography library
BSD 3-Clause "New" or "Revised" License

multiexp #31

Closed herumi closed 4 years ago

herumi commented 5 years ago

https://github.com/herumi/mcl/issues/30#issuecomment-423969417

herumi commented 4 years ago

https://github.com/herumi/mcl/commit/3d619e0055cda458e6270ae7a943ccf490230e18

huyuguang commented 4 years ago

I just tested the latest version and found that you rewrote many implementations. Some operations in the latest version are slightly (about 5%) slower than in the original version, especially EcT::dbl and EcT::add.

Why? Maybe I am missing something, for example some compiler options?

herumi commented 4 years ago

Thank you for testing. Could you please tell me the OS, CPU, compiler, compile options, and the old version you used?

huyuguang commented 4 years ago

Thanks for your reply.

OS: Ubuntu 18
Compiler: gcc 7.3
CPU: 2 * (Xeon 2.4G, 12 cores)

The test code uses only 1 thread and only calls EcT::dbl and EcT::add.

I did not set any special compile options (only -DMCL_DONT_USE_OPENSSL). The old version is:

commit 17d2b0c18dc501c7fc7c9ff311572e08a8699ded
Author: MITSUNARI Shigeo herumi@nifty.com
Date: Mon Jun 24 05:24:26 2019 +0900

[java] jni for BN254(old)
huyuguang commented 4 years ago

I wrote a simple test program, which looks like this:

bool TestEccDbl(int64_t n) {
  Tick tick(__FN__);
  G1 g = G1Rand();
  for (int64_t i = 0; i < n; ++i) {
    G1::dbl(g,g);
  }
  return g != G1Zero(); // use the result; otherwise the compiler might optimize away the whole loop
}

bool TestEccAdd(int64_t n) {
  Tick tick(__FN__);
  G1 g = G1Rand();
  G1 h = G1Rand();
  for (int64_t i = 0; i < n; ++i) {
    g = g + h;
  }
  return g != G1Zero();
}
bool Test(int64_t n) {
  return TestEccDbl(n) && TestEccAdd(n);
}

The result of the test (n=10,000,000) is

For old version:

==> bool TestEccDbl
<== bool TestEccDbl tick: 3142 ms
==> bool TestEccAdd
<== bool TestEccAdd tick: 4225 ms

For latest version

==> bool TestEccDbl
<== bool TestEccDbl tick: 3552 ms
==> bool TestEccAdd
<== bool TestEccAdd tick: 4319 ms

huyuguang commented 4 years ago

Compile flags:

CXXFLAGS := \
 -g3 \
 -fPIC \
 -std=c++17 \
 -Wall \
 -Wextra \
 -gdwarf-2 \
 -gstrict-dwarf \
 -Wno-parentheses \
 -Wdeprecated-declarations \
 -fmerge-all-constants  \
 -march=native \
 -mtune=native \
herumi commented 4 years ago

Thank you for the precise report. I have one more question. Which curve parameter do you use?

huyuguang commented 4 years ago

Oh, forgive me, I forgot the most important information. The curve is bn254.

herumi commented 4 years ago

I compared master HEAD (e0f7f5d) with 17d2b0c using the following code, and the results were the same. Could you try this test?

My test environment:
OS: Ubuntu 18.04.4 LTS
CPU: Intel Core i7-7700 CPU @ 3.60GHz
Compiler: gcc 7.4.0

cd mcl
make clean && make lib/libmcl.a MCL_USE_OPENSSL=0
g++ -Ofast t.cpp -DNDEBUG -DMCL_DONT_USE_OPENSSL lib/libmcl.a -I ./include/ -lgmp -lgmpxx && repeat 5 ./a.out

master HEAD

add   1.073Kclk
dbl 577.64 clk
add   1.077Kclk
dbl 576.83 clk
add   1.079Kclk
dbl 577.76 clk
add   1.076Kclk
dbl 576.95 clk
add   1.075Kclk
dbl 575.63 clk

17d2b0c

add   1.064Kclk
dbl 574.55 clk
add   1.074Kclk
dbl 578.40 clk
add   1.075Kclk
dbl 578.15 clk
add   1.075Kclk
dbl 578.76 clk
add   1.079Kclk
dbl 580.67 clk

t.cpp

#include <mcl/bn256.hpp>
#include <cybozu/benchmark.hpp>
using namespace mcl::bn;
int main()
    try
{
    const int N = 10000000;
    initPairing();
    G1 P, Q;
    hashAndMapToG1(P, "abc", 3);
    P += P;
    G1::dbl(Q, P);
    CYBOZU_BENCH_C("add", N, G1::add, P, Q, P);
    CYBOZU_BENCH_C("dbl", N, G1::dbl, P, P);
} catch (std::exception& e) {
    printf("err %s\n", e.what());
    return 1;
}
huyuguang commented 4 years ago

This is really weird.

17d2b0c

add 1.388Kclk
dbl 750.99 clk
add 1.386Kclk
dbl 747.80 clk
add 1.393Kclk
dbl 755.13 clk
add 1.396Kclk
dbl 755.55 clk
add 1.355Kclk
dbl 739.97 clk

HEAD

add 1.688Kclk
dbl 845.51 clk
add 1.675Kclk
dbl 838.87 clk
add 1.662Kclk
dbl 830.21 clk
add 1.653Kclk
dbl 840.71 clk
add 1.684Kclk
dbl 845.45 clk

huyuguang commented 4 years ago

Should I use MCL_USE_LLVM or MCL_USE_XBYAK? What's the difference between them? And what happens if neither of them is defined?

herumi commented 4 years ago

I tried another PC and got the same result. Xeon Platinum 8280 CPU 2.70GHz + gcc 9.9.1 + Ubuntu 19.10

HEAD

add 804.65 clk
dbl 429.93 clk
add 800.06 clk
dbl 426.03 clk
add 804.36 clk
dbl 429.83 clk
add 801.26 clk
dbl 430.40 clk
add 807.46 clk
dbl 429.84 clk

17d2b0c

add 825.81 clk
dbl 437.32 clk
add 811.44 clk
dbl 428.79 clk
add 811.53 clk
dbl 429.48 clk
add 811.97 clk
dbl 431.40 clk
add 812.47 clk
dbl 429.57 clk

The difference in add/dbl between HEAD and 17d2b0c in include/mcl/ec.hpp is small, so I don't think it affects speed.

git diff 17d2b0c include/mcl/ec.hpp

Your scores are much larger than the ones above. Did you change any system parameters? (e.g., is SELinux enabled? What does sudo cpupower frequency-info show?)

MCL_USE_XBYAK is automatically defined in op.hpp. MCL_USE_LLVM is used only when libmcl.a is built.

On x64 architecture, mcl uses functions written in Xbyak if possible. Otherwise (i.e., if JIT cannot be used), mcl uses functions generated by LLVM.

huyuguang commented 4 years ago

I found something.

I found that I did not enable -DMCL_USE_LLVM when compiling mcl. (I wrote my own makefile that compiles the mcl files together with some other cpp files into one binary, and it was missing -DMCL_USE_LLVM.) As a result, the files in the asm folder were not linked.

In that situation, my guess is that mcl tries to use Xbyak, fails for some reason (a system security policy?), and, because no asm is linked, falls back to the ordinary C++ implementations (I guess there are some differences in this part).

That would also explain why my scores are much larger than yours.

huyuguang commented 4 years ago

Could you please tell me how I can find out which code is used at run time: Xbyak, LLVM, or C code?

herumi commented 4 years ago

Could you please tell me how I can find out which code is used at run time: Xbyak, LLVM, or C code?

mcl::fp::isEnableJIT() returns true if JIT is available. cf. https://github.com/herumi/mcl/blob/master/test/bn_test.cpp#L429

Changing #if 0 to #if 1 at https://github.com/herumi/mcl/blob/master/src/fp.cpp#L414 shows what is used.

herumi commented 4 years ago

Could you tell me some extra information?

cat /proc/cpuinfo
sudo cpupower frequency-info
sudo perf stat ./a.out ; binary of the above t.cpp
add 823.16 clk
dbl 443.83 clk

 Performance counter stats for './a.out':

       3970.269595      task-clock (msec)         #    1.000 CPUs utilized
                 3      context-switches          #    0.001 K/sec
                 0      cpu-migrations            #    0.000 K/sec
               151      page-faults               #    0.038 K/sec
       18176581787      cycles                    #    4.578 GHz
       47289392507      instructions              #    2.60  insn per cycle
        1841608367      branches                  #  463.850 M/sec
             49214      branch-misses             #    0.00% of all branches

       3.970464078 seconds time elapsed
huyuguang commented 4 years ago

cat /proc/cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 158
model name  : Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
stepping    : 10
microcode   : 0xca
cpu MHz     : 1864.658
cache size  : 12288 KB
physical id : 0
siblings    : 6
core id     : 0
cpu cores   : 6
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 22
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs        : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit
bogomips    : 7392.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Processors 1 to 5 report the same values as processor 0 except for the fields below:

processor   cpu MHz     core id   apicid   initial apicid
1           908.111     1         2        2
2           840.091     2         4        4
3           1558.815    3         6        6
4           3148.036    4         8        8
5           1918.025    5         10       10

sudo cpupower frequency-info

analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency:  Cannot determine or is not supported.
  hardware limits: 800 MHz - 4.70 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 800 MHz and 4.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 3.87 GHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

sudo perf stat ./a.out

add 959.22 clk
dbl 514.75 clk

 Performance counter stats for './a.out':

       3990.398013      task-clock (msec)         #    1.000 CPUs utilized          
                 7      context-switches          #    0.002 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               147      page-faults               #    0.037 K/sec                  
    18,301,483,155      cycles                    #    4.586 GHz                    
    47,260,035,463      instructions              #    2.58  insn per cycle         
     1,811,699,201      branches                  #  454.015 M/sec                  
            50,901      branch-misses             #    0.00% of all branches        

       3.990916122 seconds time elapsed
herumi commented 4 years ago

add 959.22 clk
dbl 514.75 clk

These values seem to be different from those reported at https://github.com/herumi/mcl/issues/31#issuecomment-589753307

Has something changed?

huyuguang commented 4 years ago

add 959.22 clk
dbl 514.75 clk

These values seem to be different from those reported at #31 (comment)

Has something changed?

I changed the test computer because the original one is a server on which I cannot use sudo.

Let me do some more digging on this new computer. I think I may have made some mistakes.

herumi commented 4 years ago

I measured add/dbl with mcl::BN_SNARK1 for master and 17d2b0c and got the results below:

repeat 5 ./a.out
add   1.054Kclk
dbl 565.28 clk
add   1.045Kclk
dbl 563.78 clk
add   1.054Kclk
dbl 565.86 clk
add   1.047Kclk
dbl 562.32 clk
add   1.054Kclk
dbl 565.57 clk

May I close this issue?