h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0

fread in parallel mode, multi-core performance decreases, when the number of cores exceeds 20 #2262

Open chi2liu opened 4 years ago

chi2liu commented 4 years ago

During an fread operation, performance decreases significantly when more than 20 threads are used. According to Amdahl's law, performance should level off as the degree of parallelism increases, so why is there a significant decrease? As a result, the default nthreads setting does not give the best performance on some machines with more than 20 cores; it gives poor performance instead.
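
To make the expectation concrete, here is a minimal sketch of Amdahl's law. The 95% parallel fraction is an assumed value chosen for illustration, not a measurement of fread. Under this model the ideal speedup flattens out as threads are added but never drops, which is why the slowdown past 20 threads is surprising.

```python
def amdahl_speedup(parallel_fraction: float, nthreads: int) -> float:
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nthreads)


# Assumption for illustration only: 95% of the work can run in parallel.
PARALLEL_FRACTION = 0.95

for n in (1, 4, 8, 16, 20, 28, 32, 40):
    ideal = amdahl_speedup(PARALLEL_FRACTION, n)
    print(f"nthreads={n:2d}  ideal speedup={ideal:.2f}")
```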

Machine environment: 40 logical cores, 20 physical cores, hyper-threading turned on

Data size is 10 GB, yellow_tripdata CSV

[Performance data is the cumulative time of ten runs]

nthreads performance
1 360.42344781407155
4 100.93926633079536
8 59.12695063999854
16 39.5492869140580
20 37.37294516689144
28 93.49912556796335
32 293.03483925387263
40 slower than single thread
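
A measurement of this kind could be scripted roughly as follows. This is a sketch, not the script used for the numbers above: `time_reads` and `read_with_datatable` are hypothetical helper names, and the file path is a placeholder. It relies on the `nthreads` parameter of `datatable.fread`.

```python
import time
from typing import Callable, Dict, Iterable


def time_reads(read_fn: Callable[[int], object],
               thread_counts: Iterable[int],
               repeats: int = 10) -> Dict[int, float]:
    """Return the cumulative wall-clock time of `repeats` calls per thread count."""
    results = {}
    for n in thread_counts:
        start = time.perf_counter()
        for _ in range(repeats):
            read_fn(n)
        results[n] = time.perf_counter() - start
    return results


def read_with_datatable(nthreads: int, path: str = "yellow_tripdata.csv"):
    """Read a CSV with datatable; the default path is a placeholder."""
    import datatable as dt  # imported lazily so the harness itself has no dependency
    return dt.fread(path, nthreads=nthreads)


# Example call (requires datatable and a real file at the given path):
# time_reads(read_with_datatable, [1, 4, 8, 16, 20, 28, 32, 40])
```

Repeated reads of the same file are likely to be served from the operating system's page cache after the first run, so the totals mostly reflect parsing rather than disk speed.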

processor : 39
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping : 1
microcode : 0xb000036
cpu MHz : 2394.495
cache size : 25600 KB
physical id : 1
siblings : 20
core id : 12
cpu cores : 10
apicid : 57
initial apicid : 57
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes

st-pasha commented 4 years ago

I tried to replicate this experiment, and here are the results that I've observed:

nthreads time
1 15.56
2 9.44
3 6.43
4 5.21
5 4.34
6 3.69
8 2.93
10 2.59
12 2.17
14 1.96
16 1.76
20 1.62
24 1.59
28 1.46
32 1.49
36 1.59
38 1.56
40 1.55
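
For easier comparison, the timings above can be converted to speedup and per-thread efficiency relative to the single-threaded run. A minimal sketch, with the numbers copied from the table and `speedups` as a hypothetical helper name:

```python
# Timings copied from the table above: nthreads -> time.
timings = {
    1: 15.56, 2: 9.44, 3: 6.43, 4: 5.21, 5: 4.34, 6: 3.69, 8: 2.93,
    10: 2.59, 12: 2.17, 14: 1.96, 16: 1.76, 20: 1.62, 24: 1.59,
    28: 1.46, 32: 1.49, 36: 1.59, 38: 1.56, 40: 1.55,
}


def speedups(times):
    """Speedup of each run relative to the single-threaded time."""
    baseline = times[1]
    return {n: baseline / t for n, t in times.items()}


for n, s in speedups(timings).items():
    # Efficiency is the speedup divided by the number of threads used.
    print(f"{n:2d} threads: speedup {s:5.2f}, efficiency {s / n:4.2f}")
```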

So, after 20 threads there is neither benefit nor harm in using extra threads. This is all from reading an 8GB file on a machine with the same CPU as yours:

vendor_id: GenuineIntel
brand: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hz_advertised: 2.2000 GHz
hz_actual: 1.2373 GHz
hz_advertised_raw: [2200000000, 0]
hz_actual_raw: [1237284000, 0]
stepping: 1
model: 79
family: 6
flags: ['3dnowprefetch', 'abm', 'acpi', 'adx', 'aes', 'aperfmperf', 'apic', 'arat', 'arch_perfmon', 'avx', 'avx2', 'bmi1', 'bmi2', 'bts', 'cat_l3', 'cdp_l3', 'clflush', 'cmov', 'constant_tsc', 'cpuid', 'cpuid_fault', 'cqm', 'cqm_llc', 'cqm_mbm_local', 'cqm_mbm_total', 'cqm_occup_llc', 'cx16', 'cx8', 'dca', 'de', 'ds_cpl', 'dtes64', 'dtherm', 'dts', 'epb', 'ept', 'erms', 'est', 'f16c', 'flexpriority', 'flush_l1d', 'fma', 'fpu', 'fsgsbase', 'fxsr', 'hle', 'ht', 'ibpb', 'ibrs', 'ida', 'intel_ppin', 'intel_pt', 'invpcid', 'invpcid_single', 'lahf_lm', 'lm', 'mca', 'mce', 'md_clear', 'mmx', 'monitor', 'movbe', 'msr', 'mtrr', 'nonstop_tsc', 'nopl', 'nx', 'osxsave', 'pae', 'pat', 'pbe', 'pcid', 'pclmulqdq', 'pdcm', 'pdpe1gb', 'pebs', 'pge', 'pln', 'pni', 'popcnt', 'pqe', 'pqm', 'pse', 'pse36', 'pti', 'pts', 'rdrand', 'rdrnd', 'rdseed', 'rdt_a', 'rdtscp', 'rep_good', 'rtm', 'sdbg', 'sep', 'smap', 'smep', 'smx', 'ss', 'ssbd', 'sse', 'sse2', 'sse4_1', 'sse4_2', 'ssse3', 'stibp', 'syscall', 'tm', 'tm2', 'tpr_shadow', 'tsc', 'tsc_adjust', 'tsc_deadline_timer', 'tscdeadline', 'vme', 'vmx', 'vnmi', 'vpid', 'x2apic', 'xsave', 'xsaveopt', 'xtopology', 'xtpr']
l3_cache_size: 25600 KB
l2_cache_size: 256 KB
l1_data_cache_size: 32 KB
l1_instruction_cache_size: 32 KB
l2_cache_line_size: 6
l2_cache_associativity: 0x100
extended_model: 4

So I'm not sure what causes the performance degradation for you... Could it be the operating system? I've tested this on Ubuntu 16.04.6 LTS (GNU/Linux 4.15.0-51-generic x86_64).

pseudotensor commented 4 years ago

@arnocandel want to chime in?

chi2liu commented 3 years ago

Is hyper-threading enabled on your machine? I found that my results come from hyper-threading being enabled on my machine: there are only 20 physical cores, so the best performance is reached at 20 threads. We also tested other machines with hyper-threading enabled. Setting the number of parallel threads to the number of physical cores instead of the number of logical cores improves performance.
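
One way to apply this on Linux is to count physical cores from /proc/cpuinfo and use that as the thread count. The sketch below assumes the x86-style `physical id` and `core id` fields shown earlier in this thread; the function names are hypothetical, and on systems that do not report those fields the count comes back as 0, so a fallback would be needed.

```python
def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" makes sure the last processor block is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical processor's block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "physical id":
            physical_id = value
        elif key == "core id":
            core_id = value
    return len(cores)


def physical_core_count(path: str = "/proc/cpuinfo") -> int:
    """Read the given cpuinfo file and return the number of physical cores."""
    with open(path) as f:
        return count_physical_cores(f.read())
```

The result can then be passed as `nthreads` to `datatable.fread`, or assigned to `datatable.options.nthreads` to change the default for the session.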