(Not directly related, rather curious) Did you investigate the impact on performance of both options? IIRC, one of the two did make little to no difference.
On my desktop machine (some new Core i5), I tried a bit. intra made a big difference, inter only a bit. What also helped is reducing the batch size to 32 and increasing the readahead to ~80. That does about 400 sentences per second on this 4 core / 8 thread desktop-class CPU.
I also tried Tensorflow with MKL-DNN (and experimented with different numbers of MKL threads). This was generally terrible for RNNs and not great for transformers.
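For context, intra and inter here refer to TensorFlow's two session-level thread pools. A minimal sketch of the two settings, using the Python TF 1.x API purely for illustration (sticker itself is written in Rust, and the thread counts below are just example values):

```python
# Illustration only: the two thread-pool settings discussed in this thread,
# expressed with the Python TF 1.x API. Values are example values.
import tensorflow as tf

config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,  # threads used inside a single op (e.g. a matmul)
    inter_op_parallelism_threads=4,  # how many independent ops may run concurrently
)
sess = tf.compat.v1.Session(config=config)
```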
Would it make sense to change the defaults from 4/4 to 6/2 then?
I had a similar experience that smaller batch size helps with speed.
Is the 400 sentences per second with the 3 layer model?
It does not really hurt to have more inter-op threads either, so I think it's fine to have 4/4 as well. On non-high-end machines it actually makes sense to set this to the number of logical CPUs (so, with hyperthreading). But I am a bit worried that people forget to set this on machines with a large number of CPUs. Maybe we can change the default to something like the number of logical CPUs, capped at 8.
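A sketch of that proposed default, assuming "logical CPUs" means what os.cpu_count() reports (i.e. hyperthreads included); the function name is just for illustration:

```python
# Hypothetical default discussed above: the number of logical CPUs, capped at 8.
import os

def default_op_threads(cap: int = 8) -> int:
    logical_cpus = os.cpu_count() or cap  # os.cpu_count() can return None
    return min(logical_cpus, cap)
```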
The smaller batch size helping may have something to do with less pressure on the allocator plus fewer cache misses. Though I didn't measure that yet.
That's the transformer with the default settings. IIRC the RNN was ~300. Edit: the 3 layer, 400 units RNN does ~250 sentences per second.
Interesting, I'm getting 200 sentences per second on hopper with batch size 32 and readahead 80 with 4/4 threads.
Are you using Tensorflow compiled with FMA + AVX optimizations?
(Or even AVX2 or AVX512?)
Compiled on hopper with -march=native optimizations; TF prints no warnings.
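Somewhat related: a quick way to check which of these extensions the host CPU itself reports on Linux (this only shows hardware support, not what the TensorFlow build was compiled with; TF's startup warnings cover the latter):

```python
# Check which SIMD/FMA extensions the CPU advertises (Linux only); flag names
# are the ones used in /proc/cpuinfo.
def cpu_flags(path: str = "/proc/cpuinfo") -> set:
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print({ext: ext in flags for ext in ("fma", "avx", "avx2", "avx512f")})
```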
Interesting! hopper should definitely beat my machine. Would still be interesting to try with just FMA + AVX, since AVX512 and heavy AVX2 lower the CPU frequency (which BTW can affect other processes on the same core group). Also see:
Frequency table for hopper's CPUs:
https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencies
hopper lscpu: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Possible that your i5 can achieve higher clocks?
About the same, 3.8 GHz vs 3.7 GHz. As mentioned above, it would be nice to try with just AVX + FMA. (I guess other processes would also need to be running without AVX2/AVX512 to avoid the worst frequency scaling.)
Tarball here: https://blob.danieldk.eu/libtensorflow/libtensorflow-cpu-linux-x86_64-avx-fma-1.15.0.tar.gz
Ah, the culprit: I use 8/8, not 4/4.
First run with the AVX + FMA build: 197 sentences per second.
With 8 logical cores and just AVX + FMA?
With 8/8 it's still 200-220 sentences per second. You're using an SSD, maybe faster reads and writes help?
Indeed, 8 logical cores and just AVX + FMA.
Shouldn't make a difference; hopper has so much memory that the data should just be in the buffer cache. And even with a cold start, reading our validation/held-out data should only take a fraction of a second.
AVX + FMA: 190-200 sentences per second.
Performance counters on my machine:
Performance counter stats for '/home/daniel/git/sticker/target/release/sticker tag transformer-gen1-finetune/sticker.conf --inter-op-threads 8 --intra-op-threads 8 --batchsize 32 --readahead 80 --input dev-clean.conll --output t-dev.conll':
110,836.79 msec task-clock # 5.238 CPUs utilized
337,524 context-switches # 3045.256 M/sec
66,618 cpu-migrations # 601.050 M/sec
2,456,885 page-faults # 22166.850 M/sec
360,034,338,774 cycles # 3248351.968 GHz
546,494,871,383 instructions # 1.52 insn per cycle
24,532,865,580 branches # 221343837.562 M/sec
151,471,436 branch-misses # 0.62% of all branches
21.161685396 seconds time elapsed
104.522825000 seconds user
7.600818000 seconds sys
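For what it's worth, the derived columns in that output follow directly from the raw counters, e.g.:

```python
# Re-deriving two of perf's summary numbers from the raw counters above.
task_clock_ms = 110_836.79
elapsed_s = 21.161685396
cycles = 360_034_338_774
instructions = 546_494_871_383

cpus_utilized = (task_clock_ms / 1000.0) / elapsed_s  # ~5.24 CPUs utilized
ipc = instructions / cycles                           # ~1.52 instructions per cycle
print(cpus_utilized, ipc)
```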
One's on dev, the other on val, right?
Just noticed: I was on a commit from before mixed precision was fixed for CPU execution. Now I get 484 sentences per second with AVX + FMA and 450-460 with AVX2 + AVX512.
Ah nice! Those look like the expected numbers :+1: .
So we should possibly also stop using AVX2/AVX512 (it gets worse with more processes).
Sounds about right. We could do a more exhaustive analysis at some point; maybe AVX2/AVX512 starts to shine with larger batch sizes or larger operations?
These replace the {inter,intra}_op_parallelism_threads options in sticker configuration files. This makes it easier to change the number of threads.