stickeritis / sticker

Succeeded by SyntaxDot: https://github.com/tensordot/syntaxdot

Add {inter,intra}-op-threads command-line options #175

Closed · danieldk closed this 4 years ago

danieldk commented 4 years ago

These replace the {inter,intra}_op_parallelism_threads options in sticker configuration files. This makes it easier to change the number of threads.
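
With this change, the thread counts can be overridden per invocation instead of by editing the configuration file. A hypothetical example (the flags are the ones added in this PR; paths and file names are placeholders):

```
sticker tag sticker.conf --inter-op-threads 8 --intra-op-threads 8 \
    --input corpus.conll --output tagged.conll
```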

twuebi commented 4 years ago

(Not directly related, rather curious.) Did you investigate the impact of both options on performance? IIRC, one of the two made little to no difference.

danieldk commented 4 years ago

> (Not directly related, rather curious.) Did you investigate the impact of both options on performance? IIRC, one of the two made little to no difference.

On my desktop machine (some new Core i5), I tried a bit. intra made a big difference, inter only a bit. What also helped was reducing the batch size to 32 and increasing the readahead to ~80. It does about 400 sentences per second on this 4-core/8-thread desktop-class CPU.

I also tried TensorFlow with MKL-DNN (and experimented with different numbers of MKL threads). This was generally terrible for RNNs and not great for transformers.
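
For context: these two knobs control TensorFlow's two session-level thread pools. Intra-op parallelism splits a single operation (one matmul, say) across threads, while inter-op parallelism runs independent operations concurrently. sticker sets them through its TensorFlow binding; a minimal sketch of the equivalent in the TF 1.x Python API:

```python
import tensorflow as tf  # TensorFlow 1.x

config = tf.ConfigProto(
    intra_op_parallelism_threads=8,  # threads used inside a single op
    inter_op_parallelism_threads=8,  # independent ops that may run concurrently
)
session = tf.Session(config=config)
```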

twuebi commented 4 years ago

> On my desktop machine (some new Core i5), I tried a bit. intra made a big difference, inter only a bit.

Would it make sense to change the defaults from 4/4 to 6/2 then?

> What also helped was reducing the batch size to 32 and increasing the readahead to ~80.

I had a similar experience: a smaller batch size helps with speed.

> It does about 400 sentences per second on this 4-core/8-thread desktop-class CPU.

Is that with 3 layers, 400 units?

danieldk commented 4 years ago

> > On my desktop machine (some new Core i5), I tried a bit. intra made a big difference, inter only a bit.

> Would it make sense to change the defaults from 4/4 to 6/2 then?

It does not really hurt to have more inter-op threads either, so I think 4/4 is fine as well. On machines that are not high-end, it actually makes sense to set this to the number of logical CPUs (so, counting hyperthreading). But I am a bit worried that people forget to set this on machines with a large number of CPUs. Maybe we can change the default to something like the number of logical CPUs, capped at 8.
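
A minimal sketch of that proposed default (hypothetical, not current sticker behavior; the function name is made up):

```python
import os

def default_op_threads(cap: int = 8) -> int:
    # os.cpu_count() reports logical CPUs, i.e. it counts hyperthreads.
    # Cap the result so many-core servers do not get a huge default.
    return min(os.cpu_count() or 1, cap)
```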

> I had a similar experience: a smaller batch size helps with speed.

It may have something to do with less pressure on the allocator plus fewer cache misses, though I haven't measured that yet.
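
(One way to check that would be perf with cache counters; a hypothetical invocation with standard perf events, paths as placeholders:)

```
perf stat -e cache-references,cache-misses,page-faults \
    target/release/sticker tag sticker.conf --batchsize 32 \
    --input dev.conll --output /dev/null
```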

> > It does about 400 sentences per second on this 4-core/8-thread desktop-class CPU.

> Is that with 3 layers, 400 units?

That's the transformer with the default settings. IIRC the RNN was ~300. Edit: the 3-layer, 400-unit RNN does ~250 sentences per second.

twuebi commented 4 years ago

> That's the transformer with the default settings. IIRC the RNN was ~300. Edit: the 3-layer, 400-unit RNN does ~250 sentences per second.

Interesting, I'm getting 200 sentences per second on hopper with bs 32 and readahead 80 with 4/4 threads.

danieldk commented 4 years ago

On Fri, Nov 15, 2019, at 11:16, Tobias Pütz wrote:

> Interesting, I'm getting 200 sentences per second on hopper with bs 32 and readahead 80 with 4/4 threads.

Are you using TensorFlow compiled with FMA + AVX optimizations?

(Or even AVX2 or AVX-512?)

twuebi commented 4 years ago

Compiled on hopper with -march=native in opt mode; TF prints no warnings.

danieldk commented 4 years ago

> Compiled on hopper with -march=native in opt mode; TF prints no warnings.

Interesting! hopper should definitely beat my machine. It would still be interesting to try with just FMA + AVX, since AVX-512 and heavy AVX2 instructions lower the CPU frequency (which, by the way, can affect other processes on the same core group). Also see:

https://en.wikichip.org/wiki/intel/frequency_behavior

danieldk commented 4 years ago

Frequency table for hopper's CPUs:

https://en.wikichip.org/wiki/intel/xeon_gold/6138#Frequencies

twuebi commented 4 years ago

hopper lscpu: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz

Possible that your i5 can achieve higher clocks?

danieldk commented 4 years ago

> hopper lscpu: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
>
> Possible that your i5 can achieve higher clocks?

About the same: 3.8 GHz vs. 3.7 GHz. As mentioned above, it would be nice to try with just AVX + FMA. (I guess other processes would also need to run without AVX2/AVX-512 to avoid the worst frequency scaling.)

Tarball here: https://blob.danieldk.eu/libtensorflow/libtensorflow-cpu-linux-x86_64-avx-fma-1.15.0.tar.gz
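
For reference, building a libtensorflow tarball restricted to AVX + FMA looks roughly like this (TF 1.15 bazel target; the flags are from memory, not the exact command used for the tarball above):

```
bazel build -c opt --copt=-mavx --copt=-mfma \
    //tensorflow/tools/lib_package:libtensorflow
```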

danieldk commented 4 years ago

> > That's the transformer with the default settings. IIRC the RNN was ~300. Edit: the 3-layer, 400-unit RNN does ~250 sentences per second.
>
> Interesting, I'm getting 200 sentences per second on hopper with bs 32 and readahead 80 with 4/4 threads.

Ah, the culprit: I use 8/8, not 4/4.

twuebi commented 4 years ago

> Tarball here: https://blob.danieldk.eu/libtensorflow/libtensorflow-cpu-linux-x86_64-avx-fma-1.15.0.tar.gz

First run: 197 sentences per second.

twuebi commented 4 years ago

> > > That's the transformer with the default settings. IIRC the RNN was ~300. Edit: the 3-layer, 400-unit RNN does ~250 sentences per second.
> >
> > Interesting, I'm getting 200 sentences per second on hopper with bs 32 and readahead 80 with 4/4 threads.
>
> Ah, the culprit: I use 8/8, not 4/4.

With 8 logical cores and just AVX + FMA?

twuebi commented 4 years ago

8/8 still does 200-220 sentences per second. You're using an SSD, maybe faster reads and writes help?

danieldk commented 4 years ago

> With 8 logical cores and just AVX + FMA?

Indeed.

danieldk commented 4 years ago

> 8/8 still does 200-220 sentences per second. You're using an SSD, maybe faster reads and writes help?

It shouldn't make a difference: hopper has so much memory that the data should just be in the buffer cache. And even with a cold start, reading our validation/held-out data should only take a fraction of a second.

twuebi commented 4 years ago

> 8/8 still does 200-220 sentences per second. You're using an SSD, maybe faster reads and writes help?

AVX + FMA: 190-200.

danieldk commented 4 years ago

Performance counters on my machine:

```
 Performance counter stats for '/home/daniel/git/sticker/target/release/sticker tag transformer-gen1-finetune/sticker.conf --inter-op-threads 8 --intra-op-threads 8 --batchsize 32 --readahead 80 --input dev-clean.conll --output t-dev.conll':

         110,836.79 msec task-clock                #    5.238 CPUs utilized
            337,524      context-switches          #    3.045 K/sec
             66,618      cpu-migrations            #    0.601 K/sec
          2,456,885      page-faults               #   22.167 K/sec
    360,034,338,774      cycles                    #    3.248 GHz
    546,494,871,383      instructions              #    1.52  insn per cycle
     24,532,865,580      branches                  #  221.344 M/sec
        151,471,436      branch-misses             #    0.62% of all branches

       21.161685396 seconds time elapsed

      104.522825000 seconds user
        7.600818000 seconds sys
```

twuebi commented 4 years ago

One's on dev, the other on val, right?

Just noticed I was on a commit from before mixed precision was fixed for CPU execution; now I get 484 sentences per second with AVX + FMA and 450-460 with AVX2 + AVX-512.

danieldk commented 4 years ago

> One's on dev, the other on val, right?
>
> Just noticed I was on a commit from before mixed precision was fixed for CPU execution; now I get 484 sentences per second with AVX + FMA and 450-460 with AVX2 + AVX-512.

Ah, nice! Those look like the expected numbers :+1:.

So we should possibly also stop using AVX2/AVX-512 (it gets worse with more processes).

twuebi commented 4 years ago

Sounds about right. I could do a more exhaustive analysis at some point; maybe AVX2/AVX-512 starts to shine with larger batch sizes or larger operations?
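
A rough sketch of such a sweep, driving the command-line options from this thread (file names and the sentence count are placeholders):

```python
import itertools
import subprocess
import time

N_SENTENCES = 10_000  # placeholder: sentence count of the input file

for batch, threads in itertools.product([32, 64, 128, 256], [4, 8]):
    start = time.monotonic()
    subprocess.run(
        ["sticker", "tag", "sticker.conf",
         "--inter-op-threads", str(threads),
         "--intra-op-threads", str(threads),
         "--batchsize", str(batch),
         "--readahead", "80",
         "--input", "dev.conll",
         "--output", "/dev/null"],
        check=True,
    )
    elapsed = time.monotonic() - start
    print(f"batch={batch} threads={threads}: {N_SENTENCES / elapsed:.0f} sent/s")
```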