Hi,

I'm currently trying to get AlphaZero running in full parallelization, but I'm having issues at all levels of parallelization. I'm new to parallelization, so I might also have misunderstood some parts. I'm running it on a machine with 128 CPUS, but I cannot achieve a very high CPU utilization, no matter if I try multi-threading or multi-processing.

Multi-threading (without GPU): I have tried starting julia with different numbers of threads and different AlphaZero parameters, no matter if I start julia -t 128 or julia -t 20, htop only shows a CPU utilization of around 1200% for this process, so only around 12 threads are working. I was wondering if that is due to them waiting for the inference server, but I got similar results when using a very small dummy network. Also, SimParams.num_workers was 128 and batch size 64, so shouldn't other workers continue simulations while some are waiting for the inference server? If the inference is the reason, would I be better off with a small batch size or a large batch size? When not using a GPU, are there benefits of batching inference requests at all? I.e. is it better to use multi-threading or multi-processing on one machine? I also remember seeing a plot of AlphaZero performance over the number of workers somewhere in the AlphaZero.jl documentation or some post you made (i think), but I cannot find it anymore. Do you happen to know which plot I'm referring to?
Multi-processing (without GPU): When using multiple processes (on the same machine, e.g. julia -p 64), htop shows all workers having a high CPU load during self-play. However, if I understand correctly, this is a waste of resources, since each process has to start its own inference server. Or is this better when not using a GPU? What also confused me is that even even when calling single-threaded julia -p 64, htop shows multiple threads belonging to the main process during benchmarking (where AlphaZero does not use multi-processing). This is not problematic, I'm just trying to understand what's happening. I don't see how Util.mapreduce spawns multiple threads since Threads.nthreads() should be 1. Furthermore, it is 8 threads that are working at full load (the ones from julia -p 20 call, not 20, which would be the number of processes). So where does that number 8 come from?
Using GPU: When I try running AlphaZero.jl with GPU, for some reason it becomes incredibly slow, a lot slower than without GPU. htop now shows a CPU usage of around 500%: The machine has multiple GeForce RTX 2080 Ti with 10GB memory. Any ideas what could cause this?

Here are the parameters I used, in case this is relevant:

Network = NetLib.SimpleNet
netparams = NetLib.SimpleNetHP(
  width=128,
  depth_common=2,
  depth_phead=2,
  depth_vhead=2,
)

self_play = SelfPlayParams(
  sim=SimParams(
    num_games=5000,
    num_workers=128,
    batch_size=64,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    num_iters_per_turn=1000,
    cpuct=1.0,
    prior_temperature=1.0,
    temperature=PLSchedule([10, 20, 30, 50], [1.0, 0.8, 0.3, 0.1]),
    dirichlet_noise_ϵ=0.25,
    dirichlet_noise_α=1.,
    k=1., 
    α=0.1,
  )
)

arena = ArenaParams(
  sim=SimParams(
    self_play.sim,
    num_games=128,
    use_gpu=true,
    reset_every=1,
  ),
  mcts=PWMctsParams(
    self_play.mcts,
    temperature=ConstSchedule(0.2),
    dirichlet_noise_ϵ=0.05,
  ),
  update_threshold=0.05)

learning = LearningParams(
  use_gpu=true,
  use_position_averaging=false,
  samples_weighing_policy=CONSTANT_WEIGHT,
  rewards_renormalization=10,
  l2_regularization=1e-4,
  optimiser=Adam(lr=5e-3),
  batch_size=1024,
  loss_computation_batch_size=1024,
  nonvalidity_penalty=1.,
  min_checkpoints_per_epoch=1,
  max_batches_per_checkpoint=5_000,
  num_checkpoints=1)

params = Params(
  arena=arena,
  self_play=self_play,
  learning=learning,
  num_iters=100,
  memory_analysis=nothing,
  ternary_rewards=false,
  use_symmetries=false,
  mem_buffer_size=PLSchedule(
    [      0,        30],
    [400_000, 1_000_000])
)

(PWMctsParams are for a progressive widening MCTS I've implemented for continuous states)

As a general question, I was also wondering about why you removed the asynchronous MCTS version - is there simply no benefit because CPU power can also be used to parallelize over different MCTS tree instead of within the same tree?

Any help is appreciated!

I have tried starting julia with different numbers of threads and different AlphaZero parameters, no matter if I start julia -t 128 or julia -t 20, htop only shows a CPU utilization of around 1200% for this process, so only around 12 threads are working.

First of all, you should note that in some applications, there are legitimate reasons for the CPU utilization to be low. For example, if the main computational cost is to run inference and inference runs on GPU, having a high number of CPUs simply isn't useful as there just isn't enough work for them.

One way to figure out whether or not this is the case is to profile network inference and vanilla MCTS separately and compare the numbers. In your case, you seem to be using a very small network so I doubt inference would be the only bottleneck though. Also, if you are doing inference on CPU, note that the inference server is only going to use a single CPU because of this line, which you may want to try and comment out (restricting inference to one core is sometimes useful to avoid too much cache thrashing but it may backfire here).

Low CPU utilization may also be due (at least in part) to some GC or multitasking overhead. Indeed, even if you are not using a GPU (whose interaction with the Julia GC raises known issues), the Julia GC sometimes has problems at dealing with multiple threads that allocate a lot (this being worked on right now as far as I understand). Unfortunately, the current tools are not great for detecting and quantifying this kind of overhead.

When not using a GPU, are there benefits of batching inference requests at all? I.e. is it better to use multi-threading or multi-processing on one machine?

The benefits will certainly be greatly reduced but even a CPU can exploit some amount of parallelism so batching inference queries can be worth it, up to a point where the overhead of having many workers becomes too large (more allocations, more cache invalidation and context switching...)

I also remember seeing a plot of AlphaZero performance over the number of workers somewhere in the AlphaZero.jl documentation or some post you made (i think), but I cannot find it anymore. Do you happen to know which plot I'm referring to?

You may be referring to this figure, which I created for an old version and haven't regenerated since.

Multi-processing (without GPU):

The only interest of using multiple processes is if you want to leverage multiple GPUs (as CUDA.jl used to have very poor multi-gpu support) or distribute the computtion on a cluster of machines. Also, I believe that in recent versions Julia starts multiple threads by default unless you pass the -t 1 option explicitly.

Using GPU:

The network you are using is so small that the overhead of transferring data to the GPU is probably much higher than doing inference on CPU so I am not surprised here.

As a general question, I was also wondering about why you removed the asynchronous MCTS version - is there simply no benefit because CPU power can also be used to parallelize over different MCTS tree instead of within the same tree?

I removed it for the sake of simplicity as I was refactoring the code and realized that in most applications, batching inference requests across game simulations is enough and does not lead to the exploration bias induced by the use of a virtual loss. I may reintroduce it at some point though as using it in combination with parallel simulation workers may lead to significant performance improvements in some applications (by reducing the numbers of MCTS trees that are to be kept in memory simultaneously or enabling even bigger batches).

More general advice

My more general advice for you is to start by running a lot of small profiling experiments: how much does it cost to simulate a step in my environment? How much does it cost to evaluate a batch using the CPU/GPU. What is the ideal batch size for my config?

Once you've done this, you can get a back-of-the-envelope estimate for the ideal performances you should expect from the full training loop, assuming no GC/multitasking overhead. Then, you can compare this number to the actual performances you are currently getting and only then work on reducing the gap.

Thank you for the detailed answer, this is much appreciated!

One way to figure out whether or not this is the case is to profile network inference and vanilla MCTS separately and compare the numbers.

Alright, I will do some profiling to get a better feeling for that.

You may be referring to this figure, which I created for an old version and haven't regenerated since.

Yes, that's what I was looking for! Thanks, even though this is probably not relevant anymore then.

The only interest of using multiple processes is if you want to leverage multiple GPUs (as CUDA.jl used to have very poor multi-gpu support) or distribute the computtion on a cluster of machines.

Conceptually, I understand this and agree. However, I'm wondering if it could still be useful is this case since multi-threading seems to not fully utilize resources and batching inferences seems less valuable when not using a GPU. I will do more experiments to see which configuration of threads and processes is fastest.

My more general advice for you is to start by running a lot of small profiling experiments: how much does it cost to simulate a step in my environment? How much does it cost to evaluate a batch using the CPU/GPU. What is the ideal batch size for my config?

Thanks, I will do that!

Hi,

Interesting topic.

I have been doing training purely on CPU machines. No changes to the AlphaZero code with the exception of use_gpu set to false in parameters file. My observations:

I have been able to achieve significantly (~ order of magnitude) shorter times when using multiple workers (distributed) and threads than only multiple threads. I mean, I have seen no point in using only multiple threads - "Self play" was very slow in such a case.
Until Iteration 9 I was able to train on a machine with 192GB RAM (24 OCPUs) with average calculation time per 1 iteration of ~15.5h then I had to switch to a machine with more RAM.
On 64 OCPUs and 384GB RAM calculation time for Iteration 10 and 11 increased to ~22h. I have seen slightly shorter time (~10% shorter) when using all 64 online CPUs than only on 32 Physical Cores. The highest RAM utilization I saw during Iteration 11 was about 320GB.
I have seen full utilization of all cores only at the first part of the first step ("Self play"). For later stages in case of 24 OCPU's the observed utilization was about ~20% and about ~8% in case of 64 OCPUs.

I found the following question asked by johannes-fischer particularly interesting:

Furthermore, it is 8 threads that are working at full load (the ones from julia -p 20 call, not 20, which would be the number of processes). So where does that number 8 come from?

Would it be possible to receive additional information on this topic please?

Best regards!

Edit: some typos

It is interesting and surprising that using several processes is so much faster than using several threads. A possible advantage of processes over threads is that they each have their own GC and so each process can collect its garbage without stopping the world (this should be improved in the future). But could this explain an order of magnitude difference? I feel like I am missing something here and I would like to understand this better.

Also, @idevcde: this sounds like you observed a significant slowdown during training. Is it possible that the system was simply using too much memory and ended up spending too much time fetching pages from the swap? Did you look at the ratio of time spent in GC in the generated report?

More generally, I must admit that AlphaZero.jl could be more efficient in the way it handles memory (using async MCTS to diminish the number of MCTS trees to maintain without diminishing batch size, MCTS implementation without state sharing, storing samples on disk...). That being said, I've had several reports of increasing memory consumption during training to an extent that I find surprising. I do not see how AlphaZero.jl could be leaking memory (as there is very little shared state between iterations) and so I would like more data on this. May the GC be at fault once again?

What also confused me is that even even when calling single-threaded julia -p 64, htop shows multiple threads belonging to the main process during benchmarking (where AlphaZero does not use multi-processing). This is not problematic, I'm just trying to understand what's happening. I don't see how Util.mapreduce spawns multiple threads since Threads.nthreads() should be 1. Furthermore, it is 8 threads that are working at full load (the ones from julia -p 20 call, not 20, which would be the number of processes). So where does that number 8 come from?

Could it be BLAS spawning 8 threads to perform the linear algebra during inference? The number 8 also appears here: https://github.com/JuliaLang/julia/issues/33409.

Thank you.

It is interesting and surprising that using several processes is so much faster than using several threads.

I have to admit that I am surprised as well. I was expecting multithreading being faster.

But could this explain an order of magnitude difference?

I just checked the notes and unfortunately it seems that I do not have any notes on those stopped pure multithreading trainings. I think that the fist step ("Self play") was about 5-10 times slower vs Distributed training with the number 10 being closer to my judgement.

I feel like I am missing something here and I would like to understand this better.

I do training using the following code:

using Distributed addprocs(64) @everywhere using AlphaZero @everywhere experiment = Examples.experiments["connect-four"] session = Session(experiment, dir="/home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/sessions/connect-four") resume!(session)

Julia 1.6.1. and Julia 1.7 beta3, AlphaZero, MKL and LinearAlgebra as the only additional Julia packages installed. No changes to the code, with exception of gpu set to false. All successful trainings with libopenblas64_.so. (or at last for 1.7 beta3 as I was not checking it for 1.6.1 so no MKL, however, my guess is that it was the same).

Also, @idevcde: this sounds like you observed a significant slowdown during training.

I guess you are referring to multithread trainings. Because as for Distributed, I have to admit that I am pretty happy with the times on CPU only machine, especially for the first step (Self play) which took 3:48:29 at Iteration 11.

Is it possible that the system was simply using too much memory and ended up spending too much time fetching pages from the swap? Did you look at the ratio of time spent in GC in the generated report?

Unfortunately I did not. I will try to further investigate the topic of swap/GC and will also try to consult with some of my colleagues. Should I be able to provide additional information I will do it here.

I do not see how AlphaZero.jl could be leaking memory [...]

According to the best of my knowledge there are no memory leaks. However, please be advised that I am not an ultra experienced developer.

Could it be BLAS spawning 8 threads to perform the linear algebra during inference? The number 8 also appears here: JuliaLang/julia#33409.

Thank you. As for the BLAS, somehow, the previous link provided by you seemed not to work and it was not readable upfront. I will try to investigate this topic further when time permits.

As for the BLAS, I have one additional question. I am also wondering if the use of MKL library might bring any benefit?

Iteration 1-9 I trained on Julia 1.6.1. with libopenblas64_.so. Iteration 10 and 11 on Julia 1.7 beta 3 which I understand brings native support for libmkl_rt.so when using MKL package. In general on simple matrix multiplication I saw about 30% decrease of calculation time on one of the machines (I think it was Intel Xeon Gold 6128 CPU @ 3.40GHz) when using libmklrt.so vs libopenblas64.so. However, when trying to launch AZ training with MKL I received the following error:

OMP: Error #34: System unable to allocate necessary resources for OMP thread: OMP: System error #11: Resource temporarily unavailable OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.

signal (6): Aborted in expression starting at /home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/mytraining1.jl:84 gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line) abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line) kmp_abort_process at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) kmp_fatal at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) kmp_create_worker at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) kmp_allocate_thread at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) kmp_allocate_team at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) kmp_fork_call at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) __kmpc_fork_call at /home/xxxxxx/.julia/artifacts/947793e42b663bacd09f00d96aa96a47095f3b1c/lib/libiomp5.so (unknown line) omp_simple_3d at /home/xxxxxx/.julia/artifacts/a8e009985328801a84c9af6610f94f77a7c12852/lib/libmkl_intel_thread.so.1 (unknown line) gemm_omp_driver_v2 at /home/xxxxxx/.julia/artifacts/a8e009985328801a84c9af6610f94f77a7c12852/lib/libmkl_intel_thread.so.1 (unknown line) mkl_blas_sgemm at /home/xxxxxx/.julia/artifacts/a8e009985328801a84c9af6610f94f77a7c12852/lib/libmkl_intel_thread.so.1 (unknown line) sgemm at /home/xxxxxx/.julia/artifacts/a8e009985328801a84c9af6610f94f77a7c12852/lib/libmkl_intel_ilp64.so.1 (unknown line) sgemm at /home/xxxxxx/.julia/artifacts/a8e009985328801a84c9af6610f94f77a7c12852/lib/libmkl_rt.so (unknown line) gemm! at /home/xxxxxx/.julia/packages/NNlib/CSWJa/src/gemm.jl:51 [inlined] macro expansion at /home/xxxxxx/.julia/packages/NNlib/CSWJa/src/impl/conv_im2col.jl:58 [inlined]

741#threadsfor_fun at ./threadingconstructs.jl:85

unknown function (ip: 0x7fb7e417cf05) _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined] jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427 jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined] jl_f__call_latest at /buildworker/worker/package_linux64/build/src/builtins.c:757

invokelatest#2 at ./essentials.jl:716 [inlined]

invokelatest at ./essentials.jl:714 [inlined] macro expansion at ./threadingconstructs.jl:90 [inlined]

conv_im2col!#390 at /home/xxxxxx/.julia/packages/NNlib/CSWJa/src/impl/conv_im2col.jl:49

conv_im2col! at /home/xxxxxx/.julia/packages/NNlib/CSWJa/src/impl/conv_im2col.jl:30 [inlined]

155 at ./threadingconstructs.jl:178

unknown function (ip: 0x7fb7e417b562) _jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2245 [inlined] jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2427 jl_apply at /buildworker/worker/package_linux64/build/src/julia.h:1787 [inlined] start_task at /buildworker/worker/package_linux64/build/src/task.c:878 Allocations: 221650728 (Pool: 221595669; Big: 55059); GC: 83 /var/spool/torque/mom_priv/jobs/"jobnumber.server": line 19: 22711 Aborted (core dumped) julia -t 64 mytraining1.jl

I did the MKL test without using Distributed. AlphaZero, MKL as the only additional Julia packages installed. The code as follow:

using MKL using LinearAlgebra BLAS.get_config() using AlphaZero experiment = Examples.experiments["connect-four"] session = Session(experiment, dir="/home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/sessions/connect-four") resume!(session)

Do you have any opinion? And additional advice or comment on this topic/s?

Best regards!

Edit: removed "and LinearAlgebra". Was: "AlphaZero, MKL and LinearAlgebra as the only additional Julia packages installed." Is: "AlphaZero, MKL as the only additional Julia packages installed."

I have been able to achieve significantly (~ order of magnitude) shorter times when using multiple workers (distributed) and threads than only multiple threads.

@idevcde thanks for your insights, could you provide some more information on how many workers and threads you were using depending on the number of available CPUs to achieve best results?

Did you look at the ratio of time spent in GC in the generated report?

@jonathan-laurent I had a look at some of the performance plots and it seems that GC in multi-processing is very small, but for only multi-threading is at around 40% of self-play (I didn't use both at the same time yet). In one multi-processing run GC jumped from 5% to 30% at iteration 28, though. I don't know, if swapping became necessary at this point.

@johannes-fischer Thanks! This seems to indicate that a lot of the overhead we're seeing here comes from the GC not performing great with multiple heavily allocating threads.

I am not sure the performance drop you are seeing has anything to do with swapping as I would expect swapping to slow down the program uniformly, but it is not the first weird GC behavior I have been observing.

@johannes-fischer

thanks for your insights, could you provide some more information on how many workers and threads you were using depending on the number of available CPUs to achieve best results?

I am not sure if those are the absolute best possible results. Please be advised that I am new to this topic. However, I have tried more than a few different combinations. I am providing additional info.

My observations are related to 11 iterations. In general I have used mostly two machines (please see below) with default settings of AlphaZero v0.5.1. (the only change being gpu set to false in the parameters file). I used two versions of Julia (1.6.1 and 1.7.0-beta3.0). It was run on Ubuntu 20.04.3 LTS (Focal Fossa).
The conclusions are based on free observations, which means I have not done any statistical calculations and that there might be some slight additional deviations to the ones I listed above which I am not covering in detail. For example, I might have used for any particular training a gpu set to true on a CPU only machine or a slightly different version of Ubuntu or a very slightly different version of AlphaZero. However, this would be rather a very rare situation.
One machine I used is Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz with 192GB RAM (this is 6 physical cores per socket with 2 sockets which means 12 physical and 24 logical cores). The other one is Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz with 384GB RAM (this is 16 physical cores per socket with 2 sockets which means 32 physical and 64 logical cores). All the findings are related to those two machines.
I have also tried Intel(R) Core(TM) i9-10920X CPU @ 3.50GHz with 32GB RAM and Intel(R) Xeon(R) E-2176G CPU @ 3.70GHz with 64GB RAM. However, due to the limited RAM footprint my trainings crashed very shortly or quite shortly after the start.

I think that I achieved the best results in terms of computational time per iteration when using maximum number of available logical threads and maximum number of available logical distributed workers. I would say that using distributed workers run only on physical cores and at the same time using threads run only on physical cores is increasing computational time by more or less 10%. In addition I would say that using only distributed workers (no matter if they are run only on physical or on all logical cores) and not using threads at all is increasing computational time by additional 10%. As I wrote in my previous post, I would say that using only all available logical threads and not using distributed workers at all, makes the first part of the training (Self play) very slow, like 10 times slower. Moreover, as I wrote earlier, I saw full utilization of CPU (observed in "top -H" on Linux) only during the first part of the training (Self play). During later stages, I saw only about 8 workers mostly fully utilized and additional few (like additional 2 to 4) utilized at 5 to 25% from time to time). As I wrote earlier, I was able to run iteration 1 to 9 (average calculation time per 1 iteration was about 15.5h) on the first machine with 192GB RAM. Than I had to use a machine with more RAM as 192GB was insufficient. During iteration 11 on the second machine with 384GB RAM, the highest RAM utilization I spotted was about 320GB. Average calculation time for iteration 10 and 11 was about 22h (maybe slightly less, as in those cases I was not able to get very precise timings as calculations for next iteration started before I sent a break commands). At iteration 11 [julia -t 64 mytraining1.jl / using Distributed and addprocs(64)] I see the following results:

Starting iteration 11

Starting self-play

Progress: 100%|█████████████████████████████████████| Time: 3:48:29                                 

Generating 12 samples per second on average                                 
Average exploration depth: 9.4                                  
MCTS memory footprint per worker: 6.02MB                                    
Experience buffer size: 800,000 (346,387 distinct boards)

Starting learning

Optimizing the loss                                 

     Loss       Lv       Lp     Lreg     Linv       Hp    Hpnet                                 
   0.3465   0.0973   0.1467   0.1022   0.0001   0.6879   0.8369                                 
   0.3468   0.1010   0.1451   0.1004   0.0002   0.6879   0.8392                                 

Launching a checkpoint evaluation                                   

  Progress: 100%|███████████████████████████████████| Time: 6:31:51                                 

  Average reward: +0.41 (58% won, 26% draw, 16% lost, network replaced), redundancy: 51.7%

Running benchmark: AlphaZero against MCTS (1000 rollouts)

Progress: 100%|█████████████████████████████████████| Time: 3:37:14                                 

Average reward: +0.99 (100% won, 0% draw, 0% lost), redundancy: 29.0%

Running benchmark: Network Only against MCTS (1000 rollouts)

Progress: 100%|█████████████████████████████████████| Time: 0:01:14                                 

Average reward: +0.88 (93% won, 1% draw, 5% lost), redundancy: 26.4%

@jonathan-laurent, @johannes-fischer Should you have any comments or any advise please let me know. Also do you think that trying to further investigate MKL.jl package that I mentioned earlier might be a good idea? I have also read about Octavian.jl. Does it make sense to investigate those kind of packages at all?

I think that I achieved the best results in terms of computational time per iteration when using maximum number of available logical threads and maximum number of available logical distributed workers.

@idevcde Thanks! So to make that explicit, on the machine with 64 logical cores you used julia with 64 worker processes and each of them used 64 threads?

@idevcde Thanks for those details! I do not think it would be very useful to experiment with MKL.jl or Octavian.jl as I am pretty sure the performance problems do not come from whatever linear algebra library is used.

The one thing you can try is comment out the following line in the src/AlphaZero.jl:

LinearAlgebra.BLAS.set_num_threads(1)

Apart from this, the gap in performance probably results from a combination of GC / multitasking overhead.

Apart from this, the gap in performance probably results from a combination of GC / multitasking overhead.

I think the inference might also play a major role in this. Currently, I'm running a training with only multi-threading (32 threads) and during running the MCTS only benchmark with RolloutOracle the CPU utilization is close to 3200%. But during self-play (and presumably also during AlphaZero benchmark) the CPU utilization drops to the previously reported levels.

@johannes-fischer So the question here is: if inference is the bottleneck, how can we make sure that a high number of cores is dedicated to inference? Here, there are two things you can do:

Remove the inference server altogether and have each worker do its own inference on a single core.
Keep the inference server, make big batches and make sure that several cores are used to evaluate the batches. The current BLAS config may force the inference server to use one core only, leading to slowdowns. Also, on second thoughts, investigating MKL.jl and Octavian.jl as suggested by @idevcde may actually not be a bad idea if it enables leveraging more cores (although I am pretty sure BLAS could be configured properly here).

@johannes-fischer

Thanks! So to make that explicit, on the machine with 64 logical cores you used julia with 64 worker processes and each of them used 64 threads?

Thanks! This I can not confirm. I wrote above that I did various experiments and I presented most of my findings above. As you can see results of iteration 11 presented above, for this particular run, I started Julia with 64 threads and than added 64 distributed worker processes. I do not understand in full the code inside AlphaZero.jl and I do not understand in detail what is going on under the hood.

@jonathan-laurent

The one thing you can try is comment out the following line in the src/AlphaZero.jl: LinearAlgebra.BLAS.set_num_threads(1)

Thanks! I will do the test ASAP, however, it may not be immediately.

The current BLAS config may force the inference server to use one core only, leading to slowdowns. Also, on second thoughts, investigating MKL.jl and Octavian.jl as suggested by @idevcde may actually not be a bad idea if it enables leveraging more cores (although I am pretty sure BLAS could be configured properly here).

Can I ask, out of the whole AlphaZero calculations, what percentage on average is linear algebra? Does it significantly differ among particular stages (self-play, learning, benchmark: AlphaZero against MCTS, benchmark: Network Only against MCTS)?

If you look at the chart located at https://github.com/JuliaLinearAlgebra/Octavian.jl [https://raw.githubusercontent.com/JuliaLinearAlgebra/Octavian.jl/master/docs/src/assets/bench10980xe.svg] which side is more relevant to AlphaZero; left or right or maybe AlphaZero is out of this chart?

bench10980xe

Edit: attached bench10980xe chart

I started Julia with 64 threads and than added 64 distributed worker processes.

Thanks, afaik in this case only the main process has 64 threads whereas the workers are only single threaded (unless you use addprocs() with exeflags argument to specify the number of threads or start julia -p 64 -t 64). Since threads cannot be added at runtime, also AlphaZero code won't be able to change that.

@johannes-fischer Thanks. This is a new information for me. I will check your suggestion when ... time permits.

@jonathan-laurent

Did you look at the ratio of time spent in GC in the generated report?

Did you have those ratios on your mind (report.json)? Please be advised that there were sometimes slightly different settings as I indicated above. All computations for iteration 1 to 9 on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz (2 sockets) with 192GB RAM, Julia 1.6.1., AlphaZero v0.5.0, [ILP64] libopenblas64.so. All computations for iteration 10 and 11 on Intel(R) Xeon(R) Platinum 8153 CPU @ 2.00GHz with 384GB RAM (2 sockets), Julia Version 1.7.0-beta3.0 (2021-07-07), AlphaZero v0.5.1 or v0.5.0 (to be confirmed), [ILP64] libopenblas64.so.

Iteration | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- Duration (h m s) | 14:18:43 | 12:37:41 | 14:58:54 | 15:56:05 | 16:18:05 | 17:18:30 | 15:37:34 | 15:27:55 | 15:38:36 | ~22:38:38 (this includes Iteration 11 (next iteration) self play up to 1%) | ~21:55:15 (this includes Iteration 12 (next iteration) self play started and being at 0%) | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" | "perfs_self_play" "gc_time"/"time" | 0.000027 | 0.000071 | 0.00026559 | 0.000467688 | 0.001762176 | 0.003331976 | 0.000254448 | 0.000945091 | 0.000749243 | 0.001182 | 0.005342819 | | | | | | | | | | | | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" | "perfs_memory_analysis" "gc_time" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | | | | | | | | | | | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" | "perfs_learning" "gc_time"/"time" | 0.015199525 | 0.011169273 | 0.009918299 | 0.009652197 | 0.007352897 | 0.021725253 | 0.006199996 | 0.009827222 | 0.009977426 | 0.008390326 | 0.011541601

arena benchmark_lost_games benchmark_reward benchmark_won_games entropies exploration_depth loss nsamples

Loss iteration 1 vs 11

Performance iteration 1 vs 11

Summary iteration 1 vs 11

Enclosed: report.json files from /sessions/connect-four/iterations/ iterations.zip

Edit:

Corrected "gc_time"/"time" values for iteration 1 and 2 from scientific to decimal format.
Removed: "As for now, I am able to provide data for iteration 1 to 9." Added data for iteration 10 and 11.
Added charts. Added report.json files (iterations.zip).

@johannes-fischer

[...] in this case only the main process has 64 threads whereas the workers are only single threaded [...]

Is it bad?

I did some additional experiments on iteration 12. Hope my notes are correct.

The one thing you can try is comment out the following line in the src/AlphaZero.jl: LinearAlgebra.BLAS.set_num_threads(1)

I commented out this line.

When using libopenblas64_.so on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz and

starting julia 1.7.0-beta3.0 with 12 threads (no distributed processes) and setting BLAS.set_num_threads(12) I see %Cpu(s) at about 31.8 us; I do not have reliable notes for %CPU; RAM utilization is at about 9GB, the Self play phase is very slow;
starting julia 1.7.0-beta3.0 with 24 threads (no distributed processes) and setting BLAS.set_num_threads(24) I see %Cpu(s) at about 54.7 us; %CPU is about 2315%; RAM utilization is at about 10GB and slowly growing, the Self play phase is maybe slightly faster however still very slow.

Keep the inference server, make big batches and make sure that several cores are used to evaluate the batches. The current BLAS config may force the inference server to use one core only, leading to slowdowns. Also, on second thoughts, investigating MKL.jl and Octavian.jl as suggested by @idevcde may actually not be a bad idea if it enables leveraging more cores (although I am pretty sure BLAS could be configured properly here).

When using libmkl_rt.so on Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz and

starting julia 1.7.0-beta3.0 with 12 threads (I understand that it is the maximum value as according to my current knowledge MKL supports only physical cores), (no distributed processes), I see %Cpu(s) at about 54.8 us, %CPU is about 2144%; RAM utilization is at about 5GB, the Self play phase is very slow, after about 5/6 hours it is at 3%.
when I start julia 1.7.0-beta3.0 with 12 threads and later add 12 processes, it seems that the MKL workers are starting correctly, however when the Self play starts I do see:

From worker 8: OMP: Error #34: System unable to allocate necessary resources for OMP thread:
From worker 8: OMP: System error #11: Resource temporarily unavailable
From worker 8: OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.

and a lot of additional info for a few if not for all (probably for all) other workers.

it should be noted that similar situation is when I start julia 1.7.0-beta3.0 with julia -p 12 -t 12.

Below please find BLASBenchmarksCPU.jl results [rb = runbench(sizes = logspace(10, 10_000, 200)); plot(rb, displayplot = true);]. Julia started with -t 24 and -t 64 respectively to the following CPUs.

Gold_6128_CPU_3.40GHz_192GB

gemm_Float64_10_10000_skylake-avx512_AVX512__multithreaded_10to10000 gemm_Float64_10_1000_skylake-avx512_AVX512__multithreaded_10to1000 gemm_Float64_10_200_skylake-avx512_AVX512__multithreaded_10to200 gemm_Float64_1_75_skylake-avx512_AVX512__multithreaded_1to75

Platinum_8153_CPU_2.00GHz_384GB

@johannes-fischer

I'm running it on a machine with 128 CPUS [...] The machine has multiple GeForce RTX 2080 Ti with 10GB memory.

What exactly is your machine architecture / specification if I may ask? Is this all ... on one board? Would you consider providing additional information on training connect-four with default settings on CPU/GPUs?

@jonathan-laurent Please be informed that I just spotted that AlphaZero was automatically updated to v0.5.2, so my parameters settings (use_gpu=false) and probably commenting out "LinearAlgebra.BLAS.set_num_threads(1)" might not be fully reflected (correctly) in some of my tests for Iteration 12 about which I wrote recently (https://github.com/jonathan-laurent/AlphaZero.jl/issues/71#issuecomment-913216094), in particular, this may be most relevant to the use of MKL in the process per core execution model.

I will have to reevaluate and do some additional tests to confirm it for which I will continue using a perturbed parameter approach then I hope to switch to a perturbed-tendency and a stochastic back-scatter approaches also in order to cross-validate the probability distribution representing the uncertainty state as well as to get a better understanding of differences in the patterns of observed variables so the fastest growing characteristic patterns of differences might be correctly assessed with the use of appropriate uncertainty magnitudes and the multivariate extension of conjugate Bayesian updating of a Gaussian prior distribution.

It is just a hobby project, however, please be informed that I am interested in better understanding your instructions level parallelism in order to take maximum advantage of all of the FLOPS on the machines. Should you have any additional comments or any advise please let me know.

@johannes-fischer [...] in this case only the main process has 64 threads whereas the workers are only single threaded [...] Is it bad?

To be honest, I don't know. I also tried this configuration and to me it seemed as if it allocated a lot of RAM compared to multithreading on 1-2 workers. So currently I'm running experiments with 2 workers and 32 threads on each process.

What exactly is your machine architecture / specification if I may ask?

Yes, it is a server with a AMD EPYC 7702P 64-Core Processor. I am not working on the connect-four problem for myself. I could probably still start a connect-four training with default settings, but the server is not only used by me. What configuration exactly are you interested in?

In general, I have to admit that I am surprised that I was able to run iterations on a CPU only machine and to achieve the level described above at iteration 3. If one look at some videos and posts more or less, the picture drawn (or at least it was my understanding) is that it is almost intractable to use it on CPUs.

@johannes-fischer

To be honest, I don't know. I also tried this configuration and to me it seemed as if it allocated a lot of RAM compared to multithreading on 1-2 workers.

I understand, it is with Hyper Threading thus is the elevated level of RAM allocation. This configuration is one of the best I have found. I also found some additional notes on iteration 7, 8 and it seems that I run it with -p 12. The times seem to be good and despite I do not have precise notes, I would expect lower RAM utilization. I guess that HT on and off might be very dependent on a particular CPU.

So currently I'm running experiments with 2 workers and 32 threads on each process.

What exactly are the commands and the code you are using? What are the results?

What configuration exactly are you interested in?

Currently, I am interested most to "achieve a very high CPU utilization".

I was advised by a person with deep knowledge in this area that in case of MKL on Platinum 8153 and parallelized code (as I understand it) best would be to use OMP_NUM_THREADS=1 julia -p 32 -t 1 and to set BLAS.set_num_threads(1), and in case of single threaded Julia code calling expensive BLAS operations, best would be to set BLAS.set_num_threads(32). I did some tests with OMP_NUM_THREADS=1 julia -p 32 -t 1 and it seems to be that for iteration 12 I was able to achieve very similar timing for the first part of the training as during iteration 11 which I started with 64 threads and than added 64 distributed worker processes and used openblas with default AlphaZero settings. However, in case of iteration 12, the next part was done only on 1 thread whereas during iteration 11 I saw about 8 of them working. I broke this training and I do not have any details for the next phases. I also did a test with OMP_NUM_THREADS=1 julia -t 12, added 12 distributed worker processes and I set BLAS.set_num_threads(12) on 24 OCPU's machine. The times of the first part were comparable to achieved on Gold 6128, however, I would prefer not to provide details as I used a different 24 OCPU's machine than Gold 6128. What I think is interesting is that the learning phase (the second phase) in this case was done only on about two threads which were not fully utilized with %Cpu(s) at 4.3 us. As for now I do not have any details for the next phases.

I will be busy, however, this is an interesting topic for me and I would be monitoring it.

I gave this a little more thought and I suspect the most decisive way to move this debate forward is to have AlphaZero.jl produce some profiling information on how workers communicate with the inference server.

I opened an issue with details here: https://github.com/jonathan-laurent/AlphaZero.jl/issues/76

If anyone wants to have a go at it, this would constitute a major contribution to AlphaZero.jl.

@jonathan-laurent Thank you for such an interesting package as AlphaZero.jl and for the suggestion as I understand of a new profiling package.

Would you please consider providing more information on your current understanding of the problem of "workers communicating with the inference server" as well as on the program architecture?
Would it also be possible to receive some information on my previous questions? Particularly, I have on my mind the questions related to:
- the approximate percentage of linear algebra calculations wrt the total calculations done during one iteration; and
- approximate differences between particular stages (self-play, learning, benchmark: AlphaZero against MCTS, benchmark: Network Only against MCTS) if they occur at all; as well as
- to the area of BLASBenchmark.jl charts presented above that is targeted by the connect-four?
Also would it be possible to receive information on the suggested way of running AlphaZero.jl on multiple servers / boards?

May I ask if there is any suggested way of running it with the use of MPI as a transport?
Would you have any suggestion on the use of MPIClustersManagers.jl in particular?
Moreover, would you have any suggestion on running AlphaZero.jl with the use of Gasp.jl? What is your position on this or a similar technology at all?

I have to admit that I did additional test with MKL. I was able to finish iteration 12 within 9h 24m 34s.

Why is the combined time of self-play, learning, benchmark: AlphaZero against MCTS, benchmark: Network Only against MCTS only 05:22:04? What is the program doing for the remaining 04:02:30? I know that you are lacking appropriate software, however, what would be your approximate judgment on it? Starting iteration 12

Starting self-play

Progress: 100%|█████████████████████████████████████| Time: 1:12:31

Generating 36 samples per second on average
Average exploration depth: 8.7
MCTS memory footprint per worker: 6.13MB
Experience buffer size: 840,000 (367,796 distinct boards)

Starting learning

Optimizing the loss
```
 Loss       Lv       Lp     Lreg     Linv       Hp    Hpnet                                 
```
0.3587 0.1100 0.1480 0.1004 0.0002 0.7073 0.8566
0.3331 0.0911 0.1426 0.0991 0.0002 0.7073 0.8592

Launching a checkpoint evaluation

Progress: 100%|███████████████████████████████████| Time: 2:35:56

Average reward: -0.13 (9% won, 70% draw, 22% lost), redundancy: 54.9%

Running benchmark: AlphaZero against MCTS (1000 rollouts)

Progress: 100%|█████████████████████████████████████| Time: 1:32:55

Average reward: +1.00 (100% won, 0% draw, 0% lost), redundancy: 28.1%

Running benchmark: Network Only against MCTS (1000 rollouts)

Progress: 100%|█████████████████████████████████████| Time: 0:00:42
In case of starting the calculations with: OMP_NUM_THREADS=1 julia -p "number of physical cores" -t 1 mytraining1.jl and running it with: @everywhere begin using LinearAlgebra, MKL BLAS.set_num_threads(1) using AlphaZero experiment = Examples.experiments["connect-four"] end session = Session(experiment, dir="/home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/sessions/connect-four") resume!(session) What do you think about changing the number of MKL threads from 1 to the number of physical cores after the "Self play" stage? Do you think that it may be possible and could bring positive results? If so, where would be the best place to enter the appropriate line: BLAS.set_num_threads("number of physical cores")? If you think that this might not be possible, do you think that saving the "Self play" stage to "in memory file" and starting julia with different settings might bring any positive result?

I understand that it is a lot of questions, however, I would appreciate even a very short answers.

@idevcde Unfortunately, I do not currently have the time to engage more deeply in this conversation and answer all your questions. Here are a few thoughts though:

As I suggested, adding better profiling support to AlphaZero.jl is the key to answer many of your questions. Without profiling, we are mostly operating on the basis of conjectures. You should feel welcome to contribute here if you have the time and motivation. Otherwise, I'll probably do it myself in about a month or so when I am done with other deadlines.
The AlphaZero.jl codebase is meant to be small and accessible and so you should not hesitate to dig into it yourself and make experiments changing stuff. If you have trouble doing so or find things unclear, consider leaving some feedback.

I am sorry I cannot be more helpful right now and I hope you can still manage to use AlphaZero.jl productively for your purposes.

@jonathan-laurent I understand. Thank you. I will take it into a consideration.

For completeness, of my "simple test" I am enclosing results of iteration 12 run on Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz (2 sockets) with OMP_NUM_THREADS=1 julia -p 56 -t 1 mytraining1.jl and the settings provided at https://github.com/jonathan-laurent/AlphaZero.jl/issues/71#issuecomment-917246949 (please see above point 4) and BLASBenchmarksCPU.jl results run with julia -t 112.

iteration_12.zip gemm_Float64_1_1000_icelake-server_AVX512__multithreaded gemm_Float64_10_200_icelake-server_AVX512__multithreaded gemm_Float64_1_75_icelake-server_AVX512__multithreaded

@johannes-fischer

The machine has multiple GeForce RTX 2080 Ti with 10GB memory.

If I may ask, how many GPUs? What are the clocks? What is your GPUs utilization?

@johannes-fischer @idevcde I pushed a first version of the profiling functionality that I mentioned previously. It may require some mall modifications but I think you may find it useful in understanding the performances of AlphaZero.jl on your applications of choice.

I might not be able to test this new functionality within the next few days, however, this is to my interest. Thank you.

Awesome, thank you for the efforts! I will definitely try it out, but also not within the next two weeks.

@idevcde I have not forgotten your question, but I didn't have the chance to try out the GPU utilization again yet. The machine has 5 GPUs but I'm not the only one using it. How can I determine the GPU clock speed?

@johannes-fischer I should be able to take a look at the first version of the profiling functionality definitely sooner than two weeks, however, as for now, I prefer not to say exactly when. It seems that as suggested by @jonathan-laurent some additional insights might be useful.

I usually use "nvidia-smi" with "-l 1" flags (nvidia-smi -l 1) which should give you a refresh interval of 1 second and "nvidia-smi -h" should bring a full list of customization flags. You may also want to try "watch -n 1 nvidia-smi" which refreshes the screen and "nvidia-smi > nvidia-output.txt" to save the output to a file. I believe there are many other ways.

One additional thing that you may find interesting (I do not know, I assume you may) in relation to this thread, particularly when sharing a machine, is to run Julia with JULIA_EXCLUSIVE=1. This will assure Julia's thread policy is consistent and threads are affinitized.

iteraition_13_15.zip

Enclosed please find reports for iteration 13 to 15 and detailed data for iteration 15 (training time ~11h) below. I understand number 15 is exhausting planned iterations in default version of AlphaZero.jl. For iteration 15 I collected profiling data. It was a different software than proposed by @jonathan-laurent, however, I belive the data is correct. Overall I noticed physical core utilization at a level of 11.1% (6.193 out of 56 cores), with more or less full physical core utilization for the first part of the training. I would like once again kindly point your attention to my questions included in https://github.com/jonathan-laurent/AlphaZero.jl/issues/71#issuecomment-917246949:

"What do you think about changing the number of MKL threads from 1 to the number of physical cores after the "Self play" stage? Do you think that it may be possible and could bring positive results? [...] If you think that this might not be possible, do you think that saving the "Self play" stage to "in memory file" and starting julia with different settings might bring any positive result?"

I am currently trying the new profiling functionality proposed at https://github.com/jonathan-laurent/AlphaZero.jl/issues/71#issuecomment-925995176. I still have to understand it better. Any additional instructions on its use, particularly with a focus on CPUs and non GUI environments would be appreciated.

  #################################
  Run command:
  #################################
  OMP_NUM_THREADS=1 julia -p 56 /home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/mytraining1.jl

  #################################
  PROFILING
  #################################
  Elapsed Time: 39372.524s
      IPC: 3.100
      SP GFLOPS: 296.854
      DP GFLOPS: 0.001
      x87 GFLOPS: 0.000
      Average CPU Frequency: 2.913 GHz 
  Logical Core Utilization: 5.5% (6.208 out of 112)
  Physical Core Utilization: 11.1% (6.193 out of 56)

  Microarchitecture Usage: 60.9% of Pipeline Slots
      Retiring: 60.9% of Pipeline Slots
      Front-End Bound: 5.6% of Pipeline Slots
      Bad Speculation: 1.4% of Pipeline Slots
      Back-End Bound: 32.1% of Pipeline Slots

  Memory Bound: 14.5% of Pipeline Slots
      L1 Bound: 3.1% of Clockticks
      L2 Bound: 0.7% of Clockticks
      L3 Bound: 0.9% of Clockticks
      L3 Latency: 1.4% of Clockticks
      DRAM Bound: 4.3% of Clockticks
      Memory Bandwidth: 9.1% of Clockticks
      Memory Latency: 15.0% of Clockticks
      Local DRAM: 7.7% of Clockticks
      Remote DRAM: 0.1% of Clockticks
      Remote Cache: 0.0% of Clockticks
      Store Bound: 1.7% of Clockticks
      Core Bound: 17.6% of Pipeline Slots

  Memory Bound: 14.5% of Pipeline Slots
      Cache Bound: 4.7% of Clockticks
      DRAM Bound: 4.3% of Clockticks
      NUMA: % of Remote Accesses: 1.3%

  Vectorization: 94.9% of Packed FP Operations
      Instruction Mix
      SP FLOPs: 35.5% of uOps
      Packed: 94.9% from SP FP
          128-bit: 0.0% from SP FP
           256-bit: 0.0% from SP FP
           512-bit: 94.9% from SP FP
           Scalar: 5.1% from SP FP
      DP FLOPs: 0.0% of uOps
      Packed: 0.0% from DP FP
           128-bit: 0.0% from DP FP
           256-bit: 0.0% from DP FP
           512-bit: 0.0% from DP FP
           Scalar: 100.0% from DP FP
      x87 FLOPs: 0.0% of uOps
      Non-FP: 64.5% of uOps
      FP Arith/Mem Rd Instr. Ratio: 0.842
      FP Arith/Mem Wr Instr. Ratio: 5.197

  Collection and Platform Info
      Collection start time: 23:11:43 30/09/2021 UTC
      Collection stop time: 10:07:55 01/10/2021 UTC
      Collector Type: Driverless Perf per-process counting

  CPU
          Name: Intel(R) Xeon(R) Processor code named Icelake
          Frequency: 2.594 GHz 
          Logical CPU Count: 112
          Cache Allocation Technology
              Level 2 capability: not detected
              Level 3 capability: available

    Operating System: Ubuntu 20.04.3 LTS
              Kernel: Linux 5.4.0-77-generic
        Architecture: x86-64

  #################################
  julia version
  #################################
  [ Info: 1.7.0-beta3.0
  ┌ Info: Threads.nthreads()
  └   Threads.nthreads() = 1
  ┌ Info: Sys.CPU_NAME
  └   Sys.CPU_NAME = "icelake-server"

  Julia Version 1.7.0-beta3.0
  Commit e76c9dad42 (2021-07-07 08:12 UTC)
  Platform Info:
    OS: Linux (x86_64-pc-linux-gnu)
    CPU: Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz
    WORD_SIZE: 64
    LIBM: libopenlibm
    LLVM: libLLVM-12.0.0 (ORCJIT, icelake-server)

  #################################
  JULIA CODE
  #################################
  @everywhere begin
      using LinearAlgebra
      using MKL
      BLAS.set_num_threads(1)
      using AlphaZero
      experiment = Examples.experiments["connect-four"]
  end

  session = Session(experiment, dir="/home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/sessions/connect-four")
  resume!(session)

  #################################
  BLAS
  #################################
  ┌ Info: BLAS.get_config()
  │   BLAS.get_config() =
  │    LinearAlgebra.BLAS.LBTConfig
  │    Libraries: 
  └    └ [ILP64] libmkl_rt.so
  ┌ Info: BLAS.get_num_threads()
  └   BLAS.get_num_threads() = 1
        Status `~/.julia/environments/v1.7/Project.toml`
    [8ed9eb0b] AlphaZero v0.5.2
  ┌ Info: Pkg.status()
  └   Pkg.status("AlphaZero") = nothing

  Loading environment from: /home/xxxxxx/data/JuliaCon21/AlphaZero.jl/mytrainings/sessions/connect-four

  [ Info: Using modified parameters
  Starting iteration 15

    Starting self-play

      Progress:   0%|                                     |  ETA: 21.60 days[K
      Progress:   0%|                                     |  ETA: 15.63 days[K
      Progress:   0%|                                     |  ETA: 12.13 days[K
      Progress:   0%|                                     |  ETA: 10.50 days[K
      Progress:   0%|                                     |  ETA: 8 days, 20:04:14[K
      Progress:   0%|                                     |  ETA: 7 days, 17:48:00[K
      Progress:   0%|                                     |  ETA: 6 days, 23:14:45[K
      Progress:   0%|▏                                    |  ETA: 6 days, 4:58:22[K
      Progress:   0%|▏                                    |  ETA: 5 days, 14:58:42[K
      Progress:   0%|▏                                    |  ETA: 5 days, 4:16:24[K
      Progress:   0%|▏                                    |  ETA: 4 days, 19:34:03[K
      Progress:   0%|▏                                    |  ETA: 4 days, 12:47:29[K
      Progress:   0%|▏                                    |  ETA: 4 days, 5:14:46[K
      Progress:   0%|▏                                    |  ETA: 3 days, 23:48:33[K
      Progress:   0%|▏                                    |  ETA: 3 days, 19:08:33[K
      Progress:   0%|▏                                    |  ETA: 3 days, 14:07:59[K
      Progress:   0%|▏                                    |  ETA: 3 days, 10:08:54[K
      Progress:   0%|▏                                    |  ETA: 3 days, 7:07:40[K
      Progress:   0%|▏                                    |  ETA: 3 days, 4:03:34[K
      Progress:   0%|▏                                    |  ETA: 3 days, 0:47:42[K
      Progress:   0%|▏                                    |  ETA: 2 days, 21:50:57[K
      Progress:   0%|▏                                    |  ETA: 2 days, 19:10:15[K
      Progress:   0%|▏                                    |  ETA: 2 days, 17:16:21[K
      Progress:   0%|▏                                    |  ETA: 2 days, 14:53:09[K
      Progress:   1%|▎                                    |  ETA: 2 days, 12:52:21[K
      Progress:   1%|▎                                    |  ETA: 2 days, 10:49:36[K
      Progress:   1%|▎                                    |  ETA: 2 days, 9:32:21[K
      Progress:   1%|▎                                    |  ETA: 2 days, 8:29:52[K
      Progress:   1%|▎                                    |  ETA: 2 days, 8:06:18[K
      Progress:   1%|▎                                    |  ETA: 2 days, 6:17:58[K
      Progress:   1%|▎                                    |  ETA: 2 days, 4:48:09[K
      Progress:   1%|▎                                    |  ETA: 2 days, 3:13:18[K
      Progress:   1%|▎                                    |  ETA: 2 days, 1:44:58[K
      Progress:   1%|▎                                    |  ETA: 2 days, 0:20:31[K
      Progress:   1%|▎                                    |  ETA: 1 days, 23:17:51[K
      Progress:   1%|▎                                    |  ETA: 1 days, 22:10:04[K
      Progress:   1%|▎                                    |  ETA: 1 days, 21:08:50[K
      Progress:   1%|▎                                    |  ETA: 1 days, 20:20:47[K
      Progress:   1%|▎                                    |  ETA: 1 days, 19:32:06[K
      Progress:   1%|▎                                    |  ETA: 1 days, 18:32:19[K
      Progress:   1%|▎                                    |  ETA: 1 days, 17:45:44[K
      Progress:   1%|▍                                    |  ETA: 1 days, 16:49:31[K
      Progress:   1%|▍                                    |  ETA: 1 days, 15:55:52[K
      Progress:   1%|▍                                    |  ETA: 1 days, 15:05:07[K
      Progress:   1%|▍                                    |  ETA: 1 days, 14:18:32[K
      Progress:   1%|▍                                    |  ETA: 1 days, 13:29:45[K
      Progress:   1%|▍                                    |  ETA: 1 days, 12:43:13[K
      Progress:   1%|▍                                    |  ETA: 1 days, 11:59:20[K
      Progress:   1%|▍                                    |  ETA: 1 days, 11:19:49[K
      Progress:   1%|▍                                    |  ETA: 1 days, 10:42:53[K
      Progress:   1%|▍                                    |  ETA: 1 days, 10:06:04[K
      Progress:   1%|▍                                    |  ETA: 1 days, 9:29:09[K
      Progress:   1%|▍                                    |  ETA: 1 days, 8:52:52[K
      Progress:   1%|▍                                    |  ETA: 1 days, 8:20:33[K
      Progress:   1%|▍                                    |  ETA: 1 days, 7:52:36[K
      Progress:   1%|▍                                    |  ETA: 1 days, 7:19:32[K
      Progress:   1%|▍                                    |  ETA: 1 days, 6:49:11[K
      Progress:   1%|▍                                    |  ETA: 1 days, 6:18:52[K
      Progress:   1%|▌                                    |  ETA: 1 days, 5:50:00[K
      Progress:   1%|▌                                    |  ETA: 1 days, 5:20:31[K
      Progress:   1%|▌                                    |  ETA: 1 days, 4:54:41[K
      Progress:   1%|▌                                    |  ETA: 1 days, 4:30:42[K
      Progress:   1%|▌                                    |  ETA: 1 days, 4:04:49[K
      Progress:   1%|▌                                    |  ETA: 1 days, 3:39:12[K
      Progress:   1%|▌                                    |  ETA: 1 days, 3:28:38[K
      Progress:   1%|▌                                    |  ETA: 1 days, 3:04:19[K
      Progress:   1%|▌                                    |  ETA: 1 days, 2:42:07[K
      Progress:   1%|▌                                    |  ETA: 1 days, 2:18:44[K
      Progress:   1%|▌                                    |  ETA: 1 days, 1:57:31[K
      Progress:   1%|▌                                    |  ETA: 1 days, 1:45:08[K
      Progress:   1%|▌                                    |  ETA: 1 days, 1:25:23[K
      Progress:   1%|▌                                    |  ETA: 1 days, 1:06:02[K
      Progress:   1%|▌                                    |  ETA: 1 days, 0:50:24[K
      Progress:   2%|▌                                    |  ETA: 1 days, 0:33:41[K
      Progress:   2%|▌                                    |  ETA: 1 days, 0:19:24[K
      Progress:   2%|▋                                    |  ETA: 1 days, 0:03:12[K
      Progress:   2%|▋                                    |  ETA: 23:44:48[K
      Progress:   2%|▋                                    |  ETA: 23:32:49[K
      Progress:   2%|▋                                    |  ETA: 23:24:05[K
      Progress:   2%|▋                                    |  ETA: 23:08:02[K
      Progress:   2%|▋                                    |  ETA: 22:56:53[K
      Progress:   2%|▋                                    |  ETA: 22:44:35[K
      Progress:   2%|▋                                    |  ETA: 22:28:43[K
      Progress:   2%|▋                                    |  ETA: 22:15:30[K
      Progress:   2%|▋                                    |  ETA: 22:06:00[K
      Progress:   2%|▋                                    |  ETA: 21:54:16[K
      Progress:   2%|▋                                    |  ETA: 21:39:54[K
      Progress:   2%|▋                                    |  ETA: 21:26:51[K
      Progress:   2%|▋                                    |  ETA: 21:13:02[K
      Progress:   2%|▋                                    |  ETA: 21:02:06[K
      Progress:   2%|▋                                    |  ETA: 20:53:49[K
      Progress:   2%|▊                                    |  ETA: 20:42:03[K
      Progress:   2%|▊                                    |  ETA: 20:31:50[K
      Progress:   2%|▊                                    |  ETA: 20:18:56[K
      Progress:   2%|▊                                    |  ETA: 20:09:22[K
      Progress:   2%|▊                                    |  ETA: 19:58:43[K
      Progress:   2%|▊                                    |  ETA: 19:50:02[K
      Progress:   2%|▊                                    |  ETA: 19:39:00[K
      Progress:   2%|▊                                    |  ETA: 19:27:56[K
      Progress:   2%|▊                                    |  ETA: 19:18:18[K
      Progress:   2%|▊                                    |  ETA: 19:07:53[K
      Progress:   2%|▊                                    |  ETA: 18:58:03[K
      Progress:   2%|▊                                    |  ETA: 18:49:18[K
      Progress:   2%|▊                                    |  ETA: 18:41:28[K
      Progress:   2%|▊                                    |  ETA: 18:22:19[K
      Progress:   2%|▊                                    |  ETA: 18:15:11[K
      Progress:   2%|▊                                    |  ETA: 18:07:49[K
      Progress:   2%|▉                                    |  ETA: 18:03:50[K
      Progress:   2%|▉                                    |  ETA: 17:55:04[K
      Progress:   2%|▉                                    |  ETA: 17:45:49[K
      Progress:   2%|▉                                    |  ETA: 17:46:48[K
      Progress:   2%|▉                                    |  ETA: 17:44:44[K
      Progress:   2%|▉                                    |  ETA: 17:35:25[K
      Progress:   2%|▉                                    |  ETA: 17:08:39[K
      Progress:   2%|▉                                    |  ETA: 16:45:22[K
      Progress:   2%|▉                                    |  ETA: 16:38:18[K
      Progress:   2%|▉                                    |  ETA: 16:31:01[K
      Progress:   2%|▉                                    |  ETA: 16:23:02[K
      Progress:   2%|▉                                    |  ETA: 16:17:21[K
      Progress:   3%|▉                                    |  ETA: 16:12:09[K
      Progress:   3%|█                                    |  ETA: 16:05:18[K
      Progress:   3%|█                                    |  ETA: 15:57:39[K
      Progress:   3%|█                                    |  ETA: 15:50:37[K
      Progress:   3%|█                                    |  ETA: 15:47:17[K
      Progress:   3%|█                                    |  ETA: 15:40:08[K
      Progress:   3%|█                                    |  ETA: 15:33:23[K
      Progress:   3%|█                                    |  ETA: 15:26:19[K
      Progress:   3%|█                                    |  ETA: 15:20:26[K
      Progress:   3%|█                                    |  ETA: 15:13:32[K
      Progress:   3%|█                                    |  ETA: 15:07:31[K
      Progress:   3%|█                                    |  ETA: 15:01:28[K
      Progress:   3%|█                                    |  ETA: 14:56:54[K
      Progress:   3%|█                                    |  ETA: 14:54:21[K
      Progress:   3%|█                                    |  ETA: 14:48:10[K
      Progress:   3%|█                                    |  ETA: 14:44:38[K
      Progress:   3%|█                                    |  ETA: 14:38:51[K
      Progress:   3%|█                                    |  ETA: 14:34:15[K
      Progress:   3%|█▏                                   |  ETA: 14:28:18[K
      Progress:   3%|█▏                                   |  ETA: 14:02:59[K
      Progress:   3%|█▏                                   |  ETA: 13:57:24[K
      Progress:   3%|█▏                                   |  ETA: 13:53:00[K
      Progress:   3%|█▏                                   |  ETA: 13:47:25[K
      Progress:   3%|█▏                                   |  ETA: 13:43:04[K
      Progress:   3%|█▏                                   |  ETA: 13:38:18[K
      Progress:   3%|█▏                                   |  ETA: 13:34:06[K
      Progress:   3%|█▏                                   |  ETA: 13:29:19[K
      Progress:   3%|█▏                                   |  ETA: 13:24:03[K
      Progress:   3%|█▏                                   |  ETA: 13:19:14[K
      Progress:   3%|█▏                                   |  ETA: 13:15:14[K
      Progress:   3%|█▏                                   |  ETA: 13:11:05[K
      Progress:   3%|█▎                                   |  ETA: 13:06:26[K
      Progress:   3%|█▎                                   |  ETA: 13:01:32[K
      Progress:   3%|█▎                                   |  ETA: 12:57:01[K
      Progress:   3%|█▎                                   |  ETA: 12:53:35[K
      Progress:   3%|█▎                                   |  ETA: 12:49:50[K
      Progress:   3%|█▎                                   |  ETA: 12:46:18[K
      Progress:   3%|█▎                                   |  ETA: 12:42:52[K
      Progress:   3%|█▎                                   |  ETA: 12:38:36[K
      Progress:   3%|█▎                                   |  ETA: 12:36:00[K
      Progress:   3%|█▎                                   |  ETA: 12:35:16[K
      Progress:   3%|█▎                                   |  ETA: 12:32:13[K
      Progress:   3%|█▎                                   |  ETA: 12:28:25[K
      Progress:   3%|█▎                                   |  ETA: 12:25:48[K
      Progress:   3%|█▎                                   |  ETA: 12:22:54[K
      Progress:   4%|█▎                                   |  ETA: 12:19:05[K
      Progress:   4%|█▎                                   |  ETA: 12:14:57[K
      Progress:   4%|█▎                                   |  ETA: 12:11:05[K
      Progress:   4%|█▍                                   |  ETA: 12:07:19[K
      Progress:   4%|█▍                                   |  ETA: 12:03:42[K
      Progress:   4%|█▍                                   |  ETA: 11:59:40[K
      Progress:   4%|█▍                                   |  ETA: 11:55:52[K
      Progress:   4%|█▍                                   |  ETA: 11:52:51[K
      Progress:   4%|█▍                                   |  ETA: 11:50:23[K
      Progress:   4%|█▍                                   |  ETA: 11:46:31[K
      Progress:   4%|█▍                                   |  ETA: 11:43:15[K
      Progress:   4%|█▍                                   |  ETA: 11:40:00[K
      Progress:   4%|█▍                                   |  ETA: 11:36:29[K
      Progress:   4%|█▍                                   |  ETA: 11:32:48[K
      Progress:   4%|█▍                                   |  ETA: 11:29:35[K
      Progress:   4%|█▍                                   |  ETA: 11:27:18[K
      Progress:   4%|█▍                                   |  ETA: 11:24:21[K
      Progress:   4%|█▍                                   |  ETA: 11:22:54[K
      Progress:   4%|█▍                                   |  ETA: 11:22:05[K
      Progress:   4%|█▍                                   |  ETA: 11:18:40[K
      Progress:   4%|█▌                                   |  ETA: 11:15:23[K
      Progress:   4%|█▌                                   |  ETA: 11:11:50[K
      Progress:   4%|█▌                                   |  ETA: 11:08:56[K
      Progress:   4%|█▌                                   |  ETA: 11:02:33[K
      Progress:   4%|█▌                                   |  ETA: 11:00:20[K
      Progress:   4%|█▌                                   |  ETA: 10:57:59[K
      Progress:   4%|█▌                                   |  ETA: 10:55:02[K
      Progress:   4%|█▌                                   |  ETA: 10:52:08[K
      Progress:   4%|█▌                                   |  ETA: 10:49:34[K
      Progress:   4%|█▌                                   |  ETA: 10:46:25[K
      Progress:   4%|█▌                                   |  ETA: 10:43:19[K
      Progress:   4%|█▌                                   |  ETA: 10:37:18[K
      Progress:   4%|█▌                                   |  ETA: 10:35:11[K
      Progress:   4%|█▌                                   |  ETA: 10:32:30[K
      Progress:   4%|█▌                                   |  ETA: 10:29:27[K
      Progress:   4%|█▋                                   |  ETA: 10:26:32[K
      Progress:   4%|█▋                                   |  ETA: 10:24:06[K
      Progress:   4%|█▋                                   |  ETA: 10:22:36[K
      Progress:   4%|█▋                                   |  ETA: 10:19:39[K
      Progress:   4%|█▋                                   |  ETA: 10:17:43[K
      Progress:   4%|█▋                                   |  ETA: 10:14:54[K
      Progress:   4%|█▋                                   |  ETA: 10:13:02[K
      Progress:   4%|█▋                                   |  ETA: 10:10:28[K
      Progress:   4%|█▋                                   |  ETA: 10:09:39[K
      Progress:   4%|█▋                                   |  ETA: 10:07:03[K
      Progress:   4%|█▋                                   |  ETA: 10:04:44[K
      Progress:   4%|█▋                                   |  ETA: 9:59:51[K
      Progress:   4%|█▋                                   |  ETA: 9:57:38[K
      Progress:   5%|█▋                                   |  ETA: 9:56:49[K
      Progress:   5%|█▋                                   |  ETA: 9:54:10[K
      Progress:   5%|█▋                                   |  ETA: 9:51:30[K
      Progress:   5%|█▊                                   |  ETA: 9:49:59[K
      Progress:   5%|█▊                                   |  ETA: 9:48:18[K
      Progress:   5%|█▊                                   |  ETA: 9:47:32[K
      Progress:   5%|█▊                                   |  ETA: 9:45:48[K
      Progress:   5%|█▊                                   |  ETA: 9:47:20[K
      Progress:   5%|█▊                                   |  ETA: 9:46:39[K
      Progress:   5%|█▊                                   |  ETA: 9:45:28[K
      Progress:   5%|█▊                                   |  ETA: 9:43:38[K
      Progress:   5%|█▊                                   |  ETA: 9:41:43[K
      Progress:   5%|█▊                                   |  ETA: 9:39:29[K
      Progress:   5%|█▊                                   |  ETA: 9:37:36[K
      Progress:   5%|█▊                                   |  ETA: 9:36:01[K
      Progress:   5%|█▊                                   |  ETA: 9:37:53[K
      Progress:   5%|█▊                                   |  ETA: 9:35:38[K
      Progress:   5%|█▊                                   |  ETA: 9:33:30[K
      Progress:   5%|█▊                                   |  ETA: 9:31:13[K
      Progress:   5%|█▉                                   |  ETA: 9:29:04[K
      Progress:   5%|█▉                                   |  ETA: 9:27:03[K
      Progress:   5%|█▉                                   |  ETA: 9:24:48[K
      Progress:   5%|█▉                                   |  ETA: 9:23:29[K
      Progress:   5%|█▉                                   |  ETA: 9:21:26[K
      Progress:   5%|█▉                                   |  ETA: 9:19:51[K
      Progress:   5%|█▉                                   |  ETA: 9:18:01[K
      Progress:   5%|█▉                                   |  ETA: 9:15:48[K
      Progress:   5%|█▉                                   |  ETA: 9:14:47[K
      Progress:   5%|█▉                                   |  ETA: 9:10:57[K
      Progress:   5%|█▉                                   |  ETA: 9:09:00[K
      Progress:   5%|█▉                                   |  ETA: 9:08:32[K
      Progress:   5%|█▉                                   |  ETA: 9:06:55[K
      Progress:   5%|█▉                                   |  ETA: 9:05:06[K
      Progress:   5%|█▉                                   |  ETA: 9:03:02[K
      Progress:   5%|█▉                                   |  ETA: 9:01:31[K
      Progress:   5%|██                                   |  ETA: 8:59:24[K
      Progress:   5%|██                                   |  ETA: 8:58:18[K
      Progress:   5%|██                                   |  ETA: 8:56:13[K
      Progress:   5%|██                                   |  ETA: 8:58:38[K
      Progress:   5%|██                                   |  ETA: 8:56:44[K
      Progress:   5%|██                                   |  ETA: 8:55:13[K
      Progress:   5%|██                                   |  ETA: 8:53:32[K
      Progress:   5%|██                                   |  ETA: 8:53:08[K
      Progress:   5%|██                                   |  ETA: 8:51:19[K
      Progress:   5%|██                                   |  ETA: 8:49:27[K
      Progress:   5%|██                                   |  ETA: 8:47:36[K
      Progress:   5%|██                                   |  ETA: 8:48:51[K
      Progress:   5%|██                                   |  ETA: 8:48:10[K
      Progress:   6%|██                                   |  ETA: 8:46:47[K
      Progress:   6%|██                                   |  ETA: 8:45:38[K
      Progress:   6%|██                                   |  ETA: 8:43:50[K
      Progress:   6%|██                                   |  ETA: 8:42:17[K
      Progress:   6%|██▏                                  |  ETA: 8:40:57[K
      Progress:   6%|██▏                                  |  ETA: 8:39:08[K
      Progress:   6%|██▏                                  |  ETA: 8:37:40[K
      Progress:   6%|██▏                                  |  ETA: 8:36:31[K
      Progress:   6%|██▏                                  |  ETA: 8:34:45[K
      Progress:   6%|██▏                                  |  ETA: 8:33:05[K
      Progress:   6%|██▏                                  |  ETA: 8:31:43[K
      Progress:   6%|██▏                                  |  ETA: 8:30:21[K
      Progress:   6%|██▏                                  |  ETA: 8:29:40[K
      Progress:   6%|██▏                                  |  ETA: 8:28:18[K
      Progress:   6%|██▏                                  |  ETA: 8:26:41[K
      Progress:   6%|██▏                                  |  ETA: 8:25:01[K
      Progress:   6%|██▏                                  |  ETA: 8:23:18[K
      Progress:   6%|██▏                                  |  ETA: 8:21:45[K
      Progress:   6%|██▏                                  |  ETA: 8:19:57[K
      Progress:   6%|██▏                                  |  ETA: 8:18:24[K
      Progress:   6%|██▏                                  |  ETA: 8:17:12[K
      Progress:   6%|██▎                                  |  ETA: 8:16:29[K
      Progress:   6%|██▎                                  |  ETA: 8:14:53[K
      Progress:   6%|██▎                                  |  ETA: 8:13:18[K
      Progress:   6%|██▎                                  |  ETA: 8:11:36[K
      Progress:   6%|██▎                                  |  ETA: 8:10:28[K
      Progress:   6%|██▎                                  |  ETA: 8:09:10[K
      Progress:   6%|██▎                                  |  ETA: 8:07:29[K
      Progress:   6%|██▎                                  |  ETA: 8:07:26[K
      Progress:   6%|██▎                                  |  ETA: 8:05:47[K
      Progress:   6%|██▎                                  |  ETA: 8:04:32[K
      Progress:   6%|██▎                                  |  ETA: 8:02:56[K
      Progress:   6%|██▎                                  |  ETA: 8:02:06[K
      Progress:   6%|██▎                                  |  ETA: 8:00:33[K
      Progress:   6%|██▎                                  |  ETA: 7:59:50[K
      Progress:   6%|██▎                                  |  ETA: 7:58:19[K
      Progress:   6%|██▎                                  |  ETA: 7:58:57[K
      Progress:   6%|██▎                                  |  ETA: 7:57:34[K
      Progress:   6%|██▍                                  |  ETA: 7:57:43[K
      Progress:   6%|██▍                                  |  ETA: 7:56:14[K
      Progress:   6%|██▍                                  |  ETA: 7:55:04[K
      Progress:   6%|██▍                                  |  ETA: 7:54:38[K
      Progress:   6%|██▍                                  |  ETA: 7:54:02[K
      Progress:   6%|██▍                                  |  ETA: 7:52:53[K
      Progress:   6%|██▍                                  |  ETA: 7:52:13[K
      Progress:   6%|██▍                                  |  ETA: 7:50:42[K
      Progress:   6%|██▍                                  |  ETA: 7:50:26[K
      Progress:   6%|██▍                                  |  ETA: 7:49:08[K
      Progress:   6%|██▍                                  |  ETA: 7:48:18[K
      Progress:   6%|██▍                                  |  ETA: 7:47:49[K
      Progress:   6%|██▍                                  |  ETA: 7:47:00[K
      Progress:   7%|██▍                                  |  ETA: 7:45:36[K
      Progress:   7%|██▍                                  |  ETA: 7:44:52[K
      Progress:   7%|██▍                                  |  ETA: 7:43:55[K
      Progress:   7%|██▌                                  |  ETA: 7:40:58[K
      Progress:   7%|██▌                                  |  ETA: 7:39:51[K
      Progress:   7%|██▌                                  |  ETA: 7:38:57[K
      Progress:   7%|██▌                                  |  ETA: 7:38:10[K
      Progress:   7%|██▌                                  |  ETA: 7:36:54[K
      Progress:   7%|██▌                                  |  ETA: 7:35:56[K
      Progress:   7%|██▌                                  |  ETA: 7:35:09[K
      Progress:   7%|██▌                                  |  ETA: 7:34:15[K
      Progress:   7%|██▌                                  |  ETA: 7:32:55[K
      Progress:   7%|██▌                                  |  ETA: 7:31:39[K
      Progress:   7%|██▌                                  |  ETA: 7:30:59[K
      Progress:   7%|██▌                                  |  ETA: 7:30:54[K
      Progress:   7%|██▌                                  |  ETA: 7:29:48[K
      Progress:   7%|██▌                                  |  ETA: 7:28:42[K
      Progress:   7%|██▌                                  |  ETA: 7:27:44[K
      Progress:   7%|██▌                                  |  ETA: 7:26:51[K
      Progress:   7%|██▋                                  |  ETA: 7:25:33[K
      Progress:   7%|██▋                                  |  ETA: 7:24:14[K
      Progress:   7%|██▋                                  |  ETA: 7:23:16[K
      Progress:   7%|██▋                                  |  ETA: 7:22:08[K
      Progress:   7%|██▋                                  |  ETA: 7:20:57[K
      Progress:   7%|██▋                                  |  ETA: 7:20:08[K
      Progress:   7%|██▋                                  |  ETA: 7:18:57[K
      Progress:   7%|██▋                                  |  ETA: 7:17:56[K
      Progress:   7%|██▋                                  |  ETA: 7:17:26[K
      Progress:   7%|██▋                                  |  ETA: 7:14:56[K
      Progress:   7%|██▋                                  |  ETA: 7:13:50[K
      Progress:   7%|██▋                                  |  ETA: 7:12:55[K
      Progress:   7%|██▋                                  |  ETA: 7:12:08[K
      Progress:   7%|██▋                                  |  ETA: 7:10:53[K
      Progress:   7%|██▋                                  |  ETA: 7:10:53[K
      Progress:   7%|██▋                                  |  ETA: 7:09:56[K
      Progress:   7%|██▊                                  |  ETA: 7:06:20[K
      Progress:   8%|██▊                                  |  ETA: 6:57:36[K
      Progress:   8%|██▊                                  |  ETA: 6:56:43[K
      Progress:   8%|██▊                                  |  ETA: 6:55:40[K
      Progress:   8%|██▊                                  |  ETA: 6:54:48[K
      Progress:   8%|██▉                                  |  ETA: 6:52:49[K
      Progress:   8%|██▉                                  |  ETA: 6:52:29[K
      Progress:   8%|██▉                                  |  ETA: 6:51:57[K
      Progress:   8%|██▉                                  |  ETA: 6:49:04[K
      Progress:   8%|██▉                                  |  ETA: 6:48:26[K
      Progress:   8%|██▉                                  |  ETA: 6:48:08[K
      Progress:   8%|██▉                                  |  ETA: 6:48:02[K
      Progress:   8%|██▉                                  |  ETA: 6:46:56[K
      Progress:   8%|██▉                                  |  ETA: 6:46:16[K
      Progress:   8%|██▉                                  |  ETA: 6:45:12[K
      Progress:   8%|██▉                                  |  ETA: 6:44:32[K
      Progress:   8%|██▉                                  |  ETA: 6:43:53[K
      Progress:   8%|██▉                                  |  ETA: 6:42:49[K
      Progress:   8%|██▉                                  |  ETA: 6:41:59[K
      Progress:   8%|███                                  |  ETA: 6:41:09[K
      Progress:   8%|███                                  |  ETA: 6:40:31[K
      Progress:   8%|███                                  |  ETA: 6:39:41[K
      Progress:   8%|███                                  |  ETA: 6:38:38[K
      Progress:   8%|███                                  |  ETA: 6:39:05[K
      Progress:   8%|███                                  |  ETA: 6:38:02[K
      Progress:   8%|███                                  |  ETA: 6:37:12[K
      Progress:   8%|███                                  |  ETA: 6:36:16[K
      Progress:   8%|███                                  |  ETA: 6:36:08[K
      Progress:   8%|███                                  |  ETA: 6:35:21[K
      Progress:   8%|███                                  |  ETA: 6:35:10[K
      Progress:   8%|███                                  |  ETA: 6:34:08[K
      Progress:   8%|███                                  |  ETA: 6:33:26[K
      Progress:   8%|███                                  |  ETA: 6:32:27[K
      Progress:   8%|███                                  |  ETA: 6:31:44[K
      Progress:   8%|███                                  |  ETA: 6:31:39[K
      Progress:   8%|███                                  |  ETA: 6:30:41[K
      Progress:   8%|███▏                                 |  ETA: 6:29:58[K
      Progress:   8%|███▏                                 |  ETA: 6:29:09[K
      Progress:   8%|███▏                                 |  ETA: 6:28:09[K
      Progress:   8%|███▏                                 |  ETA: 6:27:11[K
      Progress:   8%|███▏                                 |  ETA: 6:26:22[K
      Progress:   8%|███▏                                 |  ETA: 6:25:33[K
      Progress:   8%|███▏                                 |  ETA: 6:24:35[K
      Progress:   8%|███▏                                 |  ETA: 6:23:45[K
      Progress:   8%|███▏                                 |  ETA: 6:23:26[K
      Progress:   8%|███▏                                 |  ETA: 6:22:40[K
      Progress:   8%|███▏                                 |  ETA: 6:21:54[K
      Progress:   8%|███▏                                 |  ETA: 6:21:05[K
      Progress:   9%|███▏                                 |  ETA: 6:20:33[K
      Progress:   9%|███▏                                 |  ETA: 6:20:03[K
      Progress:   9%|███▏                                 |  ETA: 6:19:17[K
      Progress:   9%|███▏                                 |  ETA: 6:18:26[K
      Progress:   9%|███▏                                 |  ETA: 6:17:50[K
      Progress:   9%|███▎                                 |  ETA: 6:17:19[K
      Progress:   9%|███▎                                 |  ETA: 6:16:42[K
      Progress:   9%|███▎                                 |  ETA: 6:16:59[K
      Progress:   9%|███▎                                 |  ETA: 6:16:04[K
      Progress:   9%|███▎                                 |  ETA: 6:15:27[K
      Progress:   9%|███▎                                 |  ETA: 6:14:34[K
      Progress:   9%|███▎                                 |  ETA: 6:13:41[K
      Progress:   9%|███▎                                 |  ETA: 6:13:58[K
      Progress:   9%|███▎                                 |  ETA: 6:13:15[K
      Progress:   9%|███▎                                 |  ETA: 6:12:58[K
      Progress:   9%|███▎                                 |  ETA: 6:12:15[K
      Progress:   9%|███▎                                 |  ETA: 6:11:41[K
      Progress:   9%|███▎                                 |  ETA: 6:10:50[K
      Progress:   9%|███▎                                 |  ETA: 6:10:47[K
      Progress:   9%|███▎                                 |  ETA: 6:10:03[K
      Progress:   9%|███▎                                 |  ETA: 6:09:21[K
      Progress:   9%|███▎                                 |  ETA: 6:08:31[K
      Progress:   9%|███▍                                 |  ETA: 6:08:22[K
      Progress:   9%|███▍                                 |  ETA: 6:08:21[K
      Progress:   9%|███▍                                 |  ETA: 6:07:42[K
      Progress:   9%|███▍                                 |  ETA: 6:06:49[K
      Progress:   9%|███▍                                 |  ETA: 6:06:04[K
      Progress:   9%|███▍                                 |  ETA: 6:05:30[K
      Progress:   9%|███▍                                 |  ETA: 6:04:56[K
      Progress:   9%|███▍                                 |  ETA: 6:04:44[K
      Progress:   9%|███▍                                 |  ETA: 6:04:09[K
      Progress:   9%|███▍                                 |  ETA: 6:03:24[K
      Progress:   9%|███▍                                 |  ETA: 6:02:35[K
      Progress:   9%|███▍                                 |  ETA: 6:00:52[K
      Progress:   9%|███▍                                 |  ETA: 6:00:38[K
      Progress:   9%|███▍                                 |  ETA: 5:59:51[K
      Progress:   9%|███▍                                 |  ETA: 5:59:36[K
      Progress:   9%|███▍                                 |  ETA: 5:58:48[K
      Progress:   9%|███▌                                 |  ETA: 5:58:20[K
      Progress:   9%|███▌                                 |  ETA: 5:57:33[K
      Progress:   9%|███▌                                 |  ETA: 5:56:50[K
      Progress:   9%|███▌                                 |  ETA: 5:56:01[K
      Progress:   9%|███▌                                 |  ETA: 5:55:17[K
      Progress:   9%|███▌                                 |  ETA: 5:54:59[K
      Progress:   9%|███▌                                 |  ETA: 5:54:21[K
      Progress:   9%|███▌                                 |  ETA: 5:53:33[K
      Progress:   9%|███▌                                 |  ETA: 5:52:49[K
      Progress:   9%|███▌                                 |  ETA: 5:52:30[K
      Progress:  10%|███▌                                 |  ETA: 5:51:44[K
      Progress:  10%|███▌                                 |  ETA: 5:51:03[K
      Progress:  10%|███▌                                 |  ETA: 5:50:36[K
      Progress:  10%|███▌                                 |  ETA: 5:50:58[K
      Progress:  10%|███▌                                 |  ETA: 5:50:36[K
      Progress:  10%|███▋                                 |  ETA: 5:44:00[K
      Progress:  10%|███▋                                 |  ETA: 5:42:40[K
      Progress:  10%|███▋                                 |  ETA: 5:41:55[K
      Progress:  10%|███▋                                 |  ETA: 5:41:18[K
      Progress:  10%|███▋                                 |  ETA: 5:40:40[K
      Progress:  10%|███▋                                 |  ETA: 5:39:58[K
      Progress:  10%|███▋                                 |  ETA: 5:39:28[K
      Progress:  10%|███▋                                 |  ETA: 5:38:54[K
      Progress:  10%|███▋                                 |  ETA: 5:38:16[K
      Progress:  10%|███▊                                 |  ETA: 5:37:42[K
      Progress:  10%|███▊                                 |  ETA: 5:38:24[K
      Progress:  10%|███▊                                 |  ETA: 5:37:45[K
      Progress:  10%|███▊                                 |  ETA: 5:37:25[K
      Progress:  10%|███▊                                 |  ETA: 5:36:41[K
      Progress:  10%|███▊                                 |  ETA: 5:35:25[K
      Progress:  10%|███▊                                 |  ETA: 5:35:24[K
      Progress:  10%|███▊                                 |  ETA: 5:35:09[K
      Progress:  10%|███▊                                 |  ETA: 5:34:28[K
      Progress:  10%|███▊                                 |  ETA: 5:34:01[K
      Progress:  10%|███▊                                 |  ETA: 5:33:30[K
      Progress:  10%|███▊                                 |  ETA: 5:33:00[K
      Progress:  10%|███▊                                 |  ETA: 5:32:22[K
      Progress:  10%|███▊                                 |  ETA: 5:32:09[K
      Progress:  10%|███▊                                 |  ETA: 5:31:29[K
      Progress:  10%|███▊                                 |  ETA: 5:31:18[K
      Progress:  10%|███▉                                 |  ETA: 5:27:14[K
      Progress:  10%|███▉                                 |  ETA: 5:26:47[K
      Progress:  10%|███▉                                 |  ETA: 5:26:08[K
      Progress:  10%|███▉                                 |  ETA: 5:25:30[K
      [...]
      Progress: 100%|█████████████████████████████████████| Time: 1:16:13[K

      Generating 33 samples per second on average
      Average exploration depth: 8.9
      MCTS memory footprint per worker: 6.11MB
      Experience buffer size: 960,000 (412,452 distinct boards)

    Starting learning

      Optimizing the loss

           Loss       Lv       Lp     Lreg     Linv       Hp    Hpnet
         0.3328   0.0911   0.1450   0.0964   0.0002   0.7501   0.8812
         0.3134   0.0812   0.1383   0.0936   0.0001   0.7501   0.8821

      Launching a checkpoint evaluation

        Progress:   2%|▌                                  |  ETA: 5 days, 0:41:54[K
        Progress:   2%|▉                                  |  ETA: 3 days, 14:13:36[K
        Progress:   3%|█▏                                 |  ETA: 2 days, 17:34:43[K
        Progress:   4%|█▍                                 |  ETA: 2 days, 4:39:21[K
        Progress:   5%|█▋                                 |  ETA: 1 days, 20:26:14[K
        Progress:   5%|█▉                                 |  ETA: 1 days, 14:41:38[K
        Progress:   6%|██▎                                |  ETA: 1 days, 9:35:52[K
        Progress:   7%|██▌                                |  ETA: 1 days, 5:59:06[K
        Progress:   8%|██▊                                |  ETA: 1 days, 2:59:23[K
        Progress:   9%|███                                |  ETA: 1 days, 0:22:12[K
        Progress:   9%|███▎                               |  ETA: 22:10:58[K
        Progress:  10%|███▌                               |  ETA: 20:23:51[K
        [...]
        Progress: 100%|███████████████████████████████████| Time: 2:39:07[K

        Average reward: +0.02 (48% won, 5% draw, 46% lost), redundancy: 51.3%

    Running benchmark: AlphaZero against MCTS (1000 rollouts)

      Progress:   1%|▎                                    |  ETA: 3 days, 0:33:26[K
      Progress:   1%|▍                                    |  ETA: 2 days, 0:15:56[K
      Progress:   2%|▋                                    |  ETA: 1 days, 14:38:46[K
      Progress:   2%|▊                                    |  ETA: 1 days, 7:10:01[K
      Progress:   2%|▉                                    |  ETA: 1 days, 1:56:41[K
      Progress:   3%|█                                    |  ETA: 22:23:45[K
      Progress:   3%|█▏                                   |  ETA: 19:35:21[K
      Progress:   4%|█▎                                   |  ETA: 17:26:49[K
      Progress:   4%|█▌                                   |  ETA: 16:15:39[K
      Progress:   4%|█▋                                   |  ETA: 14:58:03[K
      Progress:   5%|█▊                                   |  ETA: 14:04:09[K
      Progress:   5%|█▉                                   |  ETA: 13:12:07[K
      Progress:   5%|██                                   |  ETA: 12:15:36[K
      Progress:   6%|██▏                                  |  ETA: 11:34:39[K
      Progress:   6%|██▎                                  |  ETA: 10:56:51[K
      Progress:   7%|██▌                                  |  ETA: 10:18:58[K
      Progress:   7%|██▋                                  |  ETA: 9:44:14[K
      Progress:   7%|██▊                                  |  ETA: 9:25:27[K
      Progress:   8%|██▉                                  |  ETA: 8:56:26[K
      Progress:   8%|███                                  |  ETA: 8:46:41[K
      Progress:   9%|███▏                                 |  ETA: 8:41:20[K
      Progress:   9%|███▍                                 |  ETA: 8:26:19[K
      Progress:   9%|███▌                                 |  ETA: 8:04:19[K
      Progress:  10%|███▋                                 |  ETA: 7:43:45[K
      Progress:  10%|███▊                                 |  ETA: 7:28:47[K
      [...]
      Progress: 100%|█████████████████████████████████████| Time: 1:32:22[K

      Average reward: +1.00 (100% won, 0% draw, 0% lost), redundancy: 29.0%

    Running benchmark: Network Only against MCTS (1000 rollouts)

      Progress: 100%|█████████████████████████████████████| Time: 0:00:42[K

  End of calculation time UTC
  Fri 01 Oct 2021 10:07:56 AM UTC

This is the output of the first try with the proposed profiling tool (https://github.com/jonathan-laurent/AlphaZero.jl/issues/76). I adjusted the code as presented below to be run on non CUDA machine. Please be informed that in contrary to the previous post I did not use MKL. Should you have any comments please let me know. Any comments wrt use of CPUs and non GUI environments would be particularly welcomed. Please be informed that I have not noticed any JSON trace file.

ADJUSTED CODE
#####
##### Profiling Self-play
#####

# using Revise; Revise.includet("scripts/profile/self_play.jl")
# profile_self_play()

using AlphaZero
using Setfield

using Profile
using LoggingExtras
using ProfileSVG
# using ProfileView # my comment out, package failed to precompile as ERROR: LoadError: InitError: Cannot open display: 
import StatProfilerHTML

# import CUDA  # my comment out

function profile_self_play(
  exp::Experiment = Examples.experiments["connect-four"];
  num_games=512,
  num_workers=256,
  batch_size=128,
  num_filters=64)

  exp = @set exp.netparams.num_filters = num_filters
  exp = @set exp.params.self_play.sim.num_workers = num_workers
  exp = @set exp.params.self_play.sim.batch_size = batch_size
  exp = @set exp.params.self_play.sim.num_games = num_games

  session = Session(exp, autosave=false, dir="sessions/profile-backprop-$(exp.name)")
  env = session.env
  UI.Log.section(session.logger, 1, "Profiling data generation")
  # CUDA.@time AlphaZero.self_play_step!(env, session)  # my comment out
  @time AlphaZero.self_play_step!(env, session) # my change (added line)
  return
end

function full_profile()
  profile_self_play() # Compilation
  profile_self_play()
  profile_self_play() # Double-checking
end

function chrome_tracing()
  global_logger(TeeLogger(
    ConsoleLogger(),
    AlphaZero.ProfUtils.chrome_tracing_logger("tracing.json")))
  profile_self_play()
end

function flame_graph()
  profile_self_play() # Compilation
  Profile.init(n=10^8, delay=0.01)
  Profile.clear()
  @profile profile_self_play()
  ProfileSVG.save("self-play-profile.svg")
  # StatProfilerHTML.statprofilehtml()
  # ProfileView.view(C=true)
end

full_profile() # my uncomment
chrome_tracing()

########################################################################
Run command
########################################################################
julia -p 56 /home/xxxxxx/data/JuliaCon21/AlphaZero.jl/scripts/profile/self_play_chrome.jl

Starting calculation time UTC
Sat 02 Oct 2021 11:54:43 AM UTC

Initializing a new AlphaZero environment

Profiling data generation

  Starting self-play

    Progress:   0%|▏                                    |  ETA: 2 days, 9:51:02[K
    Progress:   1%|▎                                    |  ETA: 1 days, 14:36:42[K
    Progress:   1%|▎                                    |  ETA: 1 days, 5:05:15[K
    Progress:   1%|▍                                    |  ETA: 23:15:27[K
    Progress:   1%|▍                                    |  ETA: 19:20:43[K
    Progress:   1%|▌                                    |  ETA: 16:37:34[K
    Progress:   2%|▋                                    |  ETA: 14:32:19[K
    Progress:   2%|▋                                    |  ETA: 12:58:28[K
    Progress:   2%|▊                                    |  ETA: 11:39:21[K
    Progress:   2%|▊                                    |  ETA: 10:36:07[K
    Progress:   2%|▉                                    |  ETA: 9:42:16[K
    Progress:   3%|█                                    |  ETA: 8:57:43[K
    Progress:   3%|█                                    |  ETA: 8:19:00[K
    Progress:   3%|█▏                                   |  ETA: 7:46:04[K
    Progress:   3%|█▏                                   |  ETA: 7:18:03[K
    Progress:   3%|█▎                                   |  ETA: 6:53:27[K
    Progress:   4%|█▎                                   |  ETA: 6:29:50[K
    Progress:   4%|█▍                                   |  ETA: 6:09:38[K
    Progress:   4%|█▌                                   |  ETA: 5:50:29[K
    Progress:   4%|█▌                                   |  ETA: 5:33:11[K
    Progress:   4%|█▋                                   |  ETA: 5:20:57[K
    Progress:   4%|█▋                                   |  ETA: 5:06:41[K
    Progress:   5%|█▊                                   |  ETA: 4:53:46[K
    Progress:   5%|█▊                                   |  ETA: 4:46:46[K
    Progress:   5%|█▉                                   |  ETA: 4:37:01[K
    Progress:   5%|██                                   |  ETA: 4:59:13[K
    Progress:   5%|██                                   |  ETA: 4:52:56[K
    Progress:   6%|██▏                                  |  ETA: 4:43:46[K
    Progress:   6%|██▏                                  |  ETA: 4:35:26[K
    Progress:   6%|██▎                                  |  ETA: 4:27:52[K
    Progress:   6%|██▎                                  |  ETA: 4:20:29[K
    Progress:   6%|██▍                                  |  ETA: 4:12:10[K
    Progress:   7%|██▌                                  |  ETA: 4:05:37[K
    Progress:   7%|██▌                                  |  ETA: 3:58:24[K
    Progress:   7%|██▋                                  |  ETA: 3:52:06[K
    Progress:   7%|██▋                                  |  ETA: 3:47:13[K
    Progress:   7%|██▊                                  |  ETA: 3:46:53[K
    Progress:   8%|██▉                                  |  ETA: 3:57:24[K
    Progress:   8%|██▉                                  |  ETA: 3:51:12[K
    Progress:   8%|███                                  |  ETA: 3:47:09[K
    Progress:   8%|███                                  |  ETA: 3:43:24[K
    Progress:   8%|███▏                                 |  ETA: 3:43:43[K
    Progress:   9%|███▏                                 |  ETA: 3:38:42[K
    Progress:   9%|███▎                                 |  ETA: 3:36:14[K
    Progress:   9%|███▍                                 |  ETA: 3:32:01[K
    Progress:   9%|███▍                                 |  ETA: 3:29:59[K
    Progress:   9%|███▌                                 |  ETA: 3:38:33[K
    Progress:  10%|███▌                                 |  ETA: 3:36:25[K
    Progress:  10%|███▋                                 |  ETA: 3:36:24[K
    Progress:  10%|███▋                                 |  ETA: 3:32:14[K
    Progress:  10%|███▊                                 |  ETA: 3:29:29[K
    Progress:  10%|███▉                                 |  ETA: 3:26:02[K
    [...]
    Progress: 100%|█████████████████████████████████████| Time: 1:39:26[K

    Generating 2 samples per second on average
    Average exploration depth: 4.2
    MCTS memory footprint per worker: 6.06MB
    Experience buffer size: 12,169 (9,964 distinct boards)
5968.539815 seconds (9.83 M allocations: 673.459 MiB, 0.00% gc time)

Initializing a new AlphaZero environment

Profiling data generation

  Starting self-play

    Progress:   0%|▏                                    |  ETA: 2 days, 9:10:52[K
    Progress:   1%|▎                                    |  ETA: 1 days, 14:07:23[K
    Progress:   1%|▎                                    |  ETA: 1 days, 4:34:34[K
    Progress:   1%|▍                                    |  ETA: 22:50:39[K
    Progress:   1%|▍                                    |  ETA: 19:03:05[K
    Progress:   1%|▌                                    |  ETA: 16:18:23[K
    Progress:   2%|▋                                    |  ETA: 14:16:30[K
    Progress:   2%|▋                                    |  ETA: 12:41:05[K
    Progress:   2%|▊                                    |  ETA: 10:26:44[K
    Progress:   2%|▉                                    |  ETA: 9:33:32[K
    Progress:   3%|█                                    |  ETA: 9:00:04[K
    Progress:   3%|█▏                                   |  ETA: 7:25:26[K
    Progress:   3%|█▎                                   |  ETA: 7:00:29[K
    Progress:   4%|█▎                                   |  ETA: 6:38:09[K
    Progress:   4%|█▍                                   |  ETA: 6:16:43[K
    Progress:   4%|█▌                                   |  ETA: 5:57:36[K
    Progress:   4%|█▌                                   |  ETA: 5:39:57[K
    Progress:   4%|█▋                                   |  ETA: 5:25:34[K
    Progress:   4%|█▋                                   |  ETA: 5:16:41[K
    Progress:   5%|█▊                                   |  ETA: 5:39:40[K
    Progress:   5%|█▊                                   |  ETA: 5:26:54[K
    Progress:   5%|█▉                                   |  ETA: 5:14:50[K
    Progress:   5%|██                                   |  ETA: 5:03:28[K
    Progress:   5%|██                                   |  ETA: 4:59:57[K
    Progress:   6%|██▏                                  |  ETA: 5:09:51[K
    Progress:   6%|██▏                                  |  ETA: 5:05:18[K
    Progress:   6%|██▎                                  |  ETA: 4:58:04[K
    Progress:   6%|██▎                                  |  ETA: 4:48:58[K
    Progress:   6%|██▍                                  |  ETA: 4:40:12[K
    Progress:   7%|██▌                                  |  ETA: 4:31:49[K
    Progress:   7%|██▌                                  |  ETA: 4:24:03[K
    Progress:   7%|██▋                                  |  ETA: 4:17:26[K
    Progress:   7%|██▋                                  |  ETA: 4:20:30[K
    Progress:   7%|██▊                                  |  ETA: 4:15:20[K
    Progress:   8%|██▉                                  |  ETA: 4:23:09[K
    Progress:   8%|██▉                                  |  ETA: 4:26:38[K
    Progress:   8%|███                                  |  ETA: 4:20:50[K
    Progress:   8%|███                                  |  ETA: 4:15:01[K
    Progress:   8%|███▏                                 |  ETA: 4:19:54[K
    Progress:   9%|███▏                                 |  ETA: 4:26:06[K
    Progress:   9%|███▎                                 |  ETA: 4:20:50[K
    Progress:   9%|███▍                                 |  ETA: 4:15:39[K
    Progress:   9%|███▍                                 |  ETA: 4:09:54[K
    Progress:   9%|███▌                                 |  ETA: 4:06:29[K
    Progress:  10%|███▌                                 |  ETA: 4:02:51[K
    Progress:  10%|███▋                                 |  ETA: 3:59:22[K
    Progress:  10%|███▋                                 |  ETA: 3:55:58[K
    Progress:  10%|███▊                                 |  ETA: 3:59:02[K
    Progress:  10%|███▉                                 |  ETA: 4:01:04[K
    [...]
    Progress: 100%|█████████████████████████████████████| Time: 1:37:30[K

    Generating 2 samples per second on average
    Average exploration depth: 4.2
    MCTS memory footprint per worker: 6.01MB
    Experience buffer size: 12,145 (10,099 distinct boards)
5850.290919 seconds (324.99 k allocations: 168.489 MiB, 0.01% gc time, 0.00% compilation time)

Initializing a new AlphaZero environment

Profiling data generation

  Starting self-play

    Progress:   0%|▏                                    |  ETA: 2 days, 8:48:05[K
    Progress:   1%|▎                                    |  ETA: 1 days, 14:04:56[K
    Progress:   1%|▎                                    |  ETA: 1 days, 4:40:02[K
    Progress:   1%|▍                                    |  ETA: 22:55:04[K
    Progress:   1%|▍                                    |  ETA: 19:04:58[K
    Progress:   1%|▌                                    |  ETA: 16:22:40[K
    Progress:   2%|▋                                    |  ETA: 14:21:56[K
    Progress:   2%|▋                                    |  ETA: 12:44:53[K
    Progress:   2%|▊                                    |  ETA: 11:27:27[K
    Progress:   2%|▊                                    |  ETA: 10:26:21[K
    Progress:   2%|▉                                    |  ETA: 9:37:27[K
    Progress:   3%|█                                    |  ETA: 8:54:12[K
    Progress:   3%|█                                    |  ETA: 10:01:52[K
    Progress:   3%|█▏                                   |  ETA: 10:31:19[K
    Progress:   3%|█▏                                   |  ETA: 9:52:46[K
    Progress:   3%|█▎                                   |  ETA: 9:18:58[K
    Progress:   4%|█▎                                   |  ETA: 8:47:19[K
    Progress:   4%|█▍                                   |  ETA: 8:19:11[K
    Progress:   4%|█▌                                   |  ETA: 7:59:55[K
    Progress:   4%|█▌                                   |  ETA: 7:47:45[K
    Progress:   4%|█▋                                   |  ETA: 7:27:27[K
    Progress:   4%|█▋                                   |  ETA: 7:07:15[K
    Progress:   5%|█▊                                   |  ETA: 6:52:19[K
    Progress:   5%|█▊                                   |  ETA: 6:36:59[K
    Progress:   5%|█▉                                   |  ETA: 6:22:19[K
    Progress:   5%|██                                   |  ETA: 6:46:08[K
    Progress:   5%|██                                   |  ETA: 6:39:12[K
    Progress:   6%|██▏                                  |  ETA: 6:26:06[K
    Progress:   6%|██▏                                  |  ETA: 6:15:55[K
    Progress:   6%|██▎                                  |  ETA: 6:05:40[K
    Progress:   6%|██▎                                  |  ETA: 6:13:23[K
    Progress:   6%|██▍                                  |  ETA: 6:04:56[K
    Progress:   7%|██▌                                  |  ETA: 5:56:14[K
    Progress:   7%|██▌                                  |  ETA: 5:48:25[K
    Progress:   7%|██▋                                  |  ETA: 5:39:49[K
    Progress:   7%|██▋                                  |  ETA: 5:31:21[K
    Progress:   7%|██▊                                  |  ETA: 5:42:55[K
    Progress:   8%|██▉                                  |  ETA: 5:40:03[K
    Progress:   8%|██▉                                  |  ETA: 5:36:09[K
    Progress:   8%|███                                  |  ETA: 5:28:48[K
    Progress:   8%|███                                  |  ETA: 5:25:18[K
    Progress:   8%|███▏                                 |  ETA: 5:19:57[K
    Progress:   9%|███▏                                 |  ETA: 5:13:45[K
    Progress:   9%|███▎                                 |  ETA: 5:18:00[K
    Progress:   9%|███▍                                 |  ETA: 5:14:53[K
    Progress:   9%|███▍                                 |  ETA: 5:10:53[K
    Progress:   9%|███▌                                 |  ETA: 5:05:17[K
    Progress:  10%|███▌                                 |  ETA: 5:01:59[K
    Progress:  10%|███▋                                 |  ETA: 4:58:51[K
    Progress:  10%|███▋                                 |  ETA: 4:53:48[K
    Progress:  10%|███▊                                 |  ETA: 4:50:35[K
    Progress:  10%|███▉                                 |  ETA: 4:44:54[K
    [...]
    Progress: 100%|█████████████████████████████████████| Time: 1:39:04[K

    Generating 2 samples per second on average
    Average exploration depth: 4.3
    MCTS memory footprint per worker: 6.01MB
    Experience buffer size: 12,859 (10,749 distinct boards)
5944.456672 seconds (264.25 k allocations: 165.353 MiB, 0.00% gc time)

##############################################
PROFILING (Overall data for the session)
##############################################
Elapsed Time: 17877.267s
    IPC: 3.282
    SP GFLOPS: 628.383
    DP GFLOPS: 0.001
    x87 GFLOPS: 0.000
    Average CPU Frequency: 3.138 GHz 
Logical Core Utilization: 44.0% (49.298 out of 112)
Physical Core Utilization: 87.9% (49.223 out of 56)

Microarchitecture Usage: 64.8% of Pipeline Slots
    Retiring: 64.8% of Pipeline Slots
    Front-End Bound: 16.4% of Pipeline Slots
    Bad Speculation: 0.5% of Pipeline Slots
    Back-End Bound: 18.3% of Pipeline Slots
        Memory Bound: 4.7% of Pipeline Slots
            L1 Bound: 0.4% of Clockticks
            L2 Bound: 0.3% of Clockticks
            L3 Bound: 0.4% of Clockticks
                L3 Latency: 0.2% of Clockticks
            DRAM Bound: 1.1% of Clockticks
                Memory Bandwidth: 3.3% of Clockticks
                Memory Latency: 6.2% of Clockticks
                    Local DRAM: 1.7% of Clockticks
                    Remote DRAM: 0.0% of Clockticks
                    Remote Cache: 0.0% of Clockticks
            Store Bound: 2.2% of Clockticks
        Core Bound: 13.5% of Pipeline Slots
Memory Bound: 4.7% of Pipeline Slots
    Cache Bound: 1.1% of Clockticks
    DRAM Bound: 1.1% of Clockticks
    NUMA: % of Remote Accesses: 1.4%

Vectorization: 97.1% of Packed FP Operations
    Instruction Mix
        SP FLOPs: 32.1% of uOps
            Packed: 97.1% from SP FP
                128-bit: 97.1% from SP FP
                256-bit: 0.0% from SP FP
                512-bit: 0.0% from SP FP
            Scalar: 2.9% from SP FP
        DP FLOPs: 0.0% of uOps
            Packed: 0.0% from DP FP
                128-bit: 0.0% from DP FP
                256-bit: 0.0% from DP FP
                512-bit: 0.0% from DP FP
            Scalar: 100.0% from DP FP
        x87 FLOPs: 0.0% of uOps
        Non-FP: 67.9% of uOps
    FP Arith/Mem Rd Instr. Ratio: 0.773
    FP Arith/Mem Wr Instr. Ratio: 4.131

Collection and Platform Info
    Collection start time: 11:54:43 02/10/2021 UTC
    Collection stop time: 16:52:41 02/10/2021 UTC
    Collector Type: Driverless Perf per-process counting
    CPU
        Name: Intel(R) Xeon(R) Processor code named Icelake
        Frequency: 2.594 GHz 
        Logical CPU Count: 112
        Cache Allocation Technology
            Level 2 capability: not detected
            Level 3 capability: available

End of calculation time UTC
Sat 02 Oct 2021 04:52:54 PM UTC

Hi, I see its pretty quiet here. FYI, I wanted to let you know that I am planning to do some additional tests on Julia 1.8 as BLAS there is potentially changed to 4096 threads with automatic detection of number of threads on startup. https://discourse.julialang.org/t/blas-performance-testing-for-julia-1-8/69520 This might not be immediately, and I am not sure if there is any significant performance gain to be expected, however, this is my plan for the relatively near future.

I am aware that the tests above are not perfect, however, I tried to be as detailed and precise as possible at that moment.

Would you have any suggestions with regard to the way how the software should be run on CPUs?

I currently understand that this might be particularly important to confirm the setup before further experimenting with AlphaZero parameters.

I am on Julia Version 1.8.0-DEV.875 (2021-11-02). When trying julia -p 128, julia -p 64 on two sockets Xeon(R) Platinum 8358 I am getting:

  OpenBLAS blas_thread_init: pthread_create failed for thread 127 of 128: Resource temporarily unavailable
  OpenBLAS blas_thread_init: RLIMIT_NPROC 1024 current, 1024 max

As for now I understand that OpenBLAS spawns a thread count equal to the number of processors on the machine by default, which is higher than RLIMIT_NPROC on my machine. I understand that one can:

explicitly set the number of OpenBLAS threads by using: export OMP_NUM_THREADS=1 or i.e. start Julia with OMP_NUM_THREADS=1 julia -p 64. or
implicitly change this number per user (on ubuntu) in /etc/security/limits.conf and/or system wide potentially in /etc/sysctl.conf.

What I was currently interested in was to balance the number of Julia processes, number of Julia threads and number of OpenBLAS threads in order to check AlphaZero.jl performance on CPUs and the memory utilization. However, I see that due to some reasons 1.8.0-DEV.875 hangs and I have not investigated this topic further.

As AlphaZero.jl is to my interest I might take a look at it in the near future when 1.8 be closer to RC with a special emphasize on the new profiling mechanism presented recently by @jonathan-laurent. I am also planning to possibly consult the topic with some of my friends. My current understanding is that Julia abilities to run the code on CPUs might be somehow potentially underestimated. I would like to thank you for all valuable inputs and especially @jonathan-laurent for providing such interesting piece of code.

jonathan-laurent / AlphaZero.jl

Questions on parallelization #71

More general advice

741#threadsfor_fun at ./threadingconstructs.jl:85

invokelatest#2 at ./essentials.jl:716 [inlined]

conv_im2col!#390 at /home/xxxxxx/.julia/packages/NNlib/CSWJa/src/impl/conv_im2col.jl:49

155 at ./threadingconstructs.jl:178

Edit: