What is your computer configuration? I'd like to use it as a reference

LeungH commented 6 years ago

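What is your computer configuration? I'd like to use it as a reference. I also have a question: I'm running the 885 size with TensorFlow on an i5 6300HQ CPU and a GTX 965M GPU, and the GPU run takes longer than the CPU run. Is that because the GPU is too weak?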

junxiaosong commented 6 years ago

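My machine has an i5-4590 and a GTX750 GPU. All of my own experiments were run on the original Theano version; I only ran the TensorFlow version briefly to confirm that the logic is correct, so I have no experience with its performance. PS: If you have any findings or suggestions for improving performance, feedback is welcome~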

mine260309 commented 6 years ago

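My setup is Win10, E5-1650, GTX970 with 4GiB of VRAM. I ran it on TensorFlow and found that it is indeed slow. GPU memory is almost fully occupied, but GPU usage is only 3~4%, so it looks like the GPU is not being fully utilized.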

mine260309 commented 6 years ago

I added some logging, and it looks like most of the time is spent in start_self_play(), when MCTSPlayer.get_action() makes this call:

acts, probs = self.mcts.get_move_probs(board, temp)

This step takes about 2 seconds...
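
A measurement like this can be sketched with a small timing helper. This is only an illustration, not the logging actually used here: `timed` and `fake_get_move_probs` are hypothetical names, and the stand-in function just burns some CPU time in place of the real `self.mcts.get_move_probs(board, temp)` call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def fake_get_move_probs():
    # Hypothetical stand-in for self.mcts.get_move_probs(board, temp).
    return sum(i * i for i in range(100_000))


timings = {}
with timed("get_move_probs", timings):
    fake_get_move_probs()

print(f"get_move_probs took {timings['get_move_probs']:.4f} s")
```

Wrapping the real call the same way would give one number per move, which is enough to confirm where the time goes before digging further.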

junxiaosong commented 6 years ago

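It is reasonable for the time to be concentrated in the get_move_probs step, because each call runs 400 MCTS playouts. We probably need to dig deeper to see whether anything can be optimized.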
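
One possible way to dig deeper is Python's built-in `cProfile`, which breaks the time of a call down by function. The sketch below is an assumption-laden illustration: `playout` and `get_move_probs_stub` are hypothetical stand-ins (a 64-element list plays the role of a board), not the project's real MCTS code. Only the figure of 400 playouts per call comes from the discussion above.

```python
import cProfile
import io
import pstats


def playout(state):
    # Hypothetical stand-in for a single MCTS playout.
    return sum(state) % 7


def get_move_probs_stub(n_playout=400):
    # Hypothetical stand-in: run n_playout playouts on a dummy 64-cell state.
    state = list(range(64))
    return [playout(state) for _ in range(n_playout)]


profiler = cProfile.Profile()
profiler.enable()
get_move_probs_stub()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Profiling the real `get_move_probs` in the same way should show how the time splits between the tree search itself and the network forward calls.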

Kelvin-Zhong commented 6 years ago

Same here. I attached a GPU and it ran slower than the CPU.

Kelvin-Zhong commented 6 years ago

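Thanks for your detailed reply @BIGBALLON. I'm running on AWS, and it seems to be working; the output is below. The speed feels about the same? Whether distributed or multi-process, either approach requires modifying the original code, which is quite a lot of work. I think it's better to look at which of the more time-consuming operations can be simplified, which still requires a breakdown of the CPU computation.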

2018-03-16 08:06:12.535357: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-16 08:06:12.535588: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-03-16 08:06:12.535620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-03-16 08:06:12.823584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-03-16 08:06:12.823643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0
2018-03-16 08:06:12.823656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N
2018-03-16 08:06:12.823847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3651 MB memory) -> physical GPU (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0)
batch i:1, episode_len:16
batch i:2, episode_len:9

ubuntu@ip-172-31-34-117:~$ nvidia-smi
Fri Mar 16 08:06:41 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111                Driver Version: 384.111                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           On   | 00000000:00:03.0 Off |                  N/A |
| N/A   43C    P0    46W / 125W |   3812MiB /  4036MiB |     20%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     26831      C   python3                                     3801MiB |
+-----------------------------------------------------------------------------+

junxiaosong commented 6 years ago

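Regarding the GPU not being faster than the CPU, I think there may be two reasons: 1. A large part of the computation in AlphaZero training has to run on the CPU anyway, and frequently exchanging data between the CPU and GPU has its own overhead. 2. The board we run is very small and the network I use is very shallow, so the gain from moving the network forward computation to the GPU may be entirely offset by the extra data transfer overhead.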

With a larger board (though still very small compared with typical images) and a deeper network, the GPU would probably be more useful.
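
The trade-off can be illustrated with a toy cost model in which every network evaluation pays a fixed per-call overhead (for example, data transfer) plus compute time proportional to the amount of work. All numbers below are made-up placeholders chosen only to show the shape of the argument, not measurements. The function names and the `CPU`/`GPU` parameter sets are hypothetical.

```python
def eval_time_ms(work_units, overhead_ms, ms_per_unit):
    """Toy model: one evaluation = fixed per-call overhead + compute time."""
    return overhead_ms + work_units * ms_per_unit


def move_time_ms(n_playouts, work_units, overhead_ms, ms_per_unit):
    """Time for one move if every playout triggers one network evaluation."""
    return n_playouts * eval_time_ms(work_units, overhead_ms, ms_per_unit)


# Placeholder parameters (assumptions, NOT measured values):
# the CPU has little per-call overhead but slow compute,
# the GPU has high per-call overhead but fast compute.
CPU = dict(overhead_ms=0.05, ms_per_unit=0.010)
GPU = dict(overhead_ms=1.00, ms_per_unit=0.001)

N_PLAYOUTS = 400    # playouts per move, as mentioned earlier in the thread
SMALL_WORK = 50     # placeholder: small board, shallow network
LARGE_WORK = 5000   # placeholder: larger board, deeper network

for name, work in [("small", SMALL_WORK), ("large", LARGE_WORK)]:
    cpu = move_time_ms(N_PLAYOUTS, work, **CPU)
    gpu = move_time_ms(N_PLAYOUTS, work, **GPU)
    print(f"{name}: cpu={cpu:.1f} ms, gpu={gpu:.1f} ms")
```

Under these assumed numbers the GPU loses on the small workload, where the fixed overhead dominates, and wins on the large one, where compute dominates.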

Kelvin-Zhong commented 6 years ago

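@junxiaosong I'd like to ask: is it more effective to increase the network depth or to increase the input dimension? And how would one go about it? Also, is the GPU only useful here for the forward computation? Is there a way to convert the Monte Carlo tree search and simulation into matrix operations or something similar to take advantage of the GPU?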

junxiaosong / AlphaZero_Gomoku

What is your computer configuration? I'd like to use it as a reference #15