quiver-team / quiver-feature

High-performance RDMA-based distributed feature collection component for training GNN models on extremely large graphs
Apache License 2.0

RDMA TLB tests under different feature dimensions #19

Open eedalong opened 2 years ago

eedalong commented 2 years ago

RDMA TLB Results

call for help: @Aiemu https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py

IB Params:

 POST_LIST_SIZE = 128
 CQ_MOD = 1
 QP_NUM = 8
 TX_DEPTH = 2048

FeatureDim = 128, Tensor Size: 228.8818359375 GB, Sample Size = 250000
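As a sanity check on the tensor size above: assuming float32 features (4 bytes per element, my assumption rather than something stated in the issue), the reported size back-solves to 480M rows.

```python
# Back-of-the-envelope check of the reported tensor size (float32 assumed).
FEATURE_DIM = 128
NUM_ELEMENT = 480_000_000   # rows implied by 228.88 GiB at dim 128 and 4 B/element
BYTES_PER_FLOAT = 4

size_gib = NUM_ELEMENT * FEATURE_DIM * BYTES_PER_FLOAT / 1024 ** 3
print(size_gib)             # 228.8818359375
```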

W/O TLB

2 machines, 2 GPUs: 8488.63404334975 MB/s
2 machines, 4 GPUs:
2 machines, 6 GPUs:

W/ TLB

2 machines, 2 GPUs:
2 machines, 4 GPUs:
2 machines, 6 GPUs:

eedalong commented 2 years ago

I re-analyzed this today. A key reason why the cost of TLB misses depends on the feature dimension is most likely the ratio of page-table-walk (PTW) time to feature-read time: when features are large, the PTW overhead is relatively insignificant, but when features are small, PTW accounts for a much larger share of the total cost.

So this really breaks down into two dimensions (see the sweep sketch after the list):

  1. Fix FeatureDim and keep increasing NUM_ELEMENT
  2. Fix NUM_ELEMENT and keep increasing FeatureDim
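A minimal sketch of such a sweep; `bench_once` and the concrete sweep values are illustrative placeholders, not the actual API of `test_MultiMachineDistTensorClientServer.py`:

```python
# Illustrative two-axis sweep; bench_once() is a hypothetical placeholder for
# one client/server bandwidth measurement at the given tensor shape.
def bench_once(num_element: int, feature_dim: int) -> float:
    """Return measured bandwidth in MB/s; replace with the real benchmark call."""
    return 0.0  # placeholder

# Axis 1: fix FeatureDim, keep increasing NUM_ELEMENT.
for num_element in (60_000_000, 120_000_000, 240_000_000, 480_000_000):
    print(f"dim=128 rows={num_element}: {bench_once(num_element, 128)} MB/s")

# Axis 2: fix NUM_ELEMENT, keep increasing FeatureDim.
for feature_dim in (32, 64, 128, 256, 512):
    print(f"dim={feature_dim} rows=120000000: {bench_once(120_000_000, feature_dim)} MB/s")
```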
Aiemu commented 2 years ago

IB Params

POST_LIST_SIZE = 128
CQ_MOD = 1
QP_NUM = 8
TX_DEPTH = 2048

W/O TLB

2 machines, 2 GPUs

| Run | Server1 (MB/s) | Server2 (MB/s) |
|-----|----------------|----------------|
| 0 | 8798.309074974653 | 8925.925280242674 |
| 1 | 8776.74163466813 | 8940.264366411147 |
| 2 | 8864.57287302192 | 8876.406442329364 |
| Avg | 8813.207860888235 | 8914.198696327729 |

2 machines, 4 GPUs

| Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) |
|-----|---------------------|---------------------|---------------------|---------------------|
| 0 | 8592.910848549946 | 8788.04002677606 | 8784.270665339876 | 8780.655119190533 |
| 1 | 8797.553180521667 | 8774.936587372318 | 8914.114595121611 | 8973.797213215319 |
| 2 | 8524.098892866063 | 8900.942248183304 | 8922.503180384434 | 8851.85249217683 |
| Avg | 8638.187640645892 | 8821.306287443893 | 8873.629480281974 | 8868.768274860895 |

2 machines, 6 GPUs

| Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server1-GPU3 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) | Server2-GPU3 (MB/s) |
|-----|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| 0 | 8482.438701126574 | 8943.231441048036 | 8717.267681411107 | 8778.8484619869 | 8799.670012374536 | 8948.702263392466 |
| 1 | 8652.1562795728 | 8897.38465548701 | 8966.253962138591 | 8694.766158339844 | 8954.179783140959 | 8694.766158339844 |
| 2 | 8745.708282800677 | 8748.099167905411 | 8819.982773471145 | 8723.803032884649 | 8905.586864259376 | 8735.561584003002 |
| Avg | 8626.767754500017 | 8862.90508814682 | 8834.50147234028 | 8732.472551070465 | 8886.478886591625 | 8793.010001911769 |

W/ TLB

2 machines, 2 GPUs

| Run | Server1 (MB/s) | Server2 (MB/s) |
|-----|----------------|----------------|
| 0 | 8894.293407452446 | 9021.231609549819 |
| 1 | 9041.782926570833 | 9033.646805582512 |
| 2 | 8788.643424824484 | 8908.06597536363 |
| Avg | 8908.239919615922 | 8987.64813016532 |

2 machines, 4 GPUs

| Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) |
|-----|---------------------|---------------------|---------------------|---------------------|
| 0 | 8828.347271316492 | 8765.472256937906 | 8852.311629032816 | 9036.197737420802 |
| 1 | 8821.958405844547 | 8898.31244894767 | 9007.107170501724 | 8746.75413420801 |
| 2 | 8805.723720418271 | 8874.560171944604 | 8978.203307205358 | 8830.022075055187 |
| Avg | 8818.67646585977 | 8846.114959276727 | 8945.874035579967 | 8870.991315561332 |

2 machines, 6 GPUs

| Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server1-GPU3 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) | Server2-GPU3 (MB/s) |
|-----|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|
| 0 | 8924.525013073035 | 8956.999405199258 | 8989.553156000351 | 8889.66056081257 | 8843.596165471976 | 8831.39284174213 |
| 1 | 8607.790723088045 | 8871.484760799127 | 9010.594488050403 | 8880.255307340087 | 8852.15857812203 | 7898.187427689934 |
| 2 | 8737.94692379896 | 8720.385604550951 | 8884.570000694108 | 8930.59601262842 | 8805.723720418271 | 8772.831637024092 |
| Avg | 8756.75421998668 | 8849.62325684978 | 8961.572548248287 | 8900.170626927027 | 8833.82615467076 | 8500.80396881872 |

eedalong commented 2 years ago

@Aiemu Please run this file again with the dimension still set to 128, but don't use an overly large FeatureSize; something around 80 GB is enough.
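For reference, a rough sizing for the ~80 GB target at FeatureDim = 128, again assuming float32 features; the names below are only illustrative and may not match the constants used in the test script.

```python
# Hypothetical sizing helper: how many rows give roughly 80 GiB at dim 128?
BYTES_PER_FLOAT = 4                                        # float32 assumed
FEATURE_DIM = 128
TARGET_GIB = 80

bytes_per_row = FEATURE_DIM * BYTES_PER_FLOAT              # 512 B per row
num_element = TARGET_GIB * (1024 ** 3) // bytes_per_row    # 167,772,160 rows
print(num_element, num_element * bytes_per_row / 1024 ** 3)
```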