eedalong opened this issue 2 years ago
Re-analyzed this today. A core reason the cost of TLB misses correlates with the feature dimension is most likely the ratio of page-table-walk (PTW) time to feature-read time. When features are large, the PTW cost is relatively insignificant; when features are small, the PTW cost becomes a much larger share of the total read time.

So this really breaks down into two dimensions:
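A back-of-envelope model makes the ratio argument concrete. This is my own sketch, not code from the repo, and the latency and bandwidth constants below are illustrative assumptions, not measurements:

```python
# Model: per-feature read time ~ PTW latency + dim * bytes_per_elem / bandwidth.
# PTW_NS and BW_GBPS are assumed values for illustration only.

PTW_NS = 100.0          # assumed page-table-walk latency on a TLB miss, ns
BW_GBPS = 12.0          # assumed effective read bandwidth, GB/s (== bytes/ns)
BYTES_PER_ELEM = 4      # float32

def ptw_share(dim: int) -> float:
    """Fraction of per-feature read time spent in the page-table walk."""
    transfer_ns = dim * BYTES_PER_ELEM / BW_GBPS
    return PTW_NS / (PTW_NS + transfer_ns)

for dim in (32, 128, 512, 2048):
    print(f"dim={dim:5d}  PTW share ~ {ptw_share(dim):.1%}")
```

Whatever the exact constants, the PTW share shrinks monotonically as the feature dimension grows, which matches the observation above that TLB-miss overhead is most visible for small features.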
```
POST_LIST_SIZE = 128
CQ_MOD = 1
QP_NUM = 8
TX_DEPTH = 2048
```
Run | Server1 (MB/s) | Server2 (MB/s)
---|---|---
0 | 8798.31 | 8925.93
1 | 8776.74 | 8940.26
2 | 8864.57 | 8876.41
Avg | 8813.21 | 8914.20
Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s)
---|---|---|---|---
0 | 8592.91 | 8788.04 | 8784.27 | 8780.66
1 | 8797.55 | 8774.94 | 8914.11 | 8973.80
2 | 8524.10 | 8900.94 | 8922.50 | 8851.85
Avg | 8638.19 | 8821.31 | 8873.63 | 8868.77
Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server1-GPU3 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) | Server2-GPU3 (MB/s)
---|---|---|---|---|---|---
0 | 8482.44 | 8943.23 | 8717.27 | 8778.85 | 8799.67 | 8948.70
1 | 8652.16 | 8897.38 | 8966.25 | 8694.77 | 8954.18 | 8694.77
2 | 8745.71 | 8748.10 | 8819.98 | 8723.80 | 8905.59 | 8735.56
Avg | 8626.77 | 8862.91 | 8834.50 | 8732.47 | 8886.48 | 8793.01
Run | Server1 (MB/s) | Server2 (MB/s)
---|---|---
0 | 8894.29 | 9021.23
1 | 9041.78 | 9033.65
2 | 8788.64 | 8908.07
Avg | 8908.24 | 8987.65
Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s)
---|---|---|---|---
0 | 8828.35 | 8765.47 | 8852.31 | 9036.20
1 | 8821.96 | 8898.31 | 9007.11 | 8746.75
2 | 8805.72 | 8874.56 | 8978.20 | 8830.02
Avg | 8818.68 | 8846.11 | 8945.87 | 8870.99
Run | Server1-GPU1 (MB/s) | Server1-GPU2 (MB/s) | Server1-GPU3 (MB/s) | Server2-GPU1 (MB/s) | Server2-GPU2 (MB/s) | Server2-GPU3 (MB/s)
---|---|---|---|---|---|---
0 | 8924.53 | 8957.00 | 8989.55 | 8889.66 | 8843.60 | 8831.39
1 | 8607.79 | 8871.48 | 9010.59 | 8880.26 | 8852.16 | 7898.19
2 | 8737.95 | 8720.39 | 8884.57 | 8930.60 | 8805.72 | 8772.83
Avg | 8756.75 | 8849.62 | 8961.57 | 8900.17 | 8833.83 | 8500.80
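As a sanity check, each Avg row is the arithmetic mean over the three runs. The snippet below verifies this for the Server1-GPU1 column of the preceding table, using the full-precision measured values:

```python
# Avg rows in the tables above are the mean over the three runs.
# These are the three Server1-GPU1 measurements from the preceding table.
runs = [8924.525013073035, 8607.790723088045, 8737.94692379896]
avg = sum(runs) / len(runs)
print(round(avg, 2))  # -> 8756.75, matching the Avg row
```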
**RDMA TLB Results**
Call for help: @Aiemu, the benchmark script is https://github.com/quiver-team/quiver-feature/blob/main/tests/python/test_MultiMachineDistTensorClientServer.py
IB Params:
FeatureDim = 128, Tensor Size = 228.88 GB, Sample Size = 250000
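The reported tensor size is consistent with 480 M rows of 128 float32 features; the row count is inferred from the size here, not stated in the issue, and GiB units (2^30 bytes) are assumed:

```python
# Check the reported tensor size: rows * dim * 4 bytes, expressed in GiB.
# rows = 480,000,000 is inferred from the reported size, not stated above.
rows, dim, bytes_per_elem = 480_000_000, 128, 4
size_gib = rows * dim * bytes_per_elem / 2**30
print(size_gib)  # -> 228.8818359375, the tensor size reported above
```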
W/O TLB

2 machines × 2 GPUs: 8488.63 MB/s
2 machines × 4 GPUs:
2 machines × 6 GPUs:

W/ TLB

2 machines × 2 GPUs:
2 machines × 4 GPUs:
2 machines × 6 GPUs: