v6d-io / v6d

vineyard (v6d): an in-memory immutable data manager. (Project under CNCF, TAG-Storage)
https://v6d.io
Apache License 2.0

Optimize the speed of concurrent get of pytorch models #1884

Closed dashanji closed 3 months ago

dashanji commented 4 months ago

Describe your problem

Currently, getting a PyTorch module at high concurrency is very slow, as shown below. Both test machines have a maximum network bandwidth of 30 Gbps.

Vineyard

| Concurrency | Time to get | Observed network bandwidth (dstat) |
|---|---|---|
| 1 | 2.57s | around 2000Mi |
| 6 | 7.73s | around 3800Mi |
| 13 | 14.58s | around 3800Mi |
| 27 | 29.32s | around 3800Mi |

Iperf

| Concurrency | Observed network bandwidth (dstat) | Total network bandwidth |
|---|---|---|
| 1 | around 1470Mi | 12 Gbit/s (1500MiB/s) |
| 6 | around 3700Mi | 31.1 Gbit/s (3888MiB/s) |
| 13 | around 3650Mi | 30.9 Gbit/s (3863MiB/s) |
| 27 | around 3650Mi | 30.9 Gbit/s (3863MiB/s) |
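A quick back-of-envelope check, using only the numbers reported above, shows the vineyard results are consistent with a single saturated link: the implied per-client transfer stays in the same range at every concurrency level, so get time grows roughly linearly with the number of clients. The inferred model size here is an estimate, not a measured value.

```python
# Sanity-check the benchmark numbers above: if one vineyardd link caps out
# around 3800 MiB/s, total data moved = time * bandwidth, and dividing by the
# concurrency gives the implied per-client transfer (i.e. model size).
results = {
    # concurrency: (total get time in seconds, observed bandwidth in MiB/s)
    1: (2.57, 2000),
    6: (7.73, 3800),
    13: (14.58, 3800),
    27: (29.32, 3800),
}

implied_sizes = {}
for concurrency, (seconds, bandwidth) in results.items():
    total_mib = seconds * bandwidth            # data moved during the run
    implied_sizes[concurrency] = total_mib / concurrency

for concurrency, size in implied_sizes.items():
    print(f"{concurrency:>2} clients: ~{size:,.0f} MiB per client")
```

Every row implies a per-client transfer of roughly 4000-5200 MiB, which is what a fixed-bandwidth bottleneck predicts: aggregate throughput is capped, so adding clients only stretches the wall-clock time.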

Solution

In the actual scenario, PyTorch models are usually loaded on machines with GPUs, which typically have high-performance networks. Thus, the network bandwidth of a single vineyardd instance is the bottleneck. We can distribute the PyTorch model blobs among different vineyard instances to increase the aggregate network bandwidth.
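A minimal sketch of the placement idea behind this proposal: assign each model blob to one of several vineyard instances so concurrent readers pull from multiple links instead of one. The blob names, sizes, and the greedy balancing heuristic are illustrative assumptions, not vineyard's actual implementation; a real version would put/get the blobs through the vineyard client on each instance.

```python
def shard_by_size(blobs, n_instances):
    """Greedy placement sketch: largest blob first, assigned to the
    currently least-loaded instance, so bytes (and thus bandwidth)
    spread roughly evenly across instances.

    blobs: dict of blob name -> size in MiB (stand-in for model tensors)
    Returns (placement map {name: instance index}, per-instance load in MiB).
    """
    load = [0] * n_instances
    placement = {}
    for name, size in sorted(blobs.items(), key=lambda kv: -kv[1]):
        idx = load.index(min(load))   # least-loaded instance so far
        placement[name] = idx
        load[idx] += size
    return placement, load

# Hypothetical model blobs (name -> size in MiB) spread over 2 instances.
blobs = {
    "layer0.weight": 1024,
    "layer0.bias": 1,
    "layer1.weight": 2048,
    "layer1.bias": 1,
}
placement, load = shard_by_size(blobs, n_instances=2)
print(placement)
print(load)  # bytes served per instance
```

With blobs spread like this, each concurrent get fans out across instances, so the achievable throughput approaches the sum of the instances' link bandwidths rather than a single 3800 MiB/s cap.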