oar-team / colmet

Colmet - Collecting metrics about jobs running in a distributed environment
GNU General Public License v3.0

Total data sent divided by two on Omni-Path #6

Closed: adfaure closed this issue 5 years ago

adfaure commented 5 years ago
perfquery -x

PortSelect:......................1
CounterSelect:...................0x0000
PortXmitData:....................15217586
PortRcvData:.....................33204887641
PortXmitPkts:....................3745479
PortRcvPkts:.....................65084876
PortUnicastXmitPkts:.............0
PortUnicastRcvPkts:..............0
PortMulticastXmitPkts:...........15

# send 10000 * 100000 bytes = 1 gigabyte:

ib_write_bw dahu-31.grenoble.grid5000.fr -s 10000 --iters 100000

perfquery -x

PortSelect:......................1
CounterSelect:...................0x0000
PortXmitData:....................15355105
PortRcvData:.....................33331290814
PortXmitPkts:....................3778876
PortRcvPkts:.....................65384944
PortUnicastXmitPkts:.............0
PortUnicastRcvPkts:..............0
PortMulticastXmitPkts:...........15
PortMulticastRcvPkts:............16

Variation of PortRcvData (multiplied by 4, assuming the counter is expressed in 4-byte words as on InfiniBand):

(33331290814 - 33204887641) * 4 / 1000000000 = 0.505612692 gigabytes instead of the expected 1 gigabyte.
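For reference, the same computation can be sketched in Python directly from the two perfquery dumps. This is only an illustration: the helper names (parse_perfquery, delta_bytes) are hypothetical and not part of colmet, and the 4-byte word size is the assumption used in the formula above.

```python
import re

def parse_perfquery(text):
    """Parse 'Name:.....value' lines of perfquery output into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        m = re.match(r"^(\w+):\.+(\S+)$", line.strip())
        if m:
            # Base 0 accepts both decimal values and hex such as 0x0000.
            counters[m.group(1)] = int(m.group(2), 0)
    return counters

def delta_bytes(before, after, counter="PortRcvData", word_size=4):
    """Counter variation converted to bytes (word_size is an assumption)."""
    return (after[counter] - before[counter]) * word_size

# Excerpts of the two dumps shown above.
BEFORE = """PortXmitData:....................15217586
PortRcvData:.....................33204887641"""
AFTER = """PortXmitData:....................15355105
PortRcvData:.....................33331290814"""

expected = 10000 * 100000  # bytes sent by ib_write_bw (-s 10000 --iters 100000)
measured = delta_bytes(parse_perfquery(BEFORE), parse_perfquery(AFTER))
ratio = measured / expected

print(measured, expected, ratio)
```

With the values above, the measured variation is about half of the expected volume, which is the discrepancy reported in this issue.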

Tested on dahu.

lambertrocher commented 5 years ago

Use the options --enable-infiniband --omnipath (see https://github.com/oar-team/colmet/commit/540ddb6aec4c7e262cedbd7f05722d3c919daf25).

bzizou commented 5 years ago

Should I update colmet on Dahu or wait a bit for other fixes if any?

adfaure commented 5 years ago

Hi, for the moment it should be fine. Is it possible to keep a trace of which jobs were affected by this bug? Maybe just by logging the update dates, so that we can take this bug into account in further analyses.
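As a rough sketch of that idea, assuming the date of the colmet update on a cluster is logged and job start times are known, the affected jobs could be flagged as follows. All names and dates here are hypothetical placeholders, not real colmet or OAR data.

```python
from datetime import datetime

def affected_jobs(job_starts, fix_deployed_at):
    """Return the ids of jobs that started before the fix was deployed.

    Such jobs have at least part of their samples collected with the
    buggy counter conversion.
    """
    return sorted(
        job_id for job_id, start in job_starts.items()
        if start < fix_deployed_at
    )

# Placeholder values for illustration only.
fix_deployed_at = datetime(2000, 1, 2)
job_starts = {
    "job-a": datetime(2000, 1, 1),  # started before the update
    "job-b": datetime(2000, 1, 3),  # started after the update
}

flagged = affected_jobs(job_starts, fix_deployed_at)
print(flagged)
```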

Thanks.