tensorflow / benchmarks

A benchmark framework for TensorFlow
Apache License 2.0

How to evaluate worker performance independently in distributed training #520

Open delucca opened 2 years ago

delucca commented 2 years ago

Hi

I'm trying to evaluate the performance of each worker independently in a cluster with multiple machines, all training the same model. My goal is to record each worker's training performance.

With every setup and config I try, I always get the same time for all workers (probably because the workers synchronize with each other on every step). So even if one of my workers is a machine that is 4x faster, it still records the same time as the slowest machine in the cluster.

Does anyone have any idea how I can do that?
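
To illustrate the synchronization effect described above, here is a minimal sketch that uses only the Python standard library, not TensorFlow or the benchmark scripts. It assumes each training step consists of local compute followed by a synchronous gradient exchange, with `time.sleep` standing in for the compute and `threading.Barrier` standing in for the exchange. The names `run_worker` and `simulate` are hypothetical and chosen only for this example. It suggests that timing the local compute separately from the full step may be one way to see per-worker differences that the full step time hides.

```python
import threading
import time


def run_worker(name, compute_seconds, barrier, steps, results):
    """Simulate one worker and record two averages per step.

    'compute' covers only the local work before synchronization.
    'step' covers the whole step, including the wait for other workers.
    """
    compute_total = 0.0
    step_total = 0.0
    for _ in range(steps):
        step_start = time.perf_counter()
        time.sleep(compute_seconds)  # stand-in for the local forward/backward pass
        compute_end = time.perf_counter()
        barrier.wait()  # stand-in for a synchronous gradient exchange
        step_end = time.perf_counter()
        compute_total += compute_end - step_start
        step_total += step_end - step_start
    results[name] = {
        "compute": compute_total / steps,
        "step": step_total / steps,
    }


def simulate(speeds, steps=3):
    """Run one thread per worker; `speeds` maps worker name to compute time."""
    barrier = threading.Barrier(len(speeds))
    results = {}
    threads = [
        threading.Thread(
            target=run_worker, args=(name, secs, barrier, steps, results)
        )
        for name, secs in speeds.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Hypothetical workers: "fast" needs 0.02 s per step, "slow" needs 0.1 s.
    for name, timing in simulate({"fast": 0.02, "slow": 0.1}).items():
        print(f"{name}: compute={timing['compute']:.3f}s step={timing['step']:.3f}s")
```

In this simulation both workers report roughly the same full step time, because the fast worker waits at the barrier, while the compute-only timings differ. Whether the same split can be measured in a real cluster depends on where the training framework performs its synchronization, which this sketch does not model.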