greaber opened this issue 7 years ago:
@jart Maybe something to think about for the new backend data ingestion. Rather than always computing relative time as (event time - start time), we could tag relative time for each event by using:
```python
x = index_of_current_event()
if first_event_in_file and x == 0:
    relative_time[x] = 0
elif first_event_in_file:
    # Probably due to a restart; ignore the time gap. Slightly
    # inaccurate, since it is as though this step took no time, but
    # probably better than allowing an unrealistically large interval.
    relative_time[x] = relative_time[x - 1]
else:
    relative_time[x] = relative_time[x - 1] + absolute_time[x] - absolute_time[x - 1]
```
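For concreteness, here is a minimal runnable sketch of that rule applied in a loop over an event stream; the `Event` type, its fields, and the sample data are hypothetical stand-ins, not TensorBoard's actual event protos:

```python
from dataclasses import dataclass

@dataclass
class Event:
    wall_time: float           # absolute timestamp in seconds
    first_event_in_file: bool  # True for the first event of each file

def tag_relative_times(events):
    """Relative times per the rule above: gaps across file
    boundaries (i.e. restarts) contribute zero elapsed time."""
    relative_time = []
    for x, event in enumerate(events):
        if event.first_event_in_file and x == 0:
            relative_time.append(0.0)
        elif event.first_event_in_file:
            # Restart: carry the previous relative time forward.
            relative_time.append(relative_time[x - 1])
        else:
            relative_time.append(
                relative_time[x - 1] + event.wall_time - events[x - 1].wall_time
            )
    return relative_time

# Two files; the second starts after a 1000-second gap that we ignore.
events = [Event(0.0, True), Event(10.0, False),
          Event(1010.0, True), Event(1020.0, False)]
print(tag_relative_times(events))  # [0.0, 10.0, 10.0, 20.0]
```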
This is super good to know. What you proposed @dandelionmane is absolutely a good idea. It will definitely get rid of some potentially huge gaps.
What our friend seems to be asking for is what time-sharing systems would call "CPU time." We could do something like keep a running average of the time deltas and discard anything that deviates too far from the norm. But I feel like that would be rolling dice with the data.
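If someone did want to try it, a rough sketch of that running-average filter might look like the following; the smoothing factor `alpha` and the `threshold` multiplier here are guesses on my part, not tuned values:

```python
def flag_anomalous_deltas(deltas, alpha=0.1, threshold=5.0):
    """Flag time deltas that deviate too far from a running
    (exponential moving) average of the deltas seen so far."""
    flags, avg = [], None
    for d in deltas:
        is_outlier = avg is not None and d > threshold * avg
        flags.append(is_outlier)
        if not is_outlier:  # only let "normal" deltas update the average
            avg = d if avg is None else alpha * d + (1 - alpha) * avg
    return flags

print(flag_anomalous_deltas([10, 11, 9, 3600, 10]))
# [False, False, False, True, False]
```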
If TensorFlow had a feature for tracking CPU/GPU time scheduling, then I would absolutely be interested in using that information to improve the relative time plots.
I also like @dandelionmane's idea. I would further support changing the middle case (`elif first_event_in_file`) so that the time delta to the previous event is not zero, because a zero delta makes the data appear exactly coincident on a scalar chart. Perhaps we could set `rel[x] = rel[x - 1] + (rel[x - 1] - rel[x - 2])`, so that the delta is the same as the previous delta? (When `x == 1` we'd have to just go with `rel[x] = 0`, I guess.)
We could be more realistic by assuming that restart steps take as long as the first step did. This is because the model may take some time to initialize and get warmed up (load caches, initialize variables, what have you).
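Putting those two fallbacks together, here is a hedged sketch of what the restart branch could charge for a restart step; the names are illustrative, not TensorBoard internals:

```python
def restart_delta(rel, first_step_delta=None):
    """Pick a nonzero delta to charge a restart step.

    rel: relative times computed so far.
    first_step_delta: duration of the run's first step, if known
    (assumed to include warm-up cost).
    """
    if len(rel) >= 2:
        # Repeat the previous delta so the restart point doesn't land
        # exactly on top of the previous point in a scalar chart.
        return rel[-1] - rel[-2]
    if first_step_delta is not None:
        # Assume a restart warms up about as slowly as the first step
        # did (loading caches, initializing variables, and so on).
        return first_step_delta
    return 0.0  # no history to draw on; fall back to a zero delta
```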
Any progress on this?
@PetrochukM not that I'm aware of.
Relative mode and wall-clock mode are both useful for comparing absolute training time between different variants of a model whose training steps may take different lengths of time. The main advantage of relative mode is that it can cope with training runs that don't start at the same time. But it still doesn't handle the case where one model stops training for a while and is then restarted (either because the GPU was needed for something else, or because training crashed and had to be manually restarted). It would be useful if relative mode could detect unusually long intervals between logs to TensorBoard and clip them to a reasonable value, so that it remains useful even when training is interrupted like this.
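A minimal sketch of that clipping behavior, assuming gaps larger than some multiple of the median step delta get replaced by the median; both the multiplier and the choice of median are my assumptions, not a settled design:

```python
import statistics

def clip_long_gaps(wall_times, max_factor=10.0):
    """Convert absolute wall times to relative times, clipping any
    gap larger than max_factor * median gap down to the median gap."""
    deltas = [b - a for a, b in zip(wall_times, wall_times[1:])]
    if not deltas:
        return [0.0]
    median = statistics.median(deltas)
    rel = [0.0]
    for d in deltas:
        if median > 0 and d > max_factor * median:
            d = median  # interruption: charge a typical step instead
        rel.append(rel[-1] + d)
    return rel

# A 1-hour interruption between the 3rd and 4th log is clipped.
print(clip_long_gaps([0, 10, 20, 3620, 3630]))
# [0.0, 10.0, 20.0, 30.0, 40.0]
```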