greaber opened this issue 7 years ago:
@jart Maybe something to think about for the new backend data ingestion. Rather than always computing relative time as (event time - start time), we could tag relative time for each event by using:
```python
x = index_of_current_event()
if first_event_in_file and x == 0:
    relative_time[x] = 0
elif first_event_in_file:
    # Probably due to a restart; ignore the time gap. Slightly
    # inaccurate, since it is as though this step took no time, but
    # probably better than allowing an unrealistically large interval.
    relative_time[x] = relative_time[x - 1]
else:
    relative_time[x] = relative_time[x - 1] + absolute_time[x] - absolute_time[x - 1]
```
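For concreteness, here is a minimal runnable sketch of that rule applied in a loop over an event stream; the `Event` type, its fields, and the sample data are hypothetical stand-ins, not TensorBoard's actual event protos:

```python
from dataclasses import dataclass

@dataclass
class Event:
    wall_time: float           # absolute timestamp in seconds
    first_event_in_file: bool  # True for the first event of each file

def tag_relative_times(events):
    """Relative times per the rule above: gaps across file
    boundaries (i.e. restarts) contribute zero elapsed time."""
    relative_time = []
    for x, event in enumerate(events):
        if event.first_event_in_file and x == 0:
            relative_time.append(0.0)
        elif event.first_event_in_file:
            # Restart: carry the previous relative time forward.
            relative_time.append(relative_time[x - 1])
        else:
            relative_time.append(
                relative_time[x - 1] + event.wall_time - events[x - 1].wall_time
            )
    return relative_time

# Two files; the second starts after a 1000-second gap that we ignore.
events = [Event(0.0, True), Event(10.0, False),
          Event(1010.0, True), Event(1020.0, False)]
print(tag_relative_times(events))  # [0.0, 10.0, 10.0, 20.0]
```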
This is super good to know. What you proposed @dandelionmane is absolutely a good idea. It will definitely get rid of some potentially huge gaps.
What our friend seems to be asking for is what time-sharing systems would call "CPU time." We could do something like keep a running average of the time deltas and discard anything that deviates too far from the norm. But I feel like that would be rolling dice with the data.
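If someone did want to try it, a rough sketch of that running-average filter might look like the following; the smoothing factor `alpha` and the `threshold` multiplier here are guesses on my part, not tuned values:

```python
def flag_anomalous_deltas(deltas, alpha=0.1, threshold=5.0):
    """Flag time deltas that deviate too far from a running
    (exponential moving) average of the deltas seen so far."""
    flags, avg = [], None
    for d in deltas:
        is_outlier = avg is not None and d > threshold * avg
        flags.append(is_outlier)
        if not is_outlier:  # only let "normal" deltas update the average
            avg = d if avg is None else alpha * d + (1 - alpha) * avg
    return flags

print(flag_anomalous_deltas([10, 11, 9, 3600, 10]))
# [False, False, False, True, False]
```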
If TensorFlow had a feature for tracking CPU/GPU time scheduling, then I would absolutely be interested in using that information to improve the relative time plots.
I also like @dandelionmane's idea. I would further support changing the middle case (`elif first_event_in_file`) so that the time delta to the previous event is not zero, because a zero delta makes the data appear exactly coincident on a scalar chart. Perhaps we could set `rel[x] = rel[x - 1] + (rel[x - 1] - rel[x - 2])`, so that the delta is the same as the previous delta? (When `x == 1` we'd have to just go with `rel[x] = 0`, I guess.)
We could be more realistic by assuming that restart steps take as long as the first step did. This is because the model may take some time to initialize and get warmed up (load caches, initialize variables, what have you).
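Putting those two fallbacks together, here is a hedged sketch of what the restart branch could charge for a restart step; the names are illustrative, not TensorBoard internals:

```python
def restart_delta(rel, first_step_delta=None):
    """Pick a nonzero delta to charge a restart step.

    rel: relative times computed so far.
    first_step_delta: duration of the run's first step, if known
    (assumed to include warm-up cost).
    """
    if len(rel) >= 2:
        # Repeat the previous delta so the restart point doesn't land
        # exactly on top of the previous point in a scalar chart.
        return rel[-1] - rel[-2]
    if first_step_delta is not None:
        # Assume a restart warms up about as slowly as the first step
        # did (loading caches, initializing variables, and so on).
        return first_step_delta
    return 0.0  # no history to draw on; fall back to a zero delta
```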
Any progress on this?
@PetrochukM not that I'm aware of.
Relative mode and wall-clock mode are both useful for comparing absolute training time between different variants of a model whose training steps may take different lengths of time. The main advantage of relative mode is that it can cope with training runs that don't start at the same time. But it still doesn't handle the case where one model stops training for a while and is then restarted (either because the GPU was needed for something else, or because training crashed and had to be manually restarted). It would be useful if relative mode could detect unusually long intervals between logs to TensorBoard and clip them to a reasonable value, so that it remains useful even when training is interrupted like this.
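A minimal sketch of that clipping behavior, assuming gaps larger than some multiple of the median step delta get replaced by the median; both the multiplier and the choice of median are my assumptions, not a settled design:

```python
import statistics

def clip_long_gaps(wall_times, max_factor=10.0):
    """Convert absolute wall times to relative times, clipping any
    gap larger than max_factor * median gap down to the median gap."""
    deltas = [b - a for a, b in zip(wall_times, wall_times[1:])]
    if not deltas:
        return [0.0]
    median = statistics.median(deltas)
    rel = [0.0]
    for d in deltas:
        if median > 0 and d > max_factor * median:
            d = median  # interruption: charge a typical step instead
        rel.append(rel[-1] + d)
    return rel

# A 1-hour interruption between the 3rd and 4th log is clipped.
print(clip_long_gaps([0, 10, 20, 3620, 3630]))
# [0.0, 10.0, 20.0, 30.0, 40.0]
```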