Open murphycrosby opened 3 years ago
@murphycrosby Have you had a chance to try this on a more recent version of TF?
@murphycrosby Have you tried swapping out the TCN for another model to see if the issue persists? I'd be curious whether it is actually related to TCN training in particular or is an oddity of PipeModeDataset. I would also like to see how you are setting up model training in SageMaker and which instance type, options, etc. you are using.
It sounds somewhat similar to this issue here, which is external to the TCN portion of the code. I don't know that it matters much for a normal TFRecordDataset, but the order of operations might matter for PipeModeDataset. You could try rearranging the parse/prefetch/batch ops to match the AWS example here.
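For reference, here is a minimal sketch of what such a reordering could look like. The helper name `build_input_pipeline`, the buffer sizes, and the specific prefetch, parse, batch order are assumptions for illustration and are not copied from the AWS example. The function only relies on the standard `tf.data.Dataset` methods `prefetch`, `map`, and `batch`, so it should accept either a `PipeModeDataset` or a `TFRecordDataset`.

```python
def build_input_pipeline(dataset, parse_fn, batch_size,
                         prefetch_buffer=10, parallel_calls=4):
    """Apply prefetch, then parse, then batch to a tf.data-style dataset.

    `dataset` is anything exposing the tf.data.Dataset methods used below,
    e.g. a PipeModeDataset or a tf.data.TFRecordDataset.
    The default buffer and parallelism values are illustrative assumptions.
    """
    # Buffer raw records as they are read from the source.
    dataset = dataset.prefetch(prefetch_buffer)
    # Parse individual records before any batching happens.
    dataset = dataset.map(parse_fn, num_parallel_calls=parallel_calls)
    # Batch last, so the model receives already-parsed examples.
    dataset = dataset.batch(batch_size)
    return dataset


# Hypothetical usage inside a SageMaker training script (not run here):
#   from sagemaker_tensorflow import PipeModeDataset
#   ds = PipeModeDataset(channel="training", record_format="TFRecord")
#   ds = build_input_pipeline(ds, parse_fn=my_parse_fn, batch_size=64)
#   model.fit(ds, epochs=1)
```

Keeping the ordering in one function makes it easy to run the same pipeline against both dataset types and see whether the hang follows the ordering or the pipe itself.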
Lastly, you could switch to the new FastFile mode with something like this example here.
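If you do try FastFile mode, the training script reads ordinary files instead of a pipe. Below is a rough sketch of the script side, assuming the channel is named `training` and the records use a `.tfrecord` suffix (both are assumptions, and `list_channel_files` is a hypothetical helper). SageMaker exposes each channel under `/opt/ml/input/data/<channel>`, and the training toolkit publishes that path in the `SM_CHANNEL_<NAME>` environment variable.

```python
import glob
import os


def list_channel_files(channel="training", pattern="*.tfrecord"):
    """Return the sorted record files visible for a File/FastFile channel.

    The channel name and file pattern are assumptions; adjust them to
    match how the training job's input channels are actually configured.
    """
    default_dir = os.path.join("/opt/ml/input/data", channel)
    channel_dir = os.environ.get("SM_CHANNEL_" + channel.upper(), default_dir)
    return sorted(glob.glob(os.path.join(channel_dir, pattern)))


# Hypothetical usage in the training script (not run here):
#   import tensorflow as tf
#   ds = tf.data.TFRecordDataset(list_channel_files("training"))
#
# On the launcher side, the estimator would be created with
# input_mode="FastFile" instead of input_mode="Pipe".
```

Since this path uses a plain `tf.data.TFRecordDataset`, it also takes `PipeModeDataset` out of the picture entirely.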
model.fit deadlocks when training on SageMaker with PipeModeDataset. CPUUtilization, MemoryUtilization, and DiskUtilization all drop to 0 on the training instance. The model trains fine when PipeModeDataset is swapped for tf.data.TFRecordDataset. The for loop proves that the dataset batch has been downloaded.
Example TCN Model
tensorflow==2.3.1
Output