Create a Spot Fleet or a persistent Spot request, so that AWS manages the Spot Instance and relaunches it after an interruption. Define the training command in the user_data of the Spot Instance.
Read all data and configuration files from S3, and save the trained network back to S3. This can be implemented with boto3, which removes the complex dependency on starcluster.
Whenever the saved network file is updated, plot the learning curve online using Plotly.
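
The first two steps could be sketched roughly as follows with boto3. This is only an illustration under stated assumptions, not a finished implementation: the bucket layout (input/ and output/ prefixes), the /opt/train paths, the helper names, and the instance profile are all hypothetical, and the AMI is assumed to ship with the AWS CLI and to run user data through cloud-init.

```python
import base64


def build_user_data(bucket: str, train_cmd: str) -> str:
    """Return a shell script for the instance to run at first boot.

    The S3 prefixes and local paths are hypothetical placeholders.
    """
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "mkdir -p /opt/train/output",
        # Pull data and configuration files from S3.
        f"aws s3 sync s3://{bucket}/input /opt/train/input",
        "cd /opt/train",
        # Training command; assumed to write the network to /opt/train/output.
        train_cmd,
        # Save the trained network back to S3.
        f"aws s3 sync /opt/train/output s3://{bucket}/output",
    ]
    return "\n".join(lines) + "\n"


def build_spot_request(image_id: str, instance_type: str,
                       user_data: str, instance_profile: str) -> dict:
    """Build keyword arguments for EC2 request_spot_instances."""
    # request_spot_instances expects the user data to be base64-encoded.
    encoded = base64.b64encode(user_data.encode("utf-8")).decode("ascii")
    return {
        "InstanceCount": 1,
        "Type": "persistent",  # the request stays open after an interruption
        "LaunchSpecification": {
            "ImageId": image_id,
            "InstanceType": instance_type,
            "UserData": encoded,
            # The instance profile must grant access to the S3 bucket.
            "IamInstanceProfile": {"Name": instance_profile},
        },
    }


def submit_spot_request(params: dict, region: str) -> dict:
    """Submit the request; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builders above work without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.request_spot_instances(**params)
```

Keeping the builders separate from the submit call means the generated script and request parameters can be inspected before anything is launched. Because a relaunched instance runs the same script again, the training command would need to resume from the latest network saved in S3 rather than start from scratch.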