The TensorFlow Cloud repository provides APIs that allow you to easily go from debugging, training, and tuning your Keras and TensorFlow code in a local environment to distributed training and tuning on Google Cloud.
run API for GCP training/tuning
Requirements:
Python >= 3.6
Google AI platform APIs enabled for your GCP account. We use the AI platform for deploying docker images on GCP.
Either a functioning installation of docker, if you want to use a local docker process for your build, or a Cloud Storage bucket to use with Google Cloud Build for docker image build and publishing.
(optional) nbconvert, if you are using a notebook file as entry_point, as shown in usage guide #4.
For detailed end-to-end setup instructions, please see the Setup instructions section.
pip install -U tensorflow-cloud
git clone https://github.com/tensorflow/cloud.git
cd cloud
pip install src/python/.
The TensorFlow Cloud package provides the run API for training your models on GCP.
To start, let's walk through a simple workflow using this API.
Let's begin with Keras model training code such as the following, saved as mnist_example.py.
import tensorflow as tf
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape((60000, 28 * 28))
x_train = x_train.astype('float32') / 255
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=128)
After you have tested this model in your local environment for a few epochs, probably with a small dataset, you can train the model on Google Cloud by writing the following simple script, scale_mnist.py.
import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.py')
Running scale_mnist.py
will automatically apply TensorFlow
one device strategy
and train your model at scale on Google Cloud Platform. Please see the
usage guide section for detailed instructions and additional
API parameters.
You will see an output similar to the following on your console. This information can be used to track the training job status.
user@desktop$ python scale_mnist.py
Job submitted successfully.
Your job ID is: tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e
Please access your job logs at the following URL:
https://console.cloud.google.com/mlengine/jobs/tf_cloud_train_519ec89c_a876_49a9_b578_4fe300f8865e?project=prod-123
End-to-end instructions to help set up your environment for TensorFlow Cloud. You can use one of the following notebooks to set up your project, or follow the instructions below.
Run in Colab | View on GitHub | Run in Kaggle
Create a new local directory
mkdir tensorflow_cloud
cd tensorflow_cloud
Make sure you have python >= 3.6
python -V
Set up virtual environment
virtualenv tfcloud --python=python3
source tfcloud/bin/activate
Set up your Google Cloud project
Verify that gcloud sdk is installed.
which gcloud
Set default gcloud project
export PROJECT_ID=<your-project-id>
gcloud config set project $PROJECT_ID
Create a service account.
export SA_NAME=<your-sa-name>
gcloud iam service-accounts create $SA_NAME
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
--role 'roles/editor'
Create a key for your service account.
gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com
Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.
export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
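Alternatively, the same credential path can be set from Python (for example at the top of a notebook) before calling the run API. A minimal sketch, assuming the key file created above lives at ~/key.json:
import os
# Point the Google auth libraries at the service account key created above.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = os.path.expanduser('~/key.json')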
Create a Cloud Storage bucket. Using Google Cloud Build is the recommended method for building and publishing docker images, although we optionally allow using a local docker daemon process depending on your specific needs.
BUCKET_NAME="your-bucket-name"
REGION="us-central1"
gcloud auth login
gsutil mb -l $REGION gs://$BUCKET_NAME
(optional, for local docker setup) Start the docker daemon:
sudo dockerd
Authenticate access to Google Container Registry.
gcloud auth configure-docker
Install nbconvert if you plan to use a notebook file as entry_point, as shown in usage guide #4.
pip install nbconvert
Install the latest release of tensorflow-cloud.
pip install tensorflow-cloud
As described in the high level overview, the run API allows you to train your models at scale on GCP. The run API can be used in four different ways, defined by where you are running the API (terminal vs. IPython notebook) and by your entry_point parameter.
entry_point is an optional Python script or notebook file path to the file that contains your TensorFlow Keras training code. This is the most important parameter in the API.
run(entry_point=None,
    requirements_txt=None,
    distribution_strategy='auto',
    docker_config='auto',
    chief_config='auto',
    worker_config='auto',
    worker_count=0,
    entry_point_args=None,
    stream_logs=False,
    job_labels=None,
    **kwargs)
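For illustration, here is a sketch combining a few of the optional parameters above; the requirements file name and label values are placeholders, not part of the original example:
import tensorflow_cloud as tfc
# Placeholder values; adjust to your own project.
tfc.run(
    entry_point='mnist_example.py',
    requirements_txt='requirements.txt',  # additional pip dependencies baked into the docker image
    stream_logs=True,                     # stream the remote job logs back to this console
    job_labels={'job': 'mnist', 'team': 'keras'})  # labels attached to the AI Platform job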
1. Using a Python file as entry_point.
If you have your tf.keras model in a Python file (mnist_example.py), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.
import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.py')
Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. It's recommended to create a new directory to house each cloud project, containing the necessary files and nothing else, to optimize image build times.
2. Using a notebook file as entry_point.
If you have your tf.keras model in a notebook file (mnist_example.ipynb), then you can write the following simple script (scale_mnist.py) to scale your model on GCP.
import tensorflow_cloud as tfc
tfc.run(entry_point='mnist_example.ipynb')
Please note that all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file. As with the Python script entry_point above, we recommend creating a new directory to house each cloud project, containing the necessary files and nothing else, to optimize image build times.
3. Using run within a Python script that contains the tf.keras model.
You can use the run API from within the Python file that contains the tf.keras model (mnist_scale.py). In this use case, entry_point should be None. The run API can be called anywhere, and the entire file will be executed remotely. Calling the API at the end of the script also lets you run the script locally for debugging purposes (possibly with fewer epochs and other flags).
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_cloud as tfc
tfc.run(
    entry_point=None,
    distribution_strategy='auto',
    requirements_txt='requirements.txt',
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=2),
    worker_count=0)
datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
BUFFER_SIZE = 10000
BATCH_SIZE = 64
def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label
train_dataset = mnist_train.map(scale).cache()
train_dataset = train_dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=tf.keras.optimizers.Adam(),
              metrics=['accuracy'])
model.fit(train_dataset, epochs=12)
Please note that all the files in the same directory tree as the python script will be packaged in the docker image created, along with the python file. It's recommended to create a new directory to house each cloud project which includes necessary files and nothing else, to optimize image build times.
4. Using run within a notebook that contains the tf.keras model.
In this use case, entry_point should be None and docker_config.image_build_bucket must be specified, to ensure the build can be stored and published.
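For example, a minimal sketch of such a notebook cell, assuming the DockerConfig helper exposed as tfc.DockerConfig and a Cloud Storage bucket named 'your-bucket-name':
import tensorflow_cloud as tfc
# entry_point=None: the notebook containing this call is executed remotely.
# image_build_bucket: GCS bucket used by Google Cloud Build to build and store the docker image.
tfc.run(
    entry_point=None,
    docker_config=tfc.DockerConfig(image_build_bucket='your-bucket-name'))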
By default, the run API takes care of wrapping your model code in a TensorFlow distribution strategy based on the cluster configuration you have provided.
No distribution
CPU chief config and no additional workers
tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['CPU'])
OneDeviceStrategy
1 GPU on chief (defaults to AcceleratorType.NVIDIA_TESLA_T4) and no additional workers.
tfc.run(entry_point='mnist_example.py')
MirroredStrategy
Chief config with multiple GPUs (AcceleratorType.NVIDIA_TESLA_V100).
tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_4X'])
MultiWorkerMirroredStrategy
Chief config with 1 GPU and 2 workers, each with 8 GPUs (AcceleratorType.NVIDIA_TESLA_V100).
tfc.run(entry_point='mnist_example.py',
        chief_config=tfc.COMMON_MACHINE_CONFIGS['V100_1X'],
        worker_count=2,
        worker_config=tfc.COMMON_MACHINE_CONFIGS['V100_8X'])
TPUStrategy
Chief config with 1 CPU and 1 worker with TPU.
tfc.run(entry_point="mnist_example.py",
chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
worker_count=1,
worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"])
Please note that TPUStrategy with TensorFlow Cloud works only with TF version 2.1, as this is the latest version supported by AI Platform Cloud TPUs.
Custom distribution strategy
If you would like to take care of specifying the distribution strategy in your model code and do not want the run API to create a strategy for you, then set distribution_strategy to None. This will be required, for example, when you are using strategy.experimental_distribute_dataset.
tfc.run(entry_point='mnist_example.py',
        distribution_strategy=None,
        worker_count=2)
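For reference, a minimal sketch of what the training code itself might look like in this case, creating the strategy explicitly (this assumes tf.distribute.MultiWorkerMirroredStrategy from a recent TF 2.x release and is only an illustration, not the exact example referenced above):
import tensorflow as tf
# With distribution_strategy=None, the training code is responsible for
# creating and applying the strategy itself.
strategy = tf.distribute.MultiWorkerMirroredStrategy()
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape((60000, 28 * 28)).astype('float32') / 255
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu', input_shape=(28 * 28,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
# Having the strategy object available is what enables calls such as
# strategy.experimental_distribute_dataset in a custom training loop.
model.fit(x_train, y_train, epochs=10, batch_size=128)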
The API call will encompass the following:
By default, we will use your local docker daemon for building and publishing docker images to Google Container Registry. Images are published to gcr.io/your-gcp-project-id. If you specify docker_config.image_build_bucket, then we will use Google Cloud Build to build and publish docker images.
We use Google AI platform for deploying docker images on GCP.
Please note that, when the entry_point argument is specified, all the files in the same directory tree as entry_point will be packaged in the docker image created, along with the entry_point file.
Please see the run API documentation for detailed information on the parameters and how you can modify the above processes to suit your needs.
cd src/python/tensorflow_cloud/core
python tests/examples/call_run_on_script_with_keras_fit.py
End to end examples:
Using a Python file as entry_point (Keras fit API).
Using a Python file as entry_point (Keras custom training loop).
Using a Python file as entry_point (Keras save and load).
Using a notebook file as entry_point.
Using run within a Python script that contains the tf.keras model.
To run the unit tests:
pytest src/python/tensorflow_cloud/core/tests/unit/
Things to keep in mind when running your jobs remotely:
[Coming soon]
Here are some tips for fixing unexpected issues.
Error like: Creating a generator within a strategy scope is disallowed, because there is ambiguity on how to replicate a generator (e.g. should it be copied so that each replica gets the same random numbers, or 'split' so that each replica gets different random numbers).
Solution: Passing distribution_strategy='auto' to the run API wraps all of your script in a TF distribution strategy based on the cluster configuration provided. You will see the above error, or something similar to it, if for some reason an operation is not allowed inside the distribution strategy scope. To fix the error, please pass None to the distribution_strategy param and create a strategy instance as part of your training code, as shown in this example.
Error like: requests.exceptions.ConnectionError: ('Connection aborted.', timeout('The write operation timed out'))
Solution: The directory being used as an entry point likely has too much data for the image to successfully build, and there may be extraneous data included in the build. Reformat your directory structure such that the folder which contains the entry point only includes files necessary for the current project.
Error like: There was an error submitting the job.Field: tpu_tf_version Error: The specified runtime version '2.3' is not supported for TPU training. Please specify a different runtime version.
Solution: Please use TF version 2.1. See TPU Strategy in Cluster and distribution strategy configuration section.
Warning like: Docker parent image '2.4.0.dev20200720' does not exist. Using the latest TF nightly build.
Solution: If you do not provide the docker_config.parent_image param, then by default we use pre-built TF docker images as the parent image. If you do not have TF installed in the environment where run is called, then the TF docker image for the latest stable release will be used. Otherwise, the version of the docker image will match the locally installed TF version. However, pre-built TF docker images aren't available for TF nightlies except for the latest. So, if your local TF is an older nightly version, we upgrade to the latest nightly automatically and raise this warning.
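If you want to avoid the automatic upgrade, one option is to pin the parent image explicitly. A sketch, assuming the DockerConfig helper exposed as tfc.DockerConfig; the image tag below is just an illustrative Docker Hub tag, pick one matching your TF version:
import tensorflow_cloud as tfc
# Pin the docker parent image instead of relying on auto-detection of the local TF version.
tfc.run(
    entry_point='mnist_example.py',
    docker_config=tfc.DockerConfig(parent_image='tensorflow/tensorflow:2.3.0-gpu'))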
Error like: RuntimeError: Mixing different tf.distribute.Strategy objects.
Solution: Please provide distribution_strategy=None when you already have a distribution strategy defined in your model code. Specifying distribution_strategy='auto' will wrap your code in a TensorFlow distribution strategy, and this will cause the above error if a strategy object is already used in your code.
We welcome community contributions; see CONTRIBUTING.md and, for style help, the Writing TensorFlow documentation guide.
This application reports technical and operational details of your usage of Cloud Services in accordance with the Google privacy policy; for more information, please refer to https://policies.google.com/privacy. If you wish to opt out, you may do so by running tensorflow_cloud.utils.google_api_client.optout_metrics_reporting().
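For example, the opt-out call can be made directly from Python:
from tensorflow_cloud.utils import google_api_client
# Opt out of usage metrics reporting (see the privacy notice above).
google_api_client.optout_metrics_reporting()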