tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0
186.28k stars 74.31k forks source link

Loading Tensorflow models from disk is instant while loading from gcs is very slow #36505

Open Sam-Persoon opened 4 years ago

Sam-Persoon commented 4 years ago

System information

Describe the current behavior

I want to use Tensorflow serving to load multiple Models. If I mount a directory containing the model, loading everything is done in an instant, while loading them from a gs:// path takes around 10 seconds per model.

While researching the issue I discovered this is probably a Tensorflow issue and not a Tensorflow Serving issue as loading them in Tensorflow is a huge difference as well:

    [ins] In [22]: %timeit tf.saved_model.load('test/1')
    3.88 s ± 719 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    [ins] In [23]: %timeit tf.saved_model.load('gs://path/to/test/1')
    30.6 s ± 2.66 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Then it could be that downloading the model (which is very small) is slow, but I tested this as well:

    def test_load():
        bucket_name = 'path'
        folder='test'
        delimiter='/'
        file = 'to/test/1'

        bucket=storage.Client().get_bucket(bucket_name)
        blobs=bucket.list_blobs(prefix=file, delimiter=delimiter)
        for blob in blobs:
            print(blob.name)
            destination_uri = '{}/{}'.format(folder, blob.name)
            blob.download_to_filename(destination_uri)

    [ins] In [31]: %timeit test_load()
    541 ms ± 54.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Any idea what is happening here?

Describe the expected behavior Around the same load time for external vs local models. The first load from gs can be slow if the auth needs to happen, but even authenticating is still way faster than the models loads.

amahendrakar commented 4 years ago

@Sam-Persoon, In order to expedite the trouble-shooting process, could you please provide the complete code to reproduce the issue reported here. Thanks!

Sam-Persoon commented 4 years ago

This is just an untrained example model from the TF docs, but it has the issue:

from tensorflow.keras import layers
from tensorflow import keras
import timeit

num_tags = 12  # Number of unique issue tags
num_words = 10000  # Size of vocabulary obtained when preprocessing text data
num_departments = 4  # Number of departments for predictions

title_input = keras.Input(shape=(None,), name='title')  # Variable-length sequence of ints
body_input = keras.Input(shape=(None,), name='body')  # Variable-length sequence of ints
tags_input = keras.Input(shape=(num_tags,), name='tags')  # Binary vectors of size `num_tags`

# Embed each word in the title into a 64-dimensional vector
title_features = layers.Embedding(num_words, 64)(title_input)
# Embed each word in the text into a 64-dimensional vector
body_features = layers.Embedding(num_words, 64)(body_input)

# Reduce sequence of embedded words in the title into a single 128-dimensional vector
title_features = layers.LSTM(128)(title_features)
# Reduce sequence of embedded words in the body into a single 32-dimensional vector
body_features = layers.LSTM(32)(body_features)

# Merge all available features into a single large vector via concatenation
x = layers.concatenate([title_features, body_features, tags_input])

# Stick a logistic regression for priority prediction on top of the features
priority_pred = layers.Dense(1, activation='sigmoid', name='priority')(x)
# Stick a department classifier on top of the features
department_pred = layers.Dense(num_departments, activation='softmax', name='department')(x)

# Instantiate an end-to-end model predicting both priority and department
model = keras.Model(inputs=[title_input, body_input, tags_input],
                    outputs=[priority_pred, department_pred])
model.compile(optimizer='adam', loss='mae')
model.save('test')

# --------
import_module = "import tensorflow as tf"
code = "tf.saved_model.load('test')"
print(timeit.timeit(stmt=code, setup=import_module, number=10))

# Upload the test model to a gcs bucket you can access
code_gcs = "tf.saved_model.load('gs://your-bucket/test')"
print(timeit.timeit(stmt=code_gcs, setup=import_module, number=10))
mihaimaruseac commented 4 years ago

There are two different filesystems involved. Can you try to post the times for the following code

local_filename = "/path/to/some/local/file"
gcs_filename = "gs://bucket/and/path/to/file"
size = 1024

filename = local_filename   # or gcs_filename
with tf.io.gfile.GFile(filename) as f:
  f.read(size)

Ensure that the file has at least size bits, keep increasing the size (1KB, 2KB, 4KB, ... 1MB, 1GB) and compare the two times (maybe plot them)?

Sam-Persoon commented 4 years ago

The following code is used for the plot

# --------
import_module = "import tensorflow as tf"
sizes = [1024*i for i in [1, 2, 4, 8, 50, 100, 500, 1000]]
local_times = []
gcs_times = []
for size in sizes:
    print(size)
    code = f"""
filename = 'test/saved_model.pb'
with tf.io.gfile.GFile(filename, 'rb') as f:
    f.read({size})
"""
    local_times.append(timeit.timeit(stmt=code, setup=import_module, number=10))
    code_gcs = f"""
filename = 'gs://bucket/test/saved_model.pb'
with tf.io.gfile.GFile(filename, 'rb') as f:
    f.read({size})
"""
    gcs_times.append(timeit.timeit(stmt=code_gcs, setup=import_module, number=10))

Mind the log scales in the plot plot

mihaimaruseac commented 4 years ago

This seems reasonable as the difference between local and remote access over the network.

Sam-Persoon commented 4 years ago

Loading 1 kb in 6s is really slow though. I tested the impact of sequentially loading the models to check if maybe the authentication to gcs impacts things. The following code:

import tensorflow as tf
import time
from google.cloud import storage
import os

model_names = ['test1', 'test2', 'test3', 'test4']

storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket-name')
for model in model_names:
    print(model)

    start_time = time.time()
    tf.saved_model.load(f'gs://bucket-name/models/{model}/1')
    elapsed_time = time.time() - start_time
    print(f'Loading from gcs: {elapsed_time}')

    start_time = time.time()
    blobs = bucket.list_blobs(prefix=f'models/{model}/1')  # Get list of files
    folder = f'./download_folder'
    for blob in blobs:
        // create dirs if not exist, code omitted
        blob.download_to_filename(folder + '/' + blob.name)

        // time local disk loading
    start_time_internal = time.time()
    tf.saved_model.load(os.path.join(folder, 'models', model, '1' ))
    elapsed_time_internal = time.time() - start_time_internal
    print(f'Local disk: {elapsed_time_internal}')

    elapsed_time = time.time() - start_time
    print(f'Downloading from gcs, then loading: {elapsed_time}')

Gives the following outputs:

test1
Loading from gcs: 32.72501277923584
Local disk: 2.4520139694213867
Downloading from gcs, then loading: 4.770485162734985

test2
Loading from gcs: 0.9281768798828125
Local disk: 0.013478279113769531
Downloading from gcs, then loading: 0.45862293243408203

test3
Loading from gcs: 57.323729038238525
Local disk: 1.7698051929473877
Downloading from gcs, then loading: 3.2642600536346436

test4
Loading from gcs: 3.6672658920288086
Local disk: 0.053382158279418945
Downloading from gcs, then loading: 0.8592350482940674

test1 and test3 are larger models, test2 and 4 are smaller. Large models really suffer from this.

mihaimaruseac commented 4 years ago

Can you compare with downloading a file from GCS from command-line? (That is, not using TensorFlow)

Sam-Persoon commented 4 years ago

Single-threaded downloading the files gives the following output:

test1
Loading from gcs: 29.73722219467163
Loading from local disk: 2.546816825866699
Downloading from gcs, then loading from local disk: 3.7411539554595947
Downloading from gcs using cli, then loading from local disk: 9.842350006103516

test2
Loading from gcs: 0.721973180770874
Loading from local disk: 0.01868605613708496
Downloading from gcs, then loading from local disk: 0.31568098068237305
Downloading from gcs using cli, then loading from local disk: 3.2689220905303955

test3
Loading from gcs: 49.419824838638306
Loading from local disk: 1.8356938362121582
Downloading from gcs, then loading from local disk: 2.639312982559204
Downloading from gcs using cli, then loading from local disk: 11.165163040161133

test4
Loading from gcs: 2.7186670303344727
Loading from local disk: 0.054453134536743164
Downloading from gcs, then loading from local disk: 0.8785662651062012
Downloading from gcs using cli, then loading from local disk: 3.097564935684204
Sam-Persoon commented 4 years ago

Any updates on this?

mihaimaruseac commented 4 years ago

I think the issue is that you don't factor in the time spent on network operations. To me, it looks consistent

Sam-Persoon commented 4 years ago

I think the issue is that you don't factor in the time spent on network operations. To me, it looks consistent

I don't see how I am not factoring in these. If I adapt the script posted above to:

import tensorflow as tf
import time
from google.cloud import storage
import os
import subprocess
total_time = time.time()
model_names = ['test1', 'test2', 'test3', 'test4']
for model in model_names:
        start_time = time.time()
        tf.saved_model.load(f'gs://bucket-name/models/{model}/1')
        elapsed_time = time.time() - start_time
        print(f'Loading from gcs: {elapsed_time}')
total_elapsed_time = time.time() - total_time
print(f"Total elapsed time: {total_elapsed_time}")

I get the following output:

Loading from gcs: 28.524829864501953
Loading from gcs: 0.7831361293792725
Loading from gcs: 52.6575129032135
Loading from gcs: 3.1442267894744873
Total elapsed time: 85.57314991950989

And downloading them first and then loading:

import tensorflow as tf
import time
from google.cloud import storage
import os
import subprocess
total_time = time.time()
model_names = ['test1', 'test2', 'test3', 'test4']

storage_client = storage.Client()
bucket = storage_client.get_bucket('bucket-name')
for model in model_names:
        start_time = time.time()
        blobs = bucket.list_blobs(prefix=f'models/{model}/1')  # Get list of files
        folder = f'./download_folder'
        for blob in blobs:
                # create folder if not exist
                blob.download_to_filename(folder + '/' + blob.name)

        start_time_internal = time.time()
        tf.saved_model.load(os.path.join(folder, 'models', model, '1' ))
        elapsed_time_internal = time.time() - start_time_internal
        print(f'Loading from local disk: {elapsed_time_internal}')

        elapsed_time = time.time() - start_time
        print(f'Downloading from gcs, then loading from local disk: {elapsed_time}')

)

total_elapsed_time = time.time() - total_time
print(f"Total elapsed time: {total_elapsed_time}")

Gives following output

Loading from local disk: 2.6329727172851562
Downloading from gcs, then loading from local disk: 4.476565837860107
Loading from local disk: 0.025137901306152344
Downloading from gcs, then loading from local disk: 0.4631459712982178
Loading from local disk: 1.6898088455200195
Downloading from gcs, then loading from local disk: 3.088888168334961
Loading from local disk: 0.04868316650390625
Downloading from gcs, then loading from local disk: 1.0312209129333496
Total elapsed time: 9.417081117630005

I timed everything now, so network operations are in here. I can execute both scripts multiple times, the execution times are always in line with these results.

mihaimaruseac commented 4 years ago

Thanks. I'll get a look into this, though not this week or the next (loaded with other work items).

Sam-Persoon commented 4 years ago

Thanks. I'll get a look into this, though not this week or the next (loaded with other work items).

Thanks.

Sam-Persoon commented 4 years ago

As a month passed since the last comment, any updates on this?

mihaimaruseac commented 4 years ago

Sorry, but unfortunately due to COVID my work schedule became more chaotic. So it will still take some time. Apologies for this

ethanyanjiali commented 4 years ago

Having a same issue here, except that I'm loading a big SavedModel and it's extremely slow. Downloading the same model with gsutil is much faster so it rules out the problem of the network.

mihaimaruseac commented 3 years ago

The filesystem glob matching implementation has recently changed. Let's test with the 2.4 release when it gets out to see if this is made better

mihaimaruseac commented 3 years ago

TF 2.4 has been released. Is this still an issue?

mihaimaruseac commented 3 years ago

Please reopen if this is still an issue in 2.4 release.

derekhh commented 3 years ago

Hi @mihaimaruseac - I can confirm this is still an issue in TF2.4

This is the code I used to test this issue:

import sys
import time

import tensorflow as tf

def main(argv):
    print(tf.__version__)
    tic = time.time()
    tf.saved_model.load(argv[0])
    toc = time.time()
    print(f'Local load time: {toc - tic:.2f}s')
    tic = time.time()
    tf.saved_model.load(argv[1])
    toc = time.time()
    print(f'Remote load time: {toc - tic:.2f}s')

if __name__ == '__main__':
    main(sys.argv[1:])

I copied the SavedModel directory I'm loading from GCS to local on a Google Cloud Virtual Machine. The directory is around 170MB in size and copying takes less than 2 seconds.

Test result on TF 2.2.2:

hao.hu@derekhh-dev /tmp
tmp_py37 ❯ python temp.py /tmp/202102051600 gs://foo/202102051600/

2.2.2
Local load time: 20.78s
Remote load time: 173.86s

Test result on TF 2.3.2:

hao.hu@derekhh-dev /tmp
tmp_py37 ❯ python temp.py /tmp/202102051600 gs://foo/202102051600/

2.3.2
Local load time: 20.98s
Remote load time: 180.84s

Test result on TF 2.4.1:

hao.hu@derekhh-dev /tmp
tmp_py37 ❯ python temp.py /tmp/202102051600 gs://foo/202102051600/

2.4.1
Local load time: 20.63s
Remote load time: 195.49s
mihaimaruseac commented 3 years ago

Thank you for confirming.

lgeiger commented 3 years ago

@mihaimaruseac Thanks for investigating. I created a quick profile with TF 2.4 and most of the time is actually spent in the following function call mainly due to the execution of the restore_v2 operation, so I am not sure if the root cause is actually related to the TF file system implementation: https://github.com/tensorflow/tensorflow/blob/7a1b65b61397dbad1d400e1d3c7079246dfe309a/tensorflow/python/saved_model/load.py#L161

lgeiger commented 3 years ago

@mihaimaruseac did you have any chance to look into this issue?

Reloading models (even small ones) from GCS seems almost unusable even with a super fast network connection so unfortunately the only workaround currently is to download the saved model (which tends to be super fast) and read it from the local disk which can be quite cumbersome for users.

mihaimaruseac commented 3 years ago

Unfortunately, I haven't :( I'm working on the security issues at the moment and putting everything else on a lower priority track as there is not enough time to work on everything :(

luke-mcdowall-yeo commented 2 years ago

Was there any resolution to this or a work around found?

Venkat6871 commented 3 months ago

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant with the current state of the code base.

The Tensorflow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings, with all the debugging information which could help us investigate.

Please follow the release notes to stay up to date with the latest developments which are happening in the Tensorflow space.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

mihaimaruseac commented 3 months ago

This is still an issue