Tensorflow Hub: Support multi-GPU training in Keras or Estimator

rsethur commented 6 years ago

In my project I use TF Hub with Estimators. However, when I try to use multiple GPUs (on a single machine) via tf.contrib.estimator.replicate_model_fn, I get the following error:

variable_scope was unused but the corresponding name_scope was already taken.

Probably it is from this source line : link

Any help is much appreciated - received with thanks.

CC: @arnoegw

arnoegw commented 6 years ago

Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn, because it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.

If anyone else is hampered by this issue as well, please speak up here.

rsethur commented 6 years ago

Hello @arnoegw, can you please provide more guidance? Some pseudocode would help. TF Hub + Estimators have awesome potential for developers, and ironing out these kinks would definitely help.

arnoegw commented 6 years ago

I very much agree: it would be great to iron out the kinks that prevent straightforward use of Hub modules with multi-GPU Estimators. Unfortunately, at this time, I have neither that code nor worked-out example code for the workaround that I sketched above. Sorry.

Leaving this open for the feature request...

matthew-z commented 6 years ago

+1 The same problem when using Estimator.

I also look forward to trying multi-GPU training with tf-hub.

nikolausWest commented 6 years ago

+1 Same issue here. Would like to use tf-hub with estimators and multi GPU.

In the meantime, some pseudocode or a more detailed explanation of how to hack around it would be really appreciated.

akhilkatpally commented 6 years ago

+1 Same problem when using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

marhlder commented 5 years ago

Did anyone manage to conjure a working hack for this? I was unable to get it to work through a tf.collection

marhlder commented 5 years ago

Where would the shared instance have to be created?

Doing something like this in the model_fn does not work:

      import tensorflow as tf
      import tensorflow_hub as hub

      COLLECTION = "SHARED_ELMO_INSTANCE_COLLECTION"

      # Inside model_fn: create the module only if this graph has none yet.
      if not tf.get_collection(COLLECTION):
        elmo = hub.Module("https://tfhub.dev/google/elmo/2", name="ELMO", trainable=True)
        tf.add_to_collection(COLLECTION, elmo)

      # Reuse the instance stored in the graph's collection.
      elmo = tf.get_collection(COLLECTION)[0]

      # tokens and tokens_length are input tensors defined earlier in model_fn.
      elmo_representations = elmo(
        inputs={
          "tokens": tokens,
          "sequence_len": tokens_length
        },
        signature="tokens",
        as_dict=True)["elmo"]

jasonkrone commented 5 years ago

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

edumotya commented 5 years ago

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

bjayakumar commented 5 years ago

Came here to report that it is still not fixed. I hope they fix it soon.

Harshini-Gadige commented 5 years ago

@arnoegw Any update or ETA on this ?

arnoegw commented 5 years ago

Hi all, thanks for your patience. We understand that multi-GPU training is important. While it was possible in low-level TensorFlow early on, its support by high-level frameworks has been a moving target. With the advent of TensorFlow 2 (see the recent Dev Summit), both sides of the story are changing again, but for the better:

Hub modules for TF2 will be SavedModels in the TF2 version of that format, loaded natively with tf.saved_model.load(). Under the hood, this provides a clean separation of computation and state, which helps the cause.
DistributionStrategy is the new, more powerful abstraction for various kinds of parallel training.

So the TF2 version of this feature request is DistributionStrategy support for model pieces brought in by loading a SavedModel, preferably through Keras (not low-level TF). This is on the radar for the TensorFlow and TF Hub teams, but there is no specific timeline.
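To make the "clean separation of computation and state" a little more concrete, here is a minimal sketch, not code from the TF Hub team. It assumes TensorFlow 2 is installed; the `Scale` class and its variable `w` are hypothetical names chosen for illustration. It saves a tiny tf.Module as a TF2 SavedModel, loads it back with tf.saved_model.load(), and shows that the restored variable (state) can be changed independently of the restored function (computation).

```python
import tempfile

import tensorflow as tf


class Scale(tf.Module):
    """Hypothetical toy module: multiplies its input by a variable."""

    def __init__(self):
        super().__init__()
        self.w = tf.Variable(2.0)  # state

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def __call__(self, x):
        return self.w * x  # computation


# Export as a TF2 SavedModel and load it back natively.
export_dir = tempfile.mkdtemp()
tf.saved_model.save(Scale(), export_dir)
restored = tf.saved_model.load(export_dir)

x = tf.constant([1.0, 2.0])
before = restored(x)

# The variable is a regular tf.Variable again, so it can be updated
# without touching the restored function.
restored.w.assign(3.0)
after = restored(x)
```

Because the loaded object exposes its variables as ordinary tf.Variable objects, a framework can in principle decide how to place or replicate them, which is the property the comment above refers to.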

tf.contrib.estimator.replicate_model_fn is deprecated by now. We do not plan to go back and work on supporting it. Let me change the issue title accordingly....

arnoegw commented 5 years ago

For those especially interested in retraining of image models faster than with retrain.py:

If you are ready to live on the cutting edge of TF 2.0.0alpha0, take a look at Hub's examples/colab/tf2_image_retraining.ipynb which is considerably smaller, faster (if you use a GPU), and even supports fine-tuning the image module. However, this is still with a single GPU.

o-90 commented 5 years ago

Really hampered by this issue.

From what I understand tensorflow_hub.Module._try_get_state_scope is complaining because the embeddings are trying to be placed on all available GPUs.

"one would have to share hub.Module instances for each graph that model_fn gets called in"

A little more detail on what is meant by that sentence would go a long way. I am not asking for a solution, but some pseudocode would be great.
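One possible reading of "share instances for each graph" is a cache keyed by graph, so that repeated model_fn calls in the same graph get the same object back. The sketch below only illustrates that idea in plain Python. It is not a confirmed fix: nothing in this thread shows that it resolves the error. `PerGraphCache` and `StandInGraph` are hypothetical names, and the stand-in graph class exists only so the sketch runs without TensorFlow. In TF1 the key would presumably be tf.get_default_graph() and the factory would construct the hub.Module.

```python
import weakref


class PerGraphCache:
    """Keeps at most one object per graph, created lazily by `factory`."""

    def __init__(self, factory):
        self._factory = factory
        # Weak keys, so a discarded graph does not keep its entry alive.
        self._instances = weakref.WeakKeyDictionary()

    def get(self, graph):
        if graph not in self._instances:
            self._instances[graph] = self._factory()
        return self._instances[graph]


class StandInGraph:
    """Placeholder for a tf.Graph so the sketch runs without TensorFlow."""


# Assumption: in TF1 the factory would be something like
# lambda: hub.Module(...), and `graph` would be tf.get_default_graph().
module_cache = PerGraphCache(factory=object)

graph = StandInGraph()
first = module_cache.get(graph)   # first model_fn call: instance is created
second = module_cache.get(graph)  # later calls in the same graph reuse it

other_graph = StandInGraph()
third = module_cache.get(other_graph)  # a different graph gets its own instance
```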

r-wheeler commented 5 years ago

I am really hampered by this issue as well.

rsethur commented 5 years ago

@arnoegw Many thanks for the development. Question: How is Hub positioned in comparison to the Keras Applications models? They seem quite similar. Will there be some unification in the future? Also, some of the models (object detection) do not support fine-tuning. Do you plan to fix this in future releases?

Thanks again!

arnoegw commented 5 years ago

@rsethur: There are no plans for unification at this time. TF Hub overlaps with Keras Applications for the particular case of reusing CNNs for image classification / feature extraction, but TF Hub offers modules (sometimes entire models) for a number of other domains, and requires neither the module consumer nor the module publisher to use Keras.

arnoegw commented 5 years ago

@gobrewers14, @r-wheeler: There is no great solution for TF1, but for TF2, there are the plans I described on March 15, and the already available examples/colab/tf2_image_retraining.ipynb with decent fine-tuning performance on a single GPU. Hope that helps.

littleDing commented 5 years ago

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).

mhajiaghayi commented 5 years ago

I have the same problem with tf-hub and Estimator, and I am very disappointed by the response of the TF team. Sadly, there are lots of changes in TensorFlow from one version to another.

Aashish-1008 commented 5 years ago

+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy(num_gpus=8)).

serdarbozoglan commented 5 years ago

I am also getting the same error: "RuntimeError: variable_scope module_8/ was unused but the corresponding name_scope was already taken."

akshaydnicator commented 4 years ago

Still not fixed I believe. Please help!

RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.

Full Traceback:

      RuntimeError                              Traceback (most recent call last)
      in
            6 tf.compat.v1.disable_eager_execution()
            7
      ----> 8 elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)

      /opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in __init__(self, spec, trainable, name, tags)
          160       raise ValueError("No such graph variant: tags=%r" % tags)
          161
      --> 162     abs_state_scope = _try_get_state_scope(name, mark_name_scope_used=False)
          163     self._name = abs_state_scope.split("/")[-2]
          164

      /opt/conda/lib/python3.6/site-packages/tensorflow_hub/module.py in _try_get_state_scope(name, mark_name_scope_used)
          393       raise RuntimeError(
          394           "variable_scope %s was unused but the corresponding "
      --> 395           "name_scope was already taken." % abs_state_scope)
          396     return abs_state_scope
          397

      RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.

sbecon commented 4 years ago

I have the same issue

frozenzo commented 3 years ago

Still hampered by the same issue at this time. Is there any (hacky) solution?

arnoegw commented 3 years ago

This won't be fixed for TF1 and the libraries that target it (hub.Module, Estimator).

For TF2, Keras, and the TF2 SavedModels loaded from TF Hub with hub.KerasLayer, the usual way of building and compiling a Keras model under a tf.distribute.MirroredStrategy and then calling .fit() on a tf.data.Dataset should just work. What we don't have yet is a great example to demonstrate that, say, on a multi-GPU machine on Google Cloud.
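As a rough illustration of the pattern described above (not an official example), the sketch below builds and compiles a Keras model inside strategy.scope() and then calls .fit() on a tf.data.Dataset. It rests on several assumptions: TensorFlow 2 is installed, the data is synthetic, and a plain Dense layer stands in for the pretrained layer so the sketch runs without downloading anything. In real use, the stand-in would be replaced by a hub.KerasLayer created inside the same scope. `build_model` and the sizes used here are hypothetical.

```python
import numpy as np
import tensorflow as tf


def build_model(feature_layer, num_classes):
    """Stacks a (possibly pretrained) feature layer and a classifier head."""
    model = tf.keras.Sequential([
        feature_layer,
        tf.keras.layers.Dense(num_classes),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model


# Uses all visible GPUs; falls back to a single replica when none are found.
strategy = tf.distribute.MirroredStrategy()

# The dataset is batched with the global batch size, which Keras splits
# across the replicas.
global_batch_size = 8 * strategy.num_replicas_in_sync

# Synthetic stand-in data: 64 examples, 16 features, 3 classes.
features = np.random.rand(64, 16).astype("float32")
labels = np.random.randint(0, 3, size=(64,))
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.batch(global_batch_size)

with strategy.scope():
    # Assumption: with TF Hub this would be hub.KerasLayer(<module handle>,
    # trainable=True); a Dense layer is used here to avoid a download.
    feature_layer = tf.keras.layers.Dense(32, activation="relu")
    model = build_model(feature_layer, num_classes=3)

history = model.fit(dataset, epochs=1, verbose=0)
```

The key point is that every layer holding variables, including the one loaded from TF Hub, is created inside strategy.scope(), so its variables are mirrored across the replicas.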

maringeo commented 3 years ago

TF Hub's make_image_classifier tool has been updated to use tf.data.Dataset and to demonstrate distributed training, including multi-GPU: https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier.

The make_image_classifier code is not a minimal working example, but as https://github.com/tensorflow/hub/issues/64#issuecomment-777335474 says, a Keras model built under tf.distribute.MirroredStrategy that uses tf.data.Dataset should work on multi-GPU.

I plan to keep this issue open for a few weeks, in case anyone encounters any issues that I've missed during testing.
