rsethur closed this issue 3 years ago.
Thanks for your report. Unfortunately, the straightforward way of instantiating a hub.Module in the model_fn of an Estimator does not currently work with tf.contrib.estimator.replicate_model_fn, because of how it calls the same model_fn repeatedly. To hack around this, one would have to share hub.Module instances for each graph that model_fn gets called in (e.g., through a custom collection). After that, applying a Module object multiple times should basically just work.
If anyone else is hampered by this issue as well, please speak up here.
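The per-graph sharing described above can be sketched roughly as follows. This is a minimal, framework-agnostic sketch in plain Python so it runs without TensorFlow; `PerGraphCache` is a hypothetical helper (not part of tensorflow_hub), and it uses a module-level dict keyed by graph object in place of the custom graph collection mentioned above. In a TF1 model_fn one would call something like `cache.get(tf.get_default_graph(), lambda: hub.Module(...))` before applying the module; whether this alone avoids the replicate_model_fn errors reported in this thread has not been verified.

```python
class PerGraphCache:
    """Hypothetical helper: returns one shared object per graph.

    The first request for a given graph calls `factory()` and stores the
    result; later requests for the same graph (e.g. repeated model_fn calls,
    one per GPU tower) get the same object back instead of a new one.
    """

    def __init__(self):
        # Maps each graph object (hashed by identity) to its shared instance.
        # Entries live as long as this cache does.
        self._instances = {}

    def get(self, graph, factory):
        if graph not in self._instances:
            self._instances[graph] = factory()
        return self._instances[graph]
```

Keying on the graph, rather than creating one global instance, matters because an Estimator builds a fresh graph for each train/eval/predict call, and a hub.Module belongs to the graph it was created in.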
Hello @arnoegw, could you please provide more guidance? Some pseudocode would help. TF-Hub + Estimators have awesome potential for developers, and ironing out these kinks would definitely help.
I very much agree: it would be great to iron out the kinks that prevent straightforward use of Hub modules with multi-GPU Estimators. Unfortunately, at this time, I neither have that code, nor worked-out example code for the hack around that I sketched above. Sorry.
Leaving this open for the feature request...
+1, same problem when using Estimator.
I also look forward to trying multiGPU with tf-hub
+1 Same issue here. Would like to use tf-hub with estimators and multi GPU.
In the meantime, some pseudocode or a more detailed explanation of how to hack around it would be really appreciated.
+1 Same problem when using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).
Did anyone manage to conjure a working hack for this? I was unable to get it to work through a tf.collection
Where would the shared instance have to be created?
Doing something like this in the model_fn does not work:
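```python
import tensorflow as tf
import tensorflow_hub as hub

# Inside model_fn: create the ELMo module only once, then reuse it from the collection.
if len(tf.get_collection("SHARED_ELMO_INSTANCE_COLLECTION", scope=None)) == 0:
    elmo = hub.Module("https://tfhub.dev/google/elmo/2", name="ELMO", trainable=True)
    tf.add_to_collection("SHARED_ELMO_INSTANCE_COLLECTION", elmo)

elmo = tf.get_collection("SHARED_ELMO_INSTANCE_COLLECTION", scope=None)[0]

# `tokens` and `tokens_length` are assumed to be defined earlier in model_fn.
elmo_representations = elmo(
    inputs={
        "tokens": tokens,
        "sequence_len": tokens_length,
    },
    signature="tokens",
    as_dict=True)["elmo"]
```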
+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).
+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).
Came here to report that it is still not fixed. I hope they fix it soon.
@arnoegw Any update or ETA on this?
Hi all, thanks for your patience. We understand that multi-GPU training is important. While it was possible in low-level TensorFlow early on, its support by high-level frameworks has been a moving target. With the advent of TensorFlow 2 (see the recent Dev Summit), both sides of the story are changing again, but for the better:
tf.saved_model.load(). Under the hood, this provides a clean separation of computation and state, which helps the cause.
So the TF2 version of this feature request is DistributionStrategy support for model pieces brought in by loading a SavedModel, preferably through Keras (not low-level TF). This is on the radar for the TensorFlow and TF Hub teams, but there is no specific timeline.
tf.contrib.estimator.replicate_model_fn is deprecated by now. We do not plan to go back and work on supporting it. Let me change the issue title accordingly.
For those especially interested in retraining of image models faster than with retrain.py:
If you are ready to live on the cutting edge of TF 2.0.0alpha0, take a look at Hub's examples/colab/tf2_image_retraining.ipynb, which is considerably smaller, faster (if you use a GPU), and even supports fine-tuning the image module. However, it still uses a single GPU.
Really hampered by this issue.
From what I understand, tensorflow_hub.Module._try_get_state_scope complains because the embeddings are being placed on all available GPUs.
> one would have to share hub.Module instances **for each graph** that model_fn gets called in
A little more detail on what is meant by that sentence would go a long way. I'm not asking for a solution, but some pseudocode would be great.
I am really hampered by this issue as well.
@arnoegw Many thanks for the development. Question: how is Hub positioned relative to the Keras Applications models? They seem quite similar. Will there be some unification in the future? Also, some of the models do not support fine-tuning (e.g., object detection). Do you plan to fix this in future releases?
Thanks again!
@rsethur: There are no plans for unification at this time. TF Hub overlaps with Keras Applications for the particular case of reusing CNNs for image classification / feature extraction, but TF Hub offers modules (sometimes entire models) for a number of other domains, and requires neither the module consumer nor the module publisher to use Keras.
@gobrewers14, @r-wheeler: There is no great solution for TF1, but for TF2, there are the plans I described on March 15, and the already available examples/colab/tf2_image_retraining.ipynb with decent fine-tuning performance on a single GPU. Hope that helps.
+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy()).
I have the same problem with tf-hub and Estimator, and I am very disappointed by the TF team's response. Sadly, there are lots of changes in TensorFlow from one version to the next.
+1 I'm having the same problem using Estimator and tf-hub with multi-GPU (tf.contrib.distribute.MirroredStrategy(num_gpus=8)).
I am also getting the same error: "RuntimeError: variable_scope module_8/ was unused but the corresponding name_scope was already taken."
Still not fixed I believe. Please help!
RuntimeError: variable_scope module_3/ was unused but the corresponding name_scope was already taken.
Full Traceback:
RuntimeError Traceback (most recent call last)
I have the same issue
I'm still hampered by the same issue. Is there any (hacky) solution in the meantime?
This won't be fixed for TF1 and the libraries that target it (hub.Module, Estimator).
For TF2, Keras, and the TF2 SavedModels loaded from TF Hub with hub.KerasLayer, the usual way of building and compiling a Keras model under a tf.distribute.MirroredStrategy and then calling .fit() on a tf.data.Dataset should just work. What we don't have yet is a great example to demonstrate that, say, on a multi-GPU machine on Google Cloud.
TF Hub's make_image_classifier tool has been updated to use tf.data.Dataset and to demonstrate distributed training, including multi-GPU: https://github.com/tensorflow/hub/tree/master/tensorflow_hub/tools/make_image_classifier.
The make_image_classifier code is not a minimal working example, but as https://github.com/tensorflow/hub/issues/64#issuecomment-777335474 says, a Keras model built under tf.distribute.MirroredStrategy that uses tf.data.Dataset should work on multi-GPU.
I plan to keep this issue open for a few weeks, in case anyone encounters any issues that I've missed during testing.
In my project I use TF-Hub with Estimators. However, when I try to use multiple GPUs (single machine) via tf.contrib.estimator.replicate_model_fn, I get the following error:
It probably comes from this source line: link
Any help is much appreciated - received with thanks.
CC: @arnoegw