Closed feelingsonice closed 1 year ago
@pindinagesh is this a known issue? I don't see any other discussions on it, but it happens to me on both 1.6.1 and 1.7.0.
Is your module file on GCS (accessible by cloud)?
You mean the Dataflow job doesn't have access to the GCS bucket? If so, it's possible, but I also don't see any logs stating that. I'm assuming there'd be obvious permission-error logs.
For context, it only happens when I'm running with DataflowRunner. The direct runner completes just fine, so I don't see why the permission wouldn't be provisioned to the Dataflow runner.
It has. Your Transform component needs to access the module file; your Dataflow runner doesn't need to access the module file.
So I tried setting force_tf_compat_v1=True; it didn't work for me.
It also seems, from the issue you linked, that it happened when the op used import_utils.import_func_from_source. I'm not using that.
I'm also on 1.7.0. Is this issue still present? It seems like a very common use case and I'm not doing anything complex.
So to confirm, your module file is on GCS, right?
Yes
Just curious, is this example (Kubeflow instead of Vertex) working for you?
Can you link me the colab? I can try it.
@1025KB I tried the example. It did not work; I got the same error. But it's worth noting that I had to switch the DAG runner from Kubeflow to KubeflowV2. I did that because the V1 runner generates a YAML that Vertex AI for some reason doesn't accept. Maybe there's an easy fix, but I didn't have time to dig into it.
Also, I have a corporate GCP account and had to manually push the data root to my own GCS bucket. I don't have permission otherwise.
The KubeflowDagRunner should work because of this.
I created a PR to add that to KubeflowDagRunner V2, but currently there is a bug in PR sync, so the PR status became weird.
What I did is add the following to KubeflowV2DagRunner.run():

```python
for component in pipeline.components:
  # TODO(b/187122662): Pass through pip dependencies as a first-class
  # component flag.
  if isinstance(component, tfx_base_component.BaseComponent):
    component._resolve_pip_dependencies(  # pylint: disable=protected-access
        pipeline.pipeline_info.pipeline_root)
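For anyone who would rather not edit the installed TFX package, the same effect can be had by subclassing the runner and doing the fix-up pass before delegating to the parent's run(). The sketch below is pure Python with hypothetical stand-in classes (Component and Runner are not TFX types); in real code the loop body would be the component._resolve_pip_dependencies(...) call from the snippet above.

```python
# Stdlib sketch of the workaround pattern: subclass the runner so every
# component gets a fix-up pass before the parent class compiles the pipeline.
# Component and Runner are hypothetical stand-ins, not TFX classes.
class Component:
    def __init__(self, name):
        self.name = name
        self.pip_dependencies_resolved = False


class Runner:
    def run(self, components):
        # Stand-in for KubeflowV2DagRunner.run compiling the pipeline spec.
        return [c.name for c in components]


class PatchedRunner(Runner):
    def run(self, components):
        for c in components:
            # Stand-in for component._resolve_pip_dependencies(pipeline_root).
            c.pip_dependencies_resolved = True
        return super().run(components)


comps = [Component("Transform"), Component("Trainer")]
print(PatchedRunner().run(comps))  # ['Transform', 'Trainer']
```

The subclass keeps the fix local to your pipeline code, so upgrading TFX later (once the PR lands) only requires deleting the subclass.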
Hmm. Sounds like I just need to switch KubeflowDagRunner to KubeflowV2DagRunner? Could you give me a quick summary of what's going on here?
Also, not sure if Vertex AI just doesn't support KubeflowDagRunner or I'm missing something here, but the pipelines on Vertex AI only support JSON, and KubeflowDagRunner doesn't seem to produce that.
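Since Vertex consumes the JSON pipeline spec while the v1 runner emits Argo YAML, one quick sanity check on whichever artifact your runner produced is simply whether it parses as JSON. A minimal stdlib sketch; the sample strings are hypothetical:

```python
import json


def is_json_spec(text):
    """Return True if the pipeline file contents parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_json_spec('{"pipelineSpec": {"schemaVersion": "2.0.0"}}'))  # True
print(is_json_spec('apiVersion: argoproj.io/v1alpha1'))              # False
```

If the check returns False on the file you are uploading, you are almost certainly holding v1 KubeflowDagRunner output rather than a KubeflowV2DagRunner spec.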
KubeflowDagRunner is for Kubeflow; KubeflowV2DagRunner is for Vertex.
KubeflowDagRunner should work, and we found a potential fix [1] for KubeflowV2DagRunner. But you mentioned you saw the same "No module named 'user_module_0'" error on both Kubeflow and Vertex, so I'm not sure what's happening...
[1] Adding this to the KubeflowV2DagRunner.run function:

```python
for component in pipeline.components:
  # TODO(b/187122662): Pass through pip dependencies as a first-class
  # component flag.
  if isinstance(component, tfx_base_component.BaseComponent):
    component._resolve_pip_dependencies(  # pylint: disable=protected-access
        pipeline.pipeline_info.pipeline_root)
> But you mentioned you saw the same "No module named 'user_module_0'" error on both Kubeflow and Vertex
I didn't. I'm strictly using Vertex here.
I see, then I misunderstood.
For Vertex, before our PR is in, can you try adding that code to KubeflowV2DagRunner.run and retry?
OK, I can confirm it works. Do you know when the change will be released?
The next release, in about a month.
@1025KB So I'm now running into:
RuntimeError: The order of analyzers in your `preprocessing_fn` appears to be non-deterministic. This can be fixed either by changing your `preprocessing_fn` such that tf.Transform analyzers are encountered in a deterministic order or by passing a unique name to each analyzer API call.
It might just be my own code here, but I can't find what could possibly be non-deterministic, and this only appears when I add the --runner=DataflowRunner flag. For reference, my beam_pipeline_args looks like:
```python
BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS = [
    '--project=' + GOOGLE_CLOUD_PROJECT,
    '--temp_location=' + os.path.join('gs://', GCS_BUCKET_NAME, 'tmp'),
    '--runner=DataflowRunner',
    '--region=us-central1',
    '--experiments=upload_graph',  # upload_graph must be enabled
    '--dataflow_service_options=enable_prime',
    '--autoscaling_algorithm=THROUGHPUT_BASED',
]
```
And my preprocessing_fn (redacted for conciseness):
```python
_FEATURES = [
    # list of str
]

_SPECIAL_IMPUTE = {
    'special_foo': 1,
}

HOURS = [1, 2, 3, 4]

TABLE_KEYS = {
    'XXX': ['XXX_1', 'XXX_2', 'XXX_3'],
    'YYY': ['YYY_1', 'YYY_2', 'YYY_3'],
}


@tf.function
def _divide(a, b):
  return tf.math.divide_no_nan(tf.cast(a, tf.float32), tf.cast(b, tf.float32))


def preprocessing_fn(inputs):
  x = {}
  for name, tensor in sorted(inputs.items()):
    if tensor.dtype == tf.bool:
      tensor = tf.cast(tensor, tf.int64)
    if isinstance(tensor, tf.sparse.SparseTensor):
      default_value = '' if tensor.dtype == tf.string else 0
      tensor = tft.sparse_tensor_to_dense_with_shape(
          tensor, [None, 1], default_value)
    x[name] = tensor

  x['foo'] = _divide((x['foo1'] - x['foo2']), x['foo_denom'])
  x['bar'] = tf.cast(x['bar'] > 0, tf.int64)

  for hour in HOURS:
    total = tf.constant(0, dtype=tf.int64)
    for device_type in DEVICE_TYPES.keys():  # DEVICE_TYPES definition redacted
      total = total + x[f'some_device_{device_type}_{hour}h']

  # one-hot encode categorical values
  for name, keys in TABLE_KEYS.items():
    with tf.init_scope():
      initializer = tf.lookup.KeyValueTensorInitializer(
          tf.constant(keys),
          tf.constant([i for i in range(len(keys))]))
      table = tf.lookup.StaticHashTable(initializer, default_value=-1)
    indices = table.lookup(tf.squeeze(x[name], axis=1))
    one_hot = tf.one_hot(indices, len(keys), dtype=tf.int64)
    for i, _tensor in enumerate(
        tf.split(one_hot, num_or_size_splits=len(keys), axis=1)):
      x[f'{name}_{keys[i]}'] = _tensor

  return {name: tft.scale_to_0_1(x[name]) for name in _FEATURES}
```
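On the analyzer-order error: the message offers two fixes, deterministic traversal or a unique name per analyzer call. The preprocessing_fn above already sorts inputs.items(), but any feature collection whose iteration order comes from a set (or a directory listing, environment, etc.) can differ between the process that traces the graph and the Dataflow workers, because Python string hashing is randomized per process. A pure-Python sketch of the sorting fix, with hypothetical feature names and no TFX dependency:

```python
# Simulate naming one analyzer per feature in two separate "runs".
# A set has no guaranteed iteration order across processes, so sorting
# before iterating makes the analyzer creation order deterministic.
feature_names = {"foo", "bar", "baz"}  # hypothetical feature set

run_a = [f"scale_to_0_1/{n}" for n in sorted(feature_names)]
run_b = [f"scale_to_0_1/{n}" for n in sorted(feature_names)]

assert run_a == run_b  # identical order no matter how the set hashes
print(run_a)  # ['scale_to_0_1/bar', 'scale_to_0_1/baz', 'scale_to_0_1/foo']
```

The other remedy the error message suggests is passing an explicit, unique name argument to each tf.Transform analyzer call, which makes the order irrelevant.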
@1025KB is this fixed in the latest release?
Hi @bli00,
Apologies for the delay. I found a similar issue, #1696, where the user found a workaround. It seems like a version-compatibility issue between the Kubeflow Pipelines backend and TFX, so please check the Kubeflow Pipelines Backend and TFX compatibility matrix, and also see "Upgrading Kubeflow Pipelines deployment on Google Cloud"; for your reference I found one good article that I hope will help you resolve your issue.
Could you please try running your TFX pipeline with versions from the compatibility matrix and check whether that resolves your issue?
If the issue still persists, please let us know, and if possible share the error log so we can investigate further and find the root cause.
Thank you!
Hi @bli00,
Closing this issue due to lack of recent activity for a couple of weeks. Please feel free to reopen the issue or post comments if you need any further assistance or an update. Thank you!
If the bug is related to a specific library below, please raise an issue in the respective repo directly:
TensorFlow Data Validation Repo
TensorFlow Model Analysis Repo
TensorFlow Transform Repo
TensorFlow Serving Repo
System information
- KFP version: 1.8.11
- (pip freeze output):

Describe the current behavior
When running the pipeline on Google Colab with beam_pipeline_args containing --runner=DataflowRunner, I get the error "ModuleNotFoundError: No module named 'user_module_0'". Full stacktrace in the screenshot attached.

Describe the expected behavior
Trainer module. This is taken straight from the tutorial with some minor alterations:
The pipeline definition, also taken straight from the tutorial with minimal modifications:

Standalone code to reproduce the issue
Run it in Google Colab.

For reference, I see these two issues are still not resolved: [1, 2]. I tried the suggested solution of setting force_tf_compat_v1=True; I still got the same error. It's also worth noting that my module is stored in GCS; module_file is a GCS URI. In addition, I'm not importing anything, unlike the other two issues. I just have one trainer.py and I'm just trying to run the tutorials.