tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0
2.11k stars 709 forks source link

Components cannot work with regular expression artifact uris anymore #1288

Closed martin-laurent closed 3 years ago

martin-laurent commented 4 years ago

Since this change: https://github.com/tensorflow/tfx/commit/66fe1a70a26e6ab9a2d01ffb3b62985752344167 integrated in TFX 0.15, it is not possible anymore to use a regular expression as a artifact uri, because the second check in verify_input_artifacts will fail: https://github.com/tensorflow/tfx/blob/936018566cd379c0aa5d1e1535eb052b782135da/tfx/components/base/base_driver.py#L81 However all components' executor implementation handles gracefully regular expression as far as I know.

This is drastic limitation, could you explain more why this change was needed, and if it would be possible to change the implementation (maybe using tf.io.gfile.glob as a fallback if the uri does not exists?)

andrewsmartin commented 4 years ago

I want to +1 this and add additional context. In my organization, we often save out additional metadata alongside the .tfrecord files, including a snapshot of the schema as well as additional metadata relating to our internal dataset registry. Many of the component executors - as mentioned already by @martin-laurent - use an "all files pattern" when reading tfrecord files, rather than something like *tfrecords*, which breaks for our case. So we end up having to copy and paste the executors, since it is not possible to supply a custom file pattern.

How hard would it be to allow passing in a custom file pattern? If these executors could be refactored a bit to have a def get_file_pattern() -> str method, we would be able to just override that without copying and pasting the entire .Do() method.

chongkong commented 4 years ago

Would you provide more context for using a glob pattern in Artifact URI? What I can tell you is that Artifact URI should point to the single directory, (not a multiple directories or files via glob) where it contains all the files that are only limited to that single Artifact (as we don't have a nested artifact concept yet). You're not supposed to share the directory for storing files from different Artifacts. That being said, you can simply use the parent directory of your artifact files as a Artifact URI.

In @andrewsmartin case, the recommended approach is to produce a different (custom defined) Artifact in a different URI.

zhitaoli commented 4 years ago

+1 for what @chongkong said: the URI is at least designed to point to a single directory. The system might have allowed other cases to operate so far but I highly suspect once we implement wipe-out/garbage collection, other assumptions would fail.

You can still use a glob pattern if it operates inside the single directory of the URI safely.

1025KB commented 4 years ago

example gen's input artifact is a directory, default we will glob every file in that dir, if you want to custom the input pattern, use input config

rmothukuru commented 4 years ago

@martin-laurent, Can you please respond to @1025KB's comment. Thanks!

martin-laurent commented 4 years ago

So in my use case, I am not using ExampleGen component in my pipeline. The reason why is that the training and test data comes from independant and relatively sophisticated data pipelines that are executed outside of Kubeflow. Next to the several TBs of tfrecords, we generate a handful of other files relative to data completeness and GDPR compliance. This is a company wide policy I cannot change and I am not willing to duplicate all this data just for the sake of having "just" tfrecords in my original artifats. So far, we had no issue working with a original artifacts' URI ending in *tfrecord* which would grab all use cases.

neuromage commented 4 years ago

@ruoyu90 @zhitaoli @1025KB @chongkong It seems ok to me to relax that check. If the file doesn't exist, then the executor will fail at runtime anyway. Reducing file-system checks from default driver seems like a reasonable request, and leads to more portable driver behaviour, so I'm in favour of removing the existence check altogether.

neuromage commented 4 years ago

/cc @numerology

andrewsmartin commented 4 years ago

Hi, just checking back in on this - @1025KB your suggestion to use InputConfig is a good one, but it seems that is only available for the ExampleGen components. I would like to be able to specify such a file pattern for any component that reads tfrecords using Beam - for example StatisticsGen, evaluator etc. Is there a good solution there? I don't want to work around any design assumptions TFX makes around treating artifact URIs as a single directory, so it seems like being able to specify file patterns across the board would be a good solution.

martin-laurent commented 4 years ago

Hi @numerology @chongkong any update on this?

chongkong commented 3 years ago

I was revisiting unresolved issues and found this today. We're still regarding the artifact to point to the single directory, and in this particular case I think passing multiple artifacts per each data source directory makes sense. Some components including Trainer are capable of handling multiple Examples artifacts in a single execution.

If you think a single artifact should have a multiple URIs or have other ideas, please reopen this issue or create a new one for proposals. I'm closing this for now.

google-ml-butler[bot] commented 3 years ago

Are you satisfied with the resolution of your issue? Yes No