Closed martin-laurent closed 3 years ago
I want to +1 this and add additional context. In my organization, we often save out additional metadata alongside the .tfrecord
files, including a snapshot of the schema as well as additional metadata relating to our internal dataset registry. Many of the component executors - as mentioned already by @martin-laurent - use an "all files pattern" when reading tfrecord files, rather than something like *tfrecords*
, which breaks for our case. So we end up having to copy and paste the executors, since it is not possible to supply a custom file pattern.
How hard would it be to allow passing in a custom file pattern? If these executors could be refactored a bit to have a def get_file_pattern() -> str
method, we would be able to just override that without copying and pasting the entire .Do()
method.
Would you provide more context for using a glob pattern in Artifact
URI? What I can tell you is that Artifact
URI should point to the single directory, (not a multiple directories or files via glob) where it contains all the files that are only limited to that single Artifact
(as we don't have a nested artifact concept yet). You're not supposed to share the directory for storing files from different Artifact
s. That being said, you can simply use the parent directory of your artifact files as a Artifact
URI.
In @andrewsmartin case, the recommended approach is to produce a different (custom defined) Artifact
in a different URI.
+1 for what @chongkong said: the URI is at least designed to point to a single directory. The system might have allowed other cases to operate so far but I highly suspect once we implement wipe-out/garbage collection, other assumptions would fail.
You can still use a glob pattern if it operates inside the single directory of the URI safely.
example gen's input artifact is a directory, default we will glob every file in that dir, if you want to custom the input pattern, use input config
@martin-laurent, Can you please respond to @1025KB's comment. Thanks!
So in my use case, I am not using ExampleGen component in my pipeline. The reason why is that the training and test data comes from independant and relatively sophisticated data pipelines that are executed outside of Kubeflow. Next to the several TBs of tfrecords, we generate a handful of other files relative to data completeness and GDPR compliance. This is a company wide policy I cannot change and I am not willing to duplicate all this data just for the sake of having "just" tfrecords in my original artifats. So far, we had no issue working with a original artifacts' URI ending in *tfrecord*
which would grab all use cases.
@ruoyu90 @zhitaoli @1025KB @chongkong It seems ok to me to relax that check. If the file doesn't exist, then the executor will fail at runtime anyway. Reducing file-system checks from default driver seems like a reasonable request, and leads to more portable driver behaviour, so I'm in favour of removing the existence check altogether.
/cc @numerology
Hi, just checking back in on this - @1025KB your suggestion to use InputConfig is a good one, but it seems that is only available for the ExampleGen
components. I would like to be able to specify such a file pattern for any component that reads tfrecords using Beam - for example StatisticsGen, evaluator etc. Is there a good solution there? I don't want to work around any design assumptions TFX makes around treating artifact URIs as a single directory, so it seems like being able to specify file patterns across the board would be a good solution.
Hi @numerology @chongkong any update on this?
I was revisiting unresolved issues and found this today. We're still regarding the artifact to point to the single directory, and in this particular case I think passing multiple artifacts per each data source directory makes sense. Some components including Trainer are capable of handling multiple Examples artifacts in a single execution.
If you think a single artifact should have a multiple URIs or have other ideas, please reopen this issue or create a new one for proposals. I'm closing this for now.
Since this change: https://github.com/tensorflow/tfx/commit/66fe1a70a26e6ab9a2d01ffb3b62985752344167 integrated in TFX 0.15, it is not possible anymore to use a regular expression as a artifact uri, because the second check in
verify_input_artifacts
will fail: https://github.com/tensorflow/tfx/blob/936018566cd379c0aa5d1e1535eb052b782135da/tfx/components/base/base_driver.py#L81 However all components' executor implementation handles gracefully regular expression as far as I know.This is drastic limitation, could you explain more why this change was needed, and if it would be possible to change the implementation (maybe using
tf.io.gfile.glob
as a fallback if the uri does not exists?)