rcrowe-google opened this issue 3 years ago
I will take lead on this (if any one wants to join, feel free to). Working on project proposal and will bring to SIG TFX-Addons bi-weeklys
Just adding some comments here since we have a component internally that does this, but it's not particularly robust (but has sat around for 3ish years as-is because it's been good-enough). I've also helped at least one team develop a variation of the component to suit their needs. Over time we've realized a few inefficiencies with our implementation that may help with the development of a component like this.
The reason this component is useful comes down to TFX's implementation of `Examples` artifacts, in particular how splits are managed. For a dataset to be usable as an `Examples` artifact, it must have a specific directory structure:
```
path/to/directory
├── Split-train
│   ├── some_data.tfrecord
│   └── ....tfrecord
└── Split-eval
    ├── some_other_data.tfrecord
    └── ....tfrecord
```
This means that you can't just take a dataset (or set of datasets) from anywhere, use an `Importer`, and pass that data to a component, unless your source data is already formatted that way. In many cases it's not too hard to get the team to output their data in this format, but there are cases where it's not possible.
Currently, the only way to handle this is to use an `ImportExampleGen`. If you have pre-existing splits, you have to take a somewhat hacky approach with the `input_config`, where the base URI is just `gs://` and each split pattern contains the rest of the fully qualified URI.
In addition to being non-obvious, this is also very inefficient, especially for large datasets (TB+), since an entire Dataflow job with a shuffle has to be kicked off.
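For reference, the hacky configuration described above looks roughly like this, expressed as the `example_gen_pb2.Input` proto in text format (bucket and paths here are hypothetical, for illustration only):

```
# ImportExampleGen(input_base="gs://", input_config=<the proto below>)
splits {
  name: "train"
  pattern: "my-bucket/data/train/*"
}
splits {
  name: "eval"
  pattern: "my-bucket/data/eval/*"
}
```

Each split pattern has to carry the bucket and full path itself, because `input_base` has been reduced to just the scheme.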
For this reason, a copy-based component is much preferred and simpler.
Two use cases we've seen: teams that want to use an `Examples` artifact in their training pipeline, and teams that create an `Examples` artifact with one split by copying the data into the correct structure.

While I didn't work on the component myself, we had an engineer do some basic benchmarking of a few approaches. This happened in 2019, so it may be out of date. He compared the Python gfile API, in combination with multi-threading/multi-processing, against shelling out to gsutil. I don't have stats on the test dataset, but the results are summarized as follows:
There were some significant cons noted to using gsutil, which made it much less appealing despite its superior performance:
Our component went with a multi-process gfile approach, but we later learned this has a significant downside: for very large datasets (one example dataset was multiple TB with 4.4 GB shards), we would either OOM or hit the IOPS limit of the Docker container, presumably because the data being copied is written temporarily to disk. This can be mitigated by giving the user control of the degree of parallelism (which defaulted to a multiple of machine cores).
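As a rough illustration of that mitigation, a copy helper could expose the worker count directly instead of deriving it from core count. This is a minimal local-filesystem sketch; `parallel_copy` and its signature are hypothetical, and a real component would go through `tf.io.gfile` (or similar) so `gs://` URIs work too:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(file_pairs, max_workers=4):
    """Copy (src, dest) file pairs concurrently.

    max_workers is user-controlled rather than derived from machine
    cores, to avoid the OOM / IOPS saturation described above when
    shards are very large.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all copies to complete and surfaces exceptions.
        list(pool.map(lambda pair: shutil.copyfile(*pair), file_pairs))
```

A process pool could be swapped in for the thread pool if the copy path is CPU-bound, but for pure I/O, threads avoid the extra memory overhead.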
In my experience, the ideal component would allow you to pass a dict mapping of `split_name: uri`; or, instead of a URI, an `ExternalArtifact` artifact would work just the same. This way you have flexibility in the splits you want to include.
Choose the underlying copy implementation wisely, and test with large datasets if possible, for robustness. gfile has issues with performance, and gsutil has issues with maintainability (it's not Python-native) and flexibility (i.e., it can't be used in local mode with local files). The APIs may have improved since; for example, this thread was unresolved at the time of our implementation.
Hey @rclough, thanks for the helpful input and suggestions as we try to develop this project proposal. I have a question regarding your conclusion that the ideal component would allow the user to pass a dict mapping of `split_name: uri`:

- Does this imply the component should allow passing multiple URIs, such as `path/to/directory/split_name1` and `path/to/directory/split_name2`? For example, a `split_train: uri` from example folder A and a `split_eval: uri` from example folder B?
- What about a `split_name: uri` coming from a separate cloud provider?

I think these questions both highlight a need for clarity of input that I'd glossed over.
I don't aim to answer these conclusively but to bring up some considerations.
I would argue that it is more helpful to avoid expecting an `Examples` artifact as input, since that's typically the output of an ExampleGen component. For many teams at my company, half the need for a copying component is that the inputs are *not* an `Examples` artifact (i.e., they don't have splits), so validating that the inputs resemble that structure would not be a good idea. The alternative, then, is either that the component takes raw URIs, or that there's an artifact type to represent those URIs that can be used with an `Importer`.
My mistake on mentioning `ExternalArtifact` - I hadn't realized it was actually an internal invention. We use it as a generic artifact type for items that come from outside of TFX. They are often used as inputs to custom components (where the component authors have some understanding of how it's expected to be used), or in our custom ExampleGen components that take various URIs for data that hasn't been formatted for TFX.
So ultimately my thought was a dict input like this example:

```json
{
  "train": "gs://some/path/to/train_data",
  "eval": "gs://golden_eval_data",
  "extra_holdout": "gs://somewhere/else/data"
}
```
Or alternatively, those URIs could actually be artifacts. The component would loop through and create a split for each key, copying the data from the value URI to `os.path.join(output_uri, split_name_key)`.
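That loop might be sketched as follows. This is a hypothetical helper, not the actual internal component: `shutil` stands in for `tf.io.gfile`, so it only works for local paths, and the `Split-` prefix follows the directory layout shown earlier in the thread:

```python
import os
import shutil

def materialize_examples(split_uris, output_uri):
    """Copy each split's files into the Split-<name> layout TFX expects.

    split_uris maps split names to source directories, e.g.
    {"train": "/data/train", "eval": "/data/eval"}.
    """
    for split_name, src_uri in split_uris.items():
        dest = os.path.join(output_uri, f"Split-{split_name}")
        os.makedirs(dest, exist_ok=True)
        for fname in os.listdir(src_uri):
            shutil.copyfile(os.path.join(src_uri, fname),
                            os.path.join(dest, fname))
```

The per-file copies inside the loop are where the parallelism discussed above would slot in.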
Lastly, regarding other cloud providers: that's actually a really good question, and it's probably part of the tradeoff that must be made when choosing the copy implementation. I'm sure it would be great if many common cloud APIs were supported, but you probably need to draw a reasonable scope. In my case we only use local directories and GCS, so I'm not sure how much is realistic to support, multiplied by the performance considerations (i.e., something like gfile might be more generic and support S3 etc., but may not perform as well as gsutil for GCS).
In cases where the data does not need to be shuffled, this component will avoid using a Beam job and instead do a simple copy of the data to create the dataset artifact. It will need to be a completely custom ExampleGen, not an extension of BaseExampleGen, in order to implement this behavior.
@rclough @1025KB