rcrowe-google opened this issue 3 years ago
I will take lead on this (if any one wants to join, feel free to). Working on project proposal and will bring to SIG TFX-Addons bi-weeklys
Just adding some comments here since we have a component internally that does this, but it's not particularly robust (but has sat around for 3ish years as-is because it's been good-enough). I've also helped at least one team develop a variation of the component to suit their needs. Over time we've realized a few inefficiencies with our implementation that may help with the development of a component like this.
The reason this component is useful comes down to TFX's implementation of `Examples` artifacts, in particular how splits are managed. For a dataset to be usable as an `Examples` artifact, it must have a specific directory structure:
```
path/to/directory
├── Split-train
│   ├── some_data.tfrecord
│   └── ....tfrecord
└── Split-eval
    ├── some_other_data.tfrecord
    └── ....tfrecord
```
This means that you can't just take a dataset (or set of datasets) from anywhere, use an `Importer`, and pass that data to a component, unless your source data is already formatted that way. In many cases it's not too hard to get the team to output their data in this format, but there are cases where it's not possible.
Currently, the only way to handle this is to use an `ImportExampleGen`. If you have pre-existing splits, you have to take a somewhat hacky approach with the `input_config`, where the base URI is just `gs://` and each split pattern contains the rest of the fully qualified URI.
In addition to being non-obvious, this is also very inefficient, especially for large datasets (TB+), since an entire Dataflow job with a shuffle has to be kicked off.
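For reference, the hacky configuration described above looks roughly like this, expressed as the `example_gen_pb2.Input` proto in text format (bucket and paths here are hypothetical, for illustration only):

```
# ImportExampleGen(input_base="gs://", input_config=<the proto below>)
splits {
  name: "train"
  pattern: "my-bucket/data/train/*"
}
splits {
  name: "eval"
  pattern: "my-bucket/data/eval/*"
}
```

Each split pattern has to carry the bucket and full path itself, because `input_base` has been reduced to just the scheme.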
For this reason, a copy-based component is much preferred and simpler.
Two use cases we've seen: teams that want to use an `Examples` artifact in their training pipeline, and teams that create an `Examples` artifact with one split by copying the data into the correct structure.

While I didn't work on the component myself, we had an engineer do some basic benchmarking of a few approaches. This happened in 2019, so it may be out of date. He compared the Python gfile API, in combination with multi-threading/multi-processing, against shelling out to gsutil. I don't have stats on the test dataset, but the results are summarized as follows:
There were some significant cons noted to using gsutil, which made it much less appealing despite its superior performance:
Our component went with a multi-process gfile approach, but we later learned this has a significant downside: for very large datasets (one example dataset was multiple TB with 4.4 GB shards), we would either OOM or hit the IOPS limit of the Docker container, presumably because the data being copied is written temporarily to disk. This can be mitigated by giving the user control of the degree of parallelism (which defaulted to a multiple of machine cores).
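As a rough illustration of that mitigation, a copy helper could expose the worker count directly instead of deriving it from core count. This is a minimal local-filesystem sketch; `parallel_copy` and its signature are hypothetical, and a real component would go through `tf.io.gfile` (or similar) so `gs://` URIs work too:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

def parallel_copy(file_pairs, max_workers=4):
    """Copy (src, dest) file pairs concurrently.

    max_workers is user-controlled rather than derived from machine
    cores, to avoid the OOM / IOPS saturation described above when
    shards are very large.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() forces all copies to complete and surfaces exceptions.
        list(pool.map(lambda pair: shutil.copyfile(*pair), file_pairs))
```

A process pool could be swapped in for the thread pool if the copy path is CPU-bound, but for pure I/O, threads avoid the extra memory overhead.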
In my experience, the ideal component would allow you to pass a dict mapping of `split_name: uri`; or, instead of a URI, an `ExternalArtifact` artifact would work just the same. This way you have flexibility in the splits you want to include.
Choose the underlying copy implementation wisely, and test with large datasets if possible, for robustness. gfile has issues with performance, and gsutil has issues with maintainability (it's not Python-native) and flexibility (i.e., it can't be used in local mode with local files). The APIs may have improved since; for example, this thread was unresolved at the time of our implementation.
Hey @rclough, thanks for the helpful input and suggestions as we try to develop this project proposal. I have a question regarding your conclusion that the ideal component would allow the user to pass a dict mapping of `split_name: uri`:

- Does this imply the component should allow passing multiple URIs, such as `path/to/directory/split_name1` and `path/to/directory/split_name2`? For example, a `split_train: uri` from example folder A and a `split_eval: uri` from example folder B?
- What about a `split_name: uri` coming from a separate cloud provider?

I think these questions both highlight a need for clarity of input that I'd glossed over.
I don't aim to answer these conclusively but to bring up some considerations.
I would argue that it is more helpful to avoid expecting an `Examples` artifact as input, since that's typically the output of an ExampleGen component. For many teams at my company, half the need for a copying component is that the inputs are *not* an `Examples` artifact (i.e., they don't have splits), so validating that the inputs resemble that structure would not be a good idea. The alternative, then, is either that the component takes raw URIs, or that there's an artifact type to represent those URIs that can be used with an `Importer`.
My mistake on mentioning `ExternalArtifact` - I hadn't realized it was actually an internal invention. We use it as a generic artifact type for items that come from outside of TFX. They are often used as inputs to custom components (where the component authors have some understanding of how it's expected to be used), or in our custom ExampleGen components that take various URIs for data that hasn't been formatted for TFX.
So ultimately my thought was a dict input like this example:

```json
{
  "train": "gs://some/path/to/train_data",
  "eval": "gs://golden_eval_data",
  "extra_holdout": "gs://somewhere/else/data"
}
```
Or alternatively, those URIs could actually be artifacts. The component would loop through and create a split for each key, copying the data from the value URI to `os.path.join(output_uri, split_name_key)`.
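That loop might be sketched as follows. This is a hypothetical helper, not the actual internal component: `shutil` stands in for `tf.io.gfile`, so it only works for local paths, and the `Split-` prefix follows the directory layout shown earlier in the thread:

```python
import os
import shutil

def materialize_examples(split_uris, output_uri):
    """Copy each split's files into the Split-<name> layout TFX expects.

    split_uris maps split names to source directories, e.g.
    {"train": "/data/train", "eval": "/data/eval"}.
    """
    for split_name, src_uri in split_uris.items():
        dest = os.path.join(output_uri, f"Split-{split_name}")
        os.makedirs(dest, exist_ok=True)
        for fname in os.listdir(src_uri):
            shutil.copyfile(os.path.join(src_uri, fname),
                            os.path.join(dest, fname))
```

The per-file copies inside the loop are where the parallelism discussed above would slot in.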
Lastly, regarding other cloud providers: that's actually a really good question, and it's probably part of the tradeoff that must be made when choosing the copy implementation. I'm sure it would be great if many common cloud APIs were supported, but you probably need to draw a reasonable scope. In my case we only use local directories and GCS, so I'm not sure how much is realistic to support, multiplied by the performance considerations (i.e., something like gfile might be more generic and support S3 etc., but may not perform as well as gsutil for GCS).
In cases where the data does not need to be shuffled, this component will avoid using a Beam job and instead do a simple copy of the data to create the dataset artifact. It will need to be a completely custom ExampleGen, not an extension of BaseExampleGen, in order to implement this behavior.
@rclough @1025KB