Unfortunately, ParallelRunStep cannot produce multiple outputs (OutputFileDatasetConfig or PipelineData) the same way as PythonScriptStep.
PythonScriptStep executes on a single compute node, whereas ParallelRunStep distributes the workload across a compute cluster and runs many tasks in parallel. With output_action="append_row", ParallelRunStep concatenates the results from the tasks executed on the different nodes into one output file, parallel_run_step.txt, and that combined file is the single ParallelRunStep output written to the blob store.
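To make the contrast concrete, here is a minimal sketch of a typical append_row setup (assuming the v1 azureml SDK; `env`, `cluster`, and `input_ds` are placeholders you would already have defined):

```python
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

parallel_run_config = ParallelRunConfig(
    source_directory="scripts",
    entry_script="preprocess.py",
    output_action="append_row",                    # concatenate per-mini-batch results
    append_row_file_name="parallel_run_step.txt",  # the single combined output file
    error_threshold=10,
    compute_target=cluster,
    environment=env,
    node_count=4,
)

step = ParallelRunStep(
    name="preprocess",
    parallel_run_config=parallel_run_config,
    inputs=[input_ds.as_named_input("raw_data")],
    # Note the singular `output`: ParallelRunStep accepts exactly one output,
    # unlike PythonScriptStep's `outputs` list.
    output=OutputFileDatasetConfig(name="combined_results"),
)
```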
We're trying to use a ParallelRunStep for data preprocessing and wondering whether it's possible to use an OutputFileDatasetConfig to register a "dataset of datasets" (a dataset of metadata). Our entry script writes each processed file as parquet to a local directory on the compute cluster (./outputs/data), but the only thing that actually gets written to our blob store with the strategy below is parallel_run_step.txt. The same strategy works for us in a different PythonScriptStep, but it does not seem to behave the same way with a ParallelRunStep. We've also tried PipelineData instead of OutputFileDatasetConfig.
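For context, a trimmed-down sketch of the kind of strategy we mean (not the exact code from our pipeline; `datastore` stands in for our registered blob datastore):

```python
from azureml.data import OutputFileDatasetConfig

# Intended "dataset of datasets": register everything the entry script
# writes under `source` as one FileDataset when the step completes.
processed = (
    OutputFileDatasetConfig(
        name="processed_data",
        source="outputs/data",  # local directory the entry script writes parquet files to
        destination=(datastore, "preprocessed/{run-id}"),
    )
    .as_upload(overwrite=True)
    .register_on_complete(name="processed_dataset")
)

# Wired into a PythonScriptStep, this uploads and registers the parquet files
# as expected; attached to a ParallelRunStep, only parallel_run_step.txt ever
# shows up in the blob store.
```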