Using OutputFileDatasetConfig with a ParallelRunStep

We're trying to use a ParallelRunStep for data preprocessing and are wondering whether an OutputFileDatasetConfig can be used to register a "dataset of datasets" (a dataset of metadata). Our entry script writes each processed file as parquet to a local directory on the compute cluster (./outputs/data). With the strategy below, however, the only file that actually gets written to our blob store is parallel_run_step.txt. The same strategy works for us on a different PythonScriptStep, but it does not seem to work the same way with a ParallelRunStep. We've also tried PipelineData in place of OutputFileDatasetConfig.


from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Intended: upload ./outputs/data to the datastore and register it
# as "preprocessed_files" when the step completes.
output_dir = OutputFileDatasetConfig(
    name="etl_prepped",
    destination=(DEFAULT_DATASTORE, "data/etl_prepped"),
    source="./outputs/data/",
).register_on_complete("preprocessed_files")

parallel_run_config = ParallelRunConfig(
    source_directory=parent_dir,
    entry_script="etl.py",
    mini_batch_size="1",
    run_invocation_timeout=timeout,
    error_threshold=10,
    # Values returned by run() are appended to parallel_run_step.txt.
    output_action="append_row",
    environment=CURATED_ENVIRONMENT,
    process_count_per_node=processes_per_node,
    compute_target=COMPUTE_TARGET,
    node_count=node_count,
)

parallel_run_step = ParallelRunStep(
    name="etl",
    parallel_run_config=parallel_run_config,
    inputs=[small_seeq_cache],
    output=output_dir,
    allow_reuse=False,
)
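
For context, here is a minimal sketch of the general shape of an entry script in this kind of setup. It is not our actual etl.py. It follows the init()/run(mini_batch) contract that ParallelRunStep expects, and it assumes the input is a file dataset, so that each mini-batch is a list of file paths. The --output_dir argument is hypothetical: the step would have to pass it explicitly, and we have not confirmed that doing so changes what lands in the blob store. The parquet conversion is replaced by a plain file copy so the sketch needs only the standard library.

```python
import argparse
import shutil
from pathlib import Path

OUTPUT_DIR = None  # set once per worker process in init()


def init(argv=None):
    """Read the (hypothetical) --output_dir argument and create the directory."""
    global OUTPUT_DIR
    parser = argparse.ArgumentParser()
    # Assumption: the step passes this argument; default mirrors ./outputs/data.
    parser.add_argument("--output_dir", default="./outputs/data")
    args, _ = parser.parse_known_args(argv)  # ignore any other arguments
    OUTPUT_DIR = Path(args.output_dir)
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def run(mini_batch):
    """Process each input file and return one row per file for append_row."""
    rows = []
    for file_path in mini_batch:
        src = Path(file_path)
        dst = OUTPUT_DIR / (src.stem + ".parquet")
        # Placeholder for the real transform; an actual script would build a
        # DataFrame and call df.to_parquet(dst) instead of copying bytes.
        shutil.copyfile(src, dst)
        rows.append(f"{src.name},{dst.name}")
    return rows
```

With output_action="append_row", the strings returned by run() are what end up in parallel_run_step.txt; the files written under OUTPUT_DIR are the part that we expected to be uploaded as well.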

microsoft/solution-accelerator-many-models, issue #130