microsoft / solution-accelerator-many-models

MIT License

Using OutputFileDatasetConfig with a ParallelRunStep #130

Closed sp1cefl0w closed 2 years ago

sp1cefl0w commented 3 years ago

We're trying to use a ParallelRunStep for data preprocessing and were wondering whether it's possible to use an OutputFileDatasetConfig to register a "dataset of datasets" (a dataset of metadata). Our entry script writes each processed file as Parquet to a local directory on the compute cluster (./outputs/data), but with the setup below the only file that actually reaches our blob store is parallel_run_step.txt. The same approach works for us on a PythonScriptStep, but it does not seem to behave the same way with a ParallelRunStep. We've also tried PipelineData instead of OutputFileDatasetConfig.


from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

# Intended to upload ./outputs/data to the datastore and register it on completion.
output_dir = OutputFileDatasetConfig(
    name="etl_prepped",
    destination=(DEFAULT_DATASTORE, 'data/etl_prepped'),
    source='./outputs/data/',
).register_on_complete('preprocessed_files')

# One input file per mini-batch; results are appended row by row.
parallel_run_config = ParallelRunConfig(
    source_directory=parent_dir,
    entry_script='etl.py',
    mini_batch_size="1",
    run_invocation_timeout=timeout,
    error_threshold=10,
    output_action="append_row",
    environment=CURATED_ENVIRONMENT,
    process_count_per_node=processes_per_node,
    compute_target=COMPUTE_TARGET,
    node_count=node_count,
)

parallel_run_step = ParallelRunStep(
    name="etl",
    parallel_run_config=parallel_run_config,
    inputs=[small_seeq_cache],
    output=output_dir,
    allow_reuse=False,
)

tracychms commented 3 years ago

Unfortunately, ParallelRunStep cannot produce multiple outputs (OutputFileDatasetConfig or PipelineData) the way PythonScriptStep can.

PythonScriptStep executes on a single compute node. ParallelRunStep works differently: it distributes the workload across a compute cluster and executes multiple tasks in parallel. With output_action="append_row", ParallelRunStep combines the results of the tasks, which run on different nodes, into one output file, parallel_run_step.txt. This combined file is the output that ParallelRunStep writes to the blob store.
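
To make the append_row behaviour described above more concrete, here is a minimal sketch of an entry script in the init()/run(mini_batch) shape that ParallelRunStep expects, assuming a file dataset input so that each mini-batch arrives as a list of file paths. The row format (file name and size) is purely illustrative, and simulate_append_row is a hypothetical local stand-in for the aggregation, not the actual Azure ML implementation.

```python
import os
import tempfile


def init():
    """Called once per worker process before any mini-batch is handled."""
    pass


def run(mini_batch):
    """Process one mini-batch and return one result row per input file.

    Assumption: the input is a file dataset, so mini_batch is a list of paths.
    """
    rows = []
    for path in mini_batch:
        size = os.path.getsize(path)
        # Illustrative row format: "<file name>,<size in bytes>"
        rows.append(f"{os.path.basename(path)},{size}")
    return rows


def simulate_append_row(mini_batches):
    """Hypothetical local stand-in: concatenate the rows returned by every
    run() call, which is roughly what ends up in parallel_run_step.txt."""
    combined = []
    for batch in mini_batches:
        combined.extend(run(batch))
    return "\n".join(combined)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, content in [("a.csv", "1,2,3"), ("b.csv", "4,5")]:
            path = os.path.join(tmp, name)
            with open(path, "w") as f:
                f.write(content)
            paths.append(path)
        # mini_batch_size="1" corresponds to one file per mini-batch.
        init()
        print(simulate_append_row([[paths[0]], [paths[1]]]))
```

In this sketch only the values returned from run() reach the combined file, which matches the symptom in the question: files written locally by the entry script are not part of what append_row collects.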