tensorflow / tfx

TFX is an end-to-end platform for deploying production ML pipelines
https://tensorflow.github.io/tfx/
Apache License 2.0

BeamDagRunner beam args vs beam_pipeline_args #1030

Closed luischinchillagarcia closed 4 years ago

luischinchillagarcia commented 4 years ago

The BeamDagRunner docstring says the following:

"""
beam_orchestrator_args: beam args for the beam orchestrator. Note that
        this is different from the beam_pipeline_args within
        additional_pipeline_args, which is for beam pipelines in components.
"""

It's clear that beam_orchestrator_args are Beam arguments. However, it is less clear which arguments beam_pipeline_args and additional_pipeline_args accept, and why they are different.

Would it be possible to get the list of arguments for beam_orchestrator_args, beam_pipeline_args, and additional_pipeline_args? A more precise description that removes the ambiguity between them would also help.

numerology commented 4 years ago

Hi @luischinchillagarcia, to answer your first question: many of the stock executors are implemented as Beam pipelines (CsvExampleGen, for example), and beam_pipeline_args is for those. beam_orchestrator_args is for the BeamDagRunner itself, which orchestrates tasks whose executors can be arbitrary, so the two are different.

In short, beam_orchestrator_args is for BeamDagRunner, and beam_pipeline_args is for executors that use a Beam pipeline to do their job.

gowthamkpr commented 4 years ago

@luischinchillagarcia I hope @numerology answered your question pretty well. Can I close this issue?

luischinchillagarcia commented 4 years ago

@numerology Thank you for your response. It definitely clears up the difference between beam_orchestrator_args and beam_pipeline_args.

However, @gowthamkpr, I'm still unsure about the relationship between `additional_pipeline_args` and the latter two arguments.

Secondly, does this mean that, assuming we are using the stock executors, all of the option arguments are exactly the same as the Beam arguments? In other words, both would have the exact same list of arguments that just work to specify for the orchestration or the individual components?

numerology commented 4 years ago

> relationship between `additional_pipeline_args` and the latter two arguments.

beam_pipeline_args is a field in additional_pipeline_args, which is a nested dictionary. Both are 'orthogonal' to beam_orchestrator_args. additional_pipeline_args can also include other things, such as the workflow ID when running on KFP.
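To make the nesting concrete, here is a rough sketch using plain Python data only. The flag values and the `component_beam_args` helper are assumptions for illustration, not part of TFX; only the names `beam_orchestrator_args`, `additional_pipeline_args`, and `beam_pipeline_args` come from the docstring quoted above.

```python
# Assumption: plain Python data only, to show the shape of the arguments.
# No TFX import is needed for this sketch.

# Beam flags for the pipeline that BeamDagRunner itself uses to orchestrate tasks.
beam_orchestrator_args = ["--runner=DirectRunner"]

# Nested dictionary of pipeline-level extras; beam_pipeline_args is one field in it.
# These flags are meant for the Beam pipelines that run inside component executors.
additional_pipeline_args = {
    "beam_pipeline_args": ["--runner=DirectRunner", "--direct_num_workers=1"],
}


def component_beam_args(extra_args):
    """Hypothetical helper: return the Beam flags intended for component executors."""
    return list(extra_args.get("beam_pipeline_args", []))


print(component_beam_args(additional_pipeline_args))
```

The two lists are kept apart on purpose: changing the orchestrator flags does not touch what the executors receive, and vice versa.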

> In other words, both would have the exact same list of arguments that just work to specify for the orchestration or the individual components?

They are all Beam arguments. One thing worth mentioning here is that the validity of the arguments can only be judged all together rather than individually. For example, when we have runner=DataflowRunner we can specify project, region, etc., which make no sense when running locally.
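As a rough illustration of checking flags together rather than one at a time, the sketch below parses a flag list with the standard library and reports Dataflow-specific flags that were passed alongside a different runner. `check_flags_together` is a hypothetical helper, not a Beam or TFX function, and it only looks at the `--runner`, `--project`, and `--region` flags mentioned above.

```python
import argparse


def check_flags_together(flags):
    """Hypothetical check: report cloud-only flags used with a non-Dataflow runner."""
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    # Assumption for this sketch: a missing --runner means a local run.
    parser.add_argument("--runner", default="DirectRunner")
    parser.add_argument("--project")
    parser.add_argument("--region")
    # Flags this sketch does not know about are ignored rather than rejected.
    opts, _unknown = parser.parse_known_args(flags)

    problems = []
    if opts.runner != "DataflowRunner":
        for name in ("project", "region"):
            if getattr(opts, name) is not None:
                problems.append(f"--{name} makes no sense with {opts.runner}")
    return problems


print(check_flags_together(["--runner=DirectRunner", "--project=my-project"]))
print(check_flags_together(["--runner=DataflowRunner", "--project=my-project", "--region=us-central1"]))
```

The point is only that `--project` is neither valid nor invalid by itself; whether it belongs depends on the runner chosen in the same list.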

luischinchillagarcia commented 4 years ago

Perfect. Thank you for your response!