google / fhir-data-pipes

A collection of tools for extracting FHIR resources and analytics services on top of that data.
https://google.github.io/fhir-data-pipes/
Apache License 2.0
142 stars 82 forks source link

Investigate how the `maxWorkers` parameters affects the `number of Task Managers` #948

Closed chandrashekar-s closed 5 months ago

chandrashekar-s commented 5 months ago

The maxParallelism which is set by the maxWorkers over here, controls the maximum number of tasks to which keyed state can be distributed, i.e it is something like distributing the incoming streamed data over N keyed partitions. Few details can be found here. This looks like it will be used in a Streaming environment and this should not affect our pipelines as we don't have any streams.

Also, commenting on the number of task managers it is set to 1 by default in a local environment, can be found here. Increasing it in a local environment might not help much. However, in a clustered environment this gets dynamically calculated based on the parallelism set for operators in tasks (not the maxParallelism) and the number of taskSlots in a single TaskManager defined by this parameter.

Investigate further for a clustered environment on how the number of Task Managers are created dynamically and how is it dependent on the taskmanager.memory.network.max parameters.

chandrashekar-s commented 5 months ago

The maxParallelism in the FlinkPipelineOptions over here, is used to control the number of maximum number of tasks to which keyed state can be distributed, i.e, the maximum effective parallelism of an operator and this is applicable only in a Streaming environment. In our case since we don't have any pipelines using streams this parameter is not used anywhere. So the maxWorkers param will be removed.