spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.

CPU & Memory consumption seems very high when running the local scheduler #3035

Open eamonnfaherty opened 3 years ago

eamonnfaherty commented 3 years ago

I have built a workflow that just calls APIs and does very little processing on tiny JSON documents. The workflow runs in waves, where one wave cannot start until all previous waves have completed. There are around 10 waves, and each wave comprises around 1,400 tasks.

Some of the tasks in each wave also depend on tasks that run at the very beginning, before even the first wave starts.

This all runs with the local scheduler in a controlled environment.
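
Roughly, the shape of the workflow looks like this. This is a minimal sketch: the task names, the wave count, and the `workers` value are made up, and the real tasks call APIs rather than writing marker files:

```python
import luigi


class Bootstrap(luigi.Task):
    """Runs once, before any wave starts."""

    def output(self):
        return luigi.LocalTarget("bootstrap.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done")


class WaveTask(luigi.Task):
    """One of the ~1,400 API-calling tasks in a given wave."""

    wave = luigi.IntParameter()
    index = luigi.IntParameter()

    def requires(self):
        deps = [Bootstrap()]
        if self.wave > 0:
            # A wave cannot start until the previous wave has finished.
            deps.append(WaveBarrier(wave=self.wave - 1))
        return deps

    def output(self):
        return luigi.LocalTarget(f"wave_{self.wave}_{self.index}.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done")


class WaveBarrier(luigi.WrapperTask):
    """Completes only when every task in its wave is done."""

    wave = luigi.IntParameter()

    def requires(self):
        return [WaveTask(wave=self.wave, index=i) for i in range(1400)]


if __name__ == "__main__":
    # Everything runs in-process with the local scheduler.
    luigi.build([WaveBarrier(wave=9)], local_scheduler=True, workers=4)
```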

I am seeing very high CPU and memory usage.

BTW, CPU and memory usage are even higher in the latest version.

When I looked into the code previously, while this was a problem, I noticed that the workers are forked processes, which appeared to be causing a large spike.

I noticed that switching over to the central scheduler eased this problem significantly.
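
For anyone else hitting this, the switch looks roughly like the sketch below. I believe the keyword arguments to `luigi.build` map onto luigi's core config options, and the host and port shown are the luigid defaults, but treat the exact values as assumptions for your own setup (`Ping` is a made-up stand-in task):

```python
import luigi


class Ping(luigi.Task):
    """Hypothetical stand-in for the real top-level task."""

    def output(self):
        return luigi.LocalTarget("ping.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done")


# Instead of local_scheduler=True, point the workers at a running
# luigid instance (started separately, e.g. `luigid --port 8082`).
luigi.build(
    [Ping()],
    local_scheduler=False,
    scheduler_host="localhost",
    scheduler_port=8082,
    workers=4,
)
```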

Are there any strategies or guidance for keeping memory usage low? For example, should I be using static dependencies in `requires()`, or would dynamic dependencies, yielded in `run()`, be better?
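
To make the question concrete, here is a sketch of the two styles I am comparing (the `FetchRecord` task is made up):

```python
import luigi


class FetchRecord(luigi.Task):
    """Made-up leaf task standing in for one API call."""

    record_id = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget(f"record_{self.record_id}.json")

    def run(self):
        with self.output().open("w") as f:
            f.write("{}")


class StaticStyle(luigi.Task):
    """Static dependencies: the whole graph is declared up front in
    requires(), so the scheduler sees every task before anything runs."""

    def requires(self):
        return [FetchRecord(record_id=i) for i in range(1400)]

    def output(self):
        return luigi.LocalTarget("static.done")

    def run(self):
        with self.output().open("w") as f:
            f.write("done")


class DynamicStyle(luigi.Task):
    """Dynamic dependencies: tasks are yielded from run(), so they are
    only discovered and scheduled once run() actually executes."""

    def output(self):
        return luigi.LocalTarget("dynamic.done")

    def run(self):
        # Control returns to the worker until the yielded tasks are
        # complete, then run() resumes after the yield.
        yield [FetchRecord(record_id=i) for i in range(1400)]
        with self.output().open("w") as f:
            f.write("done")
```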

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If closed, you may revisit when your time allows and reopen! Thank you for your contributions.