nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0
17 stars 11 forks source link

Only build the graph on the batch_worker for non complete tasks. #185

Closed MarcelHoh closed 1 year ago

MarcelHoh commented 1 year ago

I had an issue where the batch worker seemed to get stuck in infinite loops of task requirements (at least I saw it loop over 1+ million tasks before killing the jobs. This is quite strange as such loops are not present within the pipeline and the actual process call is able to build the graph without issue.

While I haven't fully solved the issue yet, I found setting the batch worker to not check dependents of completed tasks sped up the task processing and at least for now avoided the above issue. I'm still investigating my original issue but this change might be more broadly beneficial.

codecov-commenter commented 1 year ago

Codecov Report

Base: 59.20% // Head: 59.20% // No change to project coverage :thumbsup:

Coverage data is based on head (bbf9278) compared to base (1ff3782). Patch coverage: 100.00% of modified lines in pull request are covered.

:mega: This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #185 +/- ## ======================================= Coverage 59.20% 59.20% ======================================= Files 23 23 Lines 1554 1554 ======================================= Hits 920 920 Misses 634 634 ``` | [Impacted Files](https://codecov.io/gh/nils-braun/b2luigi/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun) | Coverage Δ | | |---|---|---| | [b2luigi/\_\_init\_\_.py](https://codecov.io/gh/nils-braun/b2luigi/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun#diff-YjJsdWlnaS9fX2luaXRfXy5weQ==) | `88.46% <100.00%> (ø)` | | | [b2luigi/cli/runner.py](https://codecov.io/gh/nils-braun/b2luigi/pull/185?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun#diff-YjJsdWlnaS9jbGkvcnVubmVyLnB5) | `43.75% <100.00%> (ø)` | | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=Nils+Braun)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

MarcelHoh commented 1 year ago

Sorry, let me open a new PR with a better approach.

meliache commented 1 year ago

Hmnnn, I can't comment on your infinite loop issue without a minimal reproducible example. Regarding the code change for the run_as_batch_worker, I'm not eager to accept that without understanding why the existing option only_non_complete=True wasn't used in the first place, whether it was an oversight by @nils-braun or it's just a workaround for a bug that arises in your specific use-case but not in the general case. I also don't have the best understanding of the b2luigi core but run_as_batch_worker is as far as I understand the function that executes usually a single task on a remote batch worker in a separate remote b2luigi process that has been spawned with the --batch-runner cli argument. Therefore, usually, I would expect that run_as_batch_worker typically just receives a single task and it shouldn't know about the completion status of that task as it's a separate luigi process. So I'm confused how using only_non_complete=True would help there except something is horribly wrong, so I'm confused.

While writing this message I saw you closed the PR and announced you will do a new one. Just note I don't really have time for working on b2luigi due to a paper and my thesis deadline in a couple of weeks, so it might take until a weekend or a couple of weeks until I have time to properly review it.


Maybe the new option cache_task_completion added in https://github.com/spotify/luigi/pull/3178https://github.com/spotify/luigi/pull/3178https://github.com/spotify/luigi/pull/3178 and released in Luigi 3.1.1 might help with performance. You could try that option by updating to the latest luigi release and then setting that option in the worker configuration.

cache_task_completion By default, luigi task processes might check the completion status multiple times per task which is a safe way to avoid potential inconsistencies. For tasks with many dynamic dependencies, yielded in multiple stages, this might become expensive, e.g. in case the per-task completion check entails remote resources. When set to true, completion checks are cached so that tasks declared as complete once are not checked again. Defaults to false.

This is not a b2luigi setting, but a general luigi setting and thus should be set in a luigi.cfg or luigi.toml file, based on certain environment variables, so please see the Luigi configuration documentation on how to do that. For example, you could export the environment variables in your .bashrc

export LUIGI_CONFIG_PARSER=toml
export LUIGI_CONFIG_PATH=$HOME/luigi.toml

and create a file $HOME/luigi.toml with the contents

[worker]
cache_task_completion = true
MarcelHoh commented 1 year ago

Thanks for the quick feedback, and all the best for your submission ! I suspect only_non_complete was not enabled as this avoids the cost of checking whether each task is complete. I think therefore it makes sense to leave this as False as it stands currently. Let me continue discussion on the new PR as I see you are already reviewing it!