nils-braun / b2luigi

Task scheduling and batch running for basf2 jobs made simple
GNU General Public License v3.0

stop the task_iterator from re-traversing sections of the graph it has already seen. #186

Closed MarcelHoh closed 1 year ago

MarcelHoh commented 1 year ago

On the batch workers, the graph is reproduced locally until the task_id of the task to run is found. In doing so, if N tasks require a common input file, the task producing it and the structure of tasks it requires will be traversed N times.

This PR adds a simple set that stores which tasks have already been checked, and only checks the required tasks if the parent task has not already been seen.
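For illustration, here is a minimal, self-contained sketch of that idea; the name already_seen_tasks follows the PR description, and the function is a simplified stand-in for b2luigi's task_iterator, not the actual implementation:

from luigi.task import flatten


def task_iterator(task, already_seen_tasks=None):
    """Yield the task and all of its transitive requirements, each exactly once."""
    if already_seen_tasks is None:
        already_seen_tasks = set()

    yield task
    already_seen_tasks.add(task.task_id)

    # flatten() copes with requires() returning a single task, a list or a dict
    for required_task in flatten(task.requires()):
        if required_task.task_id not in already_seen_tasks:
            yield from task_iterator(required_task, already_seen_tasks)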

meliache commented 1 year ago

Thanks for the PR.

I still don't understand why the task iterator is given a task list with duplicate task IDs in the first place, but my understanding of the b2luigi core is also limited. Afaik the already_seen_tasks cache will not be shared globally between batch workers, as each batch worker runs a separate b2luigi Python process; it therefore only helps if run_as_batch_worker is given a task list with multiple duplicate tasks on each remote worker node, and I don't understand why that would happen.

Generally, making sure that task_iterator returns a unique list of tasks seems like it shouldn't hurt, but since that function is a deep part of the b2luigi core I want to be really sure that this change doesn't have unwanted or unexpected side effects. For that I'd like a second opinion from somebody who understands the code well. Sadly, due to PhD and paper deadlines I don't have much time to work on this at the moment.

meliache commented 1 year ago

Also, please install pre-commit so that the syntax/style checks we defined for this repository are run and automatically fixed when you commit; see also https://b2luigi.readthedocs.io/en/stable/advanced/development.html:

pip3 install pre-commit
pre-commit install
pre-commit run  # is also run automatically before each commit

In the automatic tests for this repo I see that some PEP 8 style checks are failing, e.g. due to whitespace around = in keyword arguments; pre-commit can fix that for you.

And if we decide to include this PR, an entry in CHANGELOG.md is also necessary.

MarcelHoh commented 1 year ago

Each worker runs an executable that traverses the full graph from the root task until it finds the task with the supplied task ID, of which there can only ever be one. For example: exec python3 steering_pipeline.py --batch-runner --task-id presel_and_flag_task_ssbar__0_0__0_1_____pnfs_desy_de__bf10ad67ac

The problem is that the task iterator can waste a lot of time traversing sections of the graph it has already gone down. If I require task A 1000 times with different parameters and each task A requires the same instance of task B, which in turn requires several tasks C and D, then run_as_batch_worker will check the B->C->D path of the graph 1000 times even though we already know that the task_id we are searching for is not within this path.
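To make that concrete, here is a hedged sketch of such a structure in plain luigi; the task names and the 1000 parameter values are illustrative, and the naive counting traversal below stands in for what the batch runner currently does, it is not b2luigi's actual code:

import luigi
from luigi.task import flatten


class TaskC(luigi.ExternalTask):
    pass


class TaskD(luigi.ExternalTask):
    pass


class TaskB(luigi.Task):
    def requires(self):
        return [TaskC(), TaskD()]


class TaskA(luigi.Task):
    index = luigi.IntParameter()

    def requires(self):
        return TaskB()


class Root(luigi.Task):
    def requires(self):
        return [TaskA(index=i) for i in range(1000)]


def count_visits(task, counter):
    """Naive recursive traversal without a cache, counting visits per task."""
    counter[task.task_id] = counter.get(task.task_id, 0) + 1
    for required_task in flatten(task.requires()):
        count_visits(required_task, counter)


counter = {}
count_visits(Root(), counter)
print(counter[TaskB().task_id])  # 1000: the B->C->D branch is traversed once per TaskA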

There is no rush on this from my end; all the best with the submission! I'm happy to keep using it and see if any issues come up. Sorry for the pre-commit issues, I must admit I was a bit lazy and just duplicated my local changes through the GitHub online editor.

meliache commented 1 year ago

Oh, now I understand which problem this solves :facepalm:, I misunderstood at first, so the problem was on my side, sorry. The issue is that the task graph is not a tree but a DAG, so multiple tasks within a single task graph can require the same task, and a simple recursive tree traversal revisits them. One could probably find textbook graph-traversal algorithms for this as well, but the cache you implemented seems good enough.

So in the case of a graph that looks like this (made with mermaid):

graph TD;
    Main-->TaskA;
    Main-->TaskB;
    Main-->TaskC;
    TaskA-->SomeExternalTask;
    TaskB-->SomeExternalTask;
    TaskC-->SomeExternalTask;

The task_iterator currently returns

[Main, TaskA, SomeExternalTask, TaskB, SomeExternalTask, TaskC, SomeExternalTask]

With your PR, it returns

[Main, TaskA, SomeExternalTask, TaskB, TaskC]

meliache commented 1 year ago

Sorry this PR was stale for a bit; I'm busy with my thesis, but right now I'm enjoying my procrastination Friday.

I think the code change you proposed is great, but you haven't reacted to my comments yet. However, I found that I can add commits to the PR, so I started tackling those points myself. Here's a list of the issues I saw with the PR; I crossed out those I have already solved, with explanations below:

The main thing is that I rewrote the function to avoid the already_seen_tasks parameter, which I think shouldn't be visible to the end user. But we still have to keep track of the already seen tasks during the iteration. The trick I used is to initialize the set in the function body and then create a second, temporary function inside it, which does the iteration and holds a reference to that set.

I also fixed the style issues, added a changelog entry and added you, @MarcelHoh, to the list of contributors in the manual. The one big missing thing is to add at least one unit test to make sure we didn't mess anything up. A good unit test also serves as documentation of the intended behaviour.
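A minimal sketch of that closure pattern (the same simplified stand-in as above, not the merged b2luigi code):

from luigi.task import flatten


def task_iterator(task):
    """Yield the task and all of its transitive requirements, each exactly once."""
    # The cache lives in the outer function body, so it never shows up in the
    # public signature; the inner generator closes over it.
    already_seen_tasks = set()

    def _iterate(current_task):
        yield current_task
        already_seen_tasks.add(current_task.task_id)
        for required_task in flatten(current_task.requires()):
            if required_task.task_id not in already_seen_tasks:
                yield from _iterate(required_task)

    return _iterate(task)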

meliache commented 1 year ago

In ded037027b7c45209102ed5d9bf01491bf15cebf I added a unit test which imo does a good job of defining the new task iterator order: https://github.com/nils-braun/b2luigi/blob/a370bb51a6ddd877eb12a912e55f79db7a3d6dd5/tests/core/test_utils.py#L233-L266 After confirming it works as expected I was confident enough to merge this, so now it's in.
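For readers who do not follow the link, a test in this spirit could look roughly like the sketch below; the import path, task names and assertions are assumptions for illustration, not a copy of the merged test:

import luigi

from b2luigi.core.utils import task_iterator  # assumed import path


class SomeExternalTask(luigi.ExternalTask):
    pass


class TaskA(luigi.Task):
    def requires(self):
        return SomeExternalTask()


class TaskB(luigi.Task):
    def requires(self):
        return SomeExternalTask()


class TaskC(luigi.Task):
    def requires(self):
        return SomeExternalTask()


class Main(luigi.Task):
    def requires(self):
        return [TaskA(), TaskB(), TaskC()]


def test_task_iterator_visits_shared_requirement_only_once():
    visited = [task.task_id for task in task_iterator(Main())]
    # Every task appears exactly once, even though SomeExternalTask is
    # required by TaskA, TaskB and TaskC.
    assert len(visited) == len(set(visited))
    assert set(visited) == {
        Main().task_id,
        TaskA().task_id,
        TaskB().task_id,
        TaskC().task_id,
        SomeExternalTask().task_id,
    }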

MarcelHoh commented 1 year ago

Hi, sorry for my silence, I got quite busy :(. Thanks for all your work on this!

meliache commented 1 year ago

No problem, same for me. You made a really good observation and came up with a good solution, but since this touches the core of b2luigi I wanted to make sure it does what's expected and is well tested. The problem itself is something that could almost be a programming puzzle, like a task on LeetCode, so that was fun.