spotify / luigi

Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
Apache License 2.0
17.71k stars 2.39k forks source link

SVG visualisation breaks on dependencies between tasks with the same name #3249

Open ewallace-RI opened 1 year ago

ewallace-RI commented 1 year ago

I recently updated from 3.0.0 to 3.3.0 and found that in the SVG visualiser, my existing graphs are either hopelessly scrambled or fail to render anything except the top task, which turns out to be due to changes made in this PR. There are two basic issues:

  1. Visualisation is broken in the presence of dependencies between tasks with the same name, which the original PR intended to have fall back to the old style. There are two failure modes (details below) - either the same-name dependency is not recognized and the new style gives a nonsensical representation, or it is recognized, and the fallback to the old style fails to render anything.

  2. The new style of visualisation is much less readable in my use cases. I can see how it's an improvement for the example case in the PR, but for a graph where many different tasks with different names are expected to run concurrently, forcing each task onto a separate vertical layer (especially with randomized horizontal placement) makes the graph extremely difficult to parse visually. There should at least be an option to revert to the previous style, independent of any automatic behavior.

The two failure modes in point 1 depend on whether two tasks with the same name depend on each other directly, or only indirectly, separated by other tasks. The indirect case might look something like this:

import luigi

class TestTask(luigi.Task):
    def run(self):
        with self.output().open('w') as f:
            pass

    def output(self):
        return luigi.local_target.LocalTarget(path=f'{self.task_id}.done')

class Foo(TestTask):
    def requires(self):
        return Bar(second_time=True)

class Bar(TestTask):
    second_time = luigi.BoolParameter()
    def requires(self):
        if self.second_time:
            return Baz()

class Baz(TestTask):
    def requires(self):
        return Bar()

Here, we want to run Bar once, then run Baz, then run Bar again, then run Foo. The old style of visualization correctly renders this as a straight line: image

In the current version, Bar(second_time=False) is not recognized as a dependency of Bar(second_time=True), and they are forced to the same depth, breaking the directedness of the graph: image

In the case of a direct dependency, it gets worse. If we remove the intermediary Baz() task and let Bar(second_time=True) directly require Bar(second_time=False), like so:

import luigi

class TestTask(luigi.Task):
    def run(self):
        with self.output().open('w') as f:
            pass

    def output(self):
        return luigi.local_target.LocalTarget(path=f'{self.task_id}.done')

class Foo(TestTask):
    def requires(self):
        return Bar(second_time=True)

class Bar(TestTask):
    second_time = luigi.BoolParameter()
    def requires(self):
        if self.second_time:
            return Bar()

we get this: image

Here, the same-name dependency is recognized, triggering the intended fallback to the old behavior, which computes the positions of all child tasks as (NaN, NaN):

image

starhel commented 4 months ago

I have encountered the same issue - I have tasks that iteratively trigger themselves (aggregating data in 7-day buckets -> 28-day buckets -> N-day buckets). @dlstadther would there be a chance for you to take a look at it when you have a moment (as reviewer of https://github.com/spotify/luigi/pull/3122)?

dlstadther commented 4 months ago

Based on this issue, i understand there to be rendering issues with the SVG visualization here in Luigi. I have no insight to provide towards a solution. I welcome any contributions to the project and I'll do my best to provide reviews (though time has been in short supply for a few years now as my day job no longer uses Luigi).

ewallace-RI commented 4 months ago

To my mind, the solution would simply be to revert the changes from https://github.com/spotify/luigi/pull/3122, which broke the existing functionality for the sake of someone's aesthetic preference in their use case. Given that the change had been in place for 2 years by the time I upgraded and noticed it, I figured that would be an unpopular solution, and opted to just overwrite that .js file in my local installation.

I didn't see an obvious way to make the fallback to the old behavior in the presence of dependencies between same-named tasks work while keeping the new behavior otherwise, and since the new behavior is useless to me, I didn't have much motivation to put more effort into finding a fix that would preserve it.

dlstadther commented 4 months ago

@joooeey , being the author of the PR which is causing issues for the use cases mentioned above by @ewallace-RI and @starhel , do you see a clear path to resolution?

starhel commented 4 months ago

I've fixed case with (NaN, NaN) in; https://github.com/spotify/luigi/pull/3287 For now I have no time to find solution for non-directly dependent tasks.

joooeey commented 4 months ago

I don't use Luigi anymore, so I don't have the time to contribute a tested solution. However, I can see that this line in the visualizer code only checks for direct dependencies to tasks of the same name. To also capture indirect dependencies, the $.each loop would have to be adjusted to build up an array of known names. Then child_node.name can be compared with that array.

3287 is also necessary and it should be checked how that PR interacts with the change I propose in the previous paragraph.