mbruns91 / graphathon


How to brainstorm using `graphathon` #5

Open samwaseda opened 3 months ago

samwaseda commented 3 months ago

Sorry @mbruns91 for jumping between GitHub and rocket chat. I realized that the discussion should involve @liamhuber.

It's basically an extension to my previous comment, but I would like to ask a more specific question: How can we make it easier for users to start brainstorming? When we took the example of the travelling time between Berlin and Paris, we first thought of the distance between the two, but then realized that we should also include the vehicle, and so on. I think this example mapped the actual process pretty well. In particular, people usually create a workflow dynamically. The problem is that it is currently somewhat strenuous to create a temporary workflow. For example, making just one component get_distance requires at least three steps:

That's kinda much work for just drawing a workflow graph. In particular, I think we need some text representation of a workflow for this purpose, which would at least give the most basic draft.

One possibility I see is to abuse the DOT notation, and do something like this:

input -> get_distance [label=start];
input -> get_distance [label=end];
input -> get_vehicle [label=vehicle];
get_vehicle -> get_speed [label=vehicle];
get_speed -> get_time  [label=speed];
get_vehicle -> get_distance [label=vehicle];
get_distance -> get_time [label=distance];
get_time -> output [label=time];

Ok I have to admit this is also kinda rubbish, but since there are DOT parsers available, we could think about using one. At least I managed to make Graphviz export a figure:

[Figure: graphviz-2, the Graphviz rendering of the graph above]
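Just to gauge how much parsing such an edge list would need, here is a minimal sketch using only the standard library. It assumes every line has exactly the shape `source -> target [label=name];`, which is much narrower than real DOT; `EDGE_PATTERN`, `parse_edges` and `collect_inputs` are hypothetical names, not part of graphathon or any DOT library.

```python
import re
from collections import defaultdict

# The edge list from the comment above, verbatim.
DOT_EDGES = """
input -> get_distance [label=start];
input -> get_distance [label=end];
input -> get_vehicle [label=vehicle];
get_vehicle -> get_speed [label=vehicle];
get_speed -> get_time  [label=speed];
get_vehicle -> get_distance [label=vehicle];
get_distance -> get_time [label=distance];
get_time -> output [label=time];
"""

# Hypothetical pattern: only matches "source -> target [label=name];"
EDGE_PATTERN = re.compile(r"(\w+)\s*->\s*(\w+)\s*\[label=(\w+)\]\s*;")


def parse_edges(text):
    """Return a list of (source, target, label) tuples."""
    return EDGE_PATTERN.findall(text)


def collect_inputs(edges):
    """Map each node to the labels of its incoming edges, in order."""
    inputs = defaultdict(list)
    for _source, target, label in edges:
        if target != "output":  # "output" is a sink marker, not a node
            inputs[target].append(label)
    return dict(inputs)


edges = parse_edges(DOT_EDGES)
node_inputs = collect_inputs(edges)
for node, labels in node_inputs.items():
    print(f"{node}({', '.join(labels)})")
```

The printed lines read like function signatures, which is roughly the information a skeleton generator would need. A real tool would more likely rely on an existing DOT parser than on a regular expression.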

@mbruns91 supported more like a function-based approach, but I couldn't formulate it in a concrete manner, so I'm gonna rather ask him for details.

Other ideas?

samwaseda commented 3 months ago

Maybe to be more precise: we could then implement a snippet in graphathon, which would load the string by something like:

grapher = DotParser(my_graph, directory="nodes")
grapher.export_python_files()
print(grapher.imports)

Output:

from nodes.distance import get_distance
from nodes.vehicle import get_vehicle
from nodes.speed import get_speed
from nodes.time import get_time

Something along these lines.
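A rough sketch of what the export step could do, assuming the naming rule implied by the output above (node `get_distance` lives in module `distance`). `DotParser` does not exist yet, so the helpers `module_name`, `import_line` and `export_python_files` are purely illustrative, and the files are written to a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path


def module_name(node_name):
    """Assumed naming rule: 'get_distance' lives in module 'distance'."""
    prefix = "get_"
    return node_name[len(prefix):] if node_name.startswith(prefix) else node_name


def import_line(node_name, directory="nodes"):
    """Build the import statement a user would paste into the notebook."""
    return f"from {directory}.{module_name(node_name)} import {node_name}"


def export_python_files(node_names, root, directory="nodes"):
    """Write one stub module per node and return the created paths."""
    package = Path(root) / directory
    package.mkdir(parents=True, exist_ok=True)
    (package / "__init__.py").touch()
    paths = []
    for name in node_names:
        path = package / f"{module_name(name)}.py"
        # Empty function for the user to fill in later
        path.write_text(f"def {name}():\n    raise NotImplementedError\n")
        paths.append(path)
    return paths


node_names = ["get_distance", "get_vehicle", "get_speed", "get_time"]
imports = "\n".join(import_line(name) for name in node_names)
print(imports)

with tempfile.TemporaryDirectory() as root:
    created = [path.name for path in export_python_files(node_names, root)]
print(created)
```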

liamhuber commented 3 months ago

I think I must be missing something fundamental to your objection, because to me this really feels like a nothing-burger.

Suppose we want to store our logic and our use-case in separate files, then in python, no matter what, we have three steps: 1) Write the logic 2) Import the logic 3) Use the logic

Three steps is just the absolute bare minimum for separating things into separate files.

Now, we could still take shortcuts. In nodes.__init__ you could have import nodes.distance; import nodes.vehicle; .... Then in the notebook we can just say import nodes and then we have a single line for (2) at the cost of (3) being slightly more verbose. At the end of the day though, if you want to split things into separate files (which IMO is stellar for git collaboration because it reduces merge conflicts), you have to go through these steps.
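To make the `__init__` shortcut concrete, here is a small self-contained sketch that builds a throwaway package in a temporary directory and imports it with a single statement. The package name `trip_nodes` and the function bodies are made up for the illustration; in a real project these would simply be files in the repository.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as root:
    package = Path(root) / "trip_nodes"  # hypothetical package name
    package.mkdir()
    (package / "distance.py").write_text(
        "def get_distance(start, end):\n"
        "    return f'distance({start}, {end})'  # dummy body\n"
    )
    (package / "vehicle.py").write_text(
        "def get_vehicle(vehicle):\n"
        "    return vehicle\n"
    )
    # The shortcut: the package imports its own submodules once
    (package / "__init__.py").write_text(
        "import trip_nodes.distance\n"
        "import trip_nodes.vehicle\n"
    )

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        # A single import gives access to every submodule (step 2) ...
        trip_nodes = importlib.import_module("trip_nodes")
        # ... at the cost of a slightly more verbose call (step 3)
        result = trip_nodes.distance.get_distance("Berlin", "Paris")
    finally:
        sys.path.remove(root)

print(result)
```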

Another attack would be at the startup collaboration phase. When you all sit down together to talk about what the workflow should do, you could draft the nodes right in the notebook together. For example, consider a two-person team decomposing work into two stages. You could start by writing something like this directly in the notebook together, synchronously:

from dataclasses import dataclass
from pyiron_workflow import as_function_node, as_macro_node, Workflow

@dataclass
class OurCustomDataStructure:
    field1: list[int]
    field2: dict[str, tuple[int, int, int]]

@as_macro_node("intermediate_result")
def DoPartOne(self, user_input1: int, user_input2: float) -> OurCustomDataStructure:
    # spoof the output regardless of input
    return OurCustomDataStructure(field1=[], field2={})

@as_macro_node("final_result")
def DoPartTwo(self, user_input3: str, intermediate: OurCustomDataStructure) -> str:
    return "spoofed_output"

wf = Workflow("our_collab")
wf.p1 = DoPartOne(42, 3.13)
wf.p2 = DoPartTwo("foobar", intermediate=wf.p1)
wf()

Then only after everyone's happy with the first draft would you move things apart into separate files and reduce the workflow to

from pyiron_workflow import Workflow

import our_nodes as on

wf = Workflow("our_collab")
wf.p1 = on.liam.DoPartOne(42, 3.13)
wf.p2 = on.sam.DoPartTwo("foobar", intermediate=wf.p1)
wf()

At least while that live, synchronous collaboration for the high-level interfaces is happening, all the mock-ups could be defined locally.

samwaseda commented 3 months ago

> Another attack would be at the startup collaboration phase. When you all sit down together to talk about what the workflow should do, you could draft the nodes right in the notebook together. For example, consider a two-person team decomposing work into two stages. You could start by writing something like this directly in the notebook together, synchronously:

Yeah I know that this example works, but the reality is that it’s way too long for brainstorming. I think your assumption here is that people know what kind of nodes are needed for a workflow, but that’s generally not the case. So what people usually do is draw a graph on something like a whiteboard and then map it onto code (which would be the part that you presented). With my suggestion, I would like to replace both the whiteboard and the manual mapping to code.

In other words, the weakness of pyiron_workflow in the graphathon context is that we cannot draw a graph as long as the functions are not defined. That’s not the end of the world, but it raises the technical hurdle, because people need to know how to write nodes, and it’s a lot of writing.

This being said, whether it’s gonna be useful, I don’t know, but it appears to me like a reasonable amount of work, because there are parsers for the text-represented graphs.

samwaseda commented 3 months ago

Now the more I think about it, the more I’m convinced that this feature would make an impact. Especially when I think of all the presentations where I talked about TemplateJob and PythonTemplateJob: For us developers it appears so easy because the user only has to fill in some functions, but hardly anyone understood it because they were some esoteric pyiron functions. Now looking at the code you presented, I feel the same danger, because it requires some basic understanding of pyiron_workflow, which makes it difficult to sell to the PMD people. In my opinion, it would be super impactful if we could say “look, you just need to write a graph according to the DOT notation, and then graphathon will generate empty functions which you guys have to fill”. That’s a super simple story to absorb because no pyiron knowledge would be required in that case.

liamhuber commented 3 months ago

Ok, I think I see where you're coming from. I agree the brainstorming can be done in whatever "language" you want, including a whiteboard. At the end of the day, to represent the workflow with pyiron_workflow, you need to convert that brainstorming into node code and populate it with functionality. This step could be accelerated with a converter, e.g. DOT --> pyiron_workflow skeletons. In case you have an existing audience who is already proficient with a particular brainstorming language, doing it there and converting it instead of starting with pyiron_workflow directly is completely reasonable.

So I don't object, and if this helps you communicate with PMD folks then it's a good idea. However, before you put a bunch of work into making such a converter, I would be cautious of the following downsides:

If it's a python vs language-my-audience-likes argument, and you think a DOT->python converter would help, go for it. I can especially see it being useful for convincing any PIs who don't plan on actually participating in the coding themselves but may have strong feelings about preferred representations for the brainstorming. I just don't want it to be a crutch if the "important point" above is the real problem.

samwaseda commented 3 months ago

> After brainstorming, the actual work needs to begin and at least one person per planned working group had better be familiar enough with pyiron_workflow to write the sort of bare-bones dummy nodes & workflow above -- hopefully the python/workflow syntax is thus not a barrier at all to a decent fraction of participants

That's exactly what the converter should take over: from the example that I made above, the converter should know that there must be the nodes get_distance, get_vehicle etc., and convert them to empty pyiron_workflow nodes accordingly (i.e. export them into files and update __init__.py). So from the moment the DOT graph is ready, people should immediately be able to start working on their own nodes.
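As a sketch of what "empty pyiron_workflow nodes" might look like when generated, the snippet below renders stub source text from the edge labels of the DOT example. The input and output names in `NODE_IO` are read off that example by hand, `STUB_TEMPLATE` and `render_stub` are hypothetical, and the decorator usage simply copies the `@as_..._node("label")` pattern shown earlier in this thread. The stubs are only generated and syntax-checked here, never imported, so pyiron_workflow does not need to be installed to try it.

```python
# Incoming edge labels and outgoing edge label per node, from the DOT example
NODE_IO = {
    "get_distance": (["start", "end", "vehicle"], "distance"),
    "get_vehicle": (["vehicle"], "vehicle"),
    "get_speed": (["vehicle"], "speed"),
    "get_time": (["speed", "distance"], "time"),
}

# Hypothetical template for one generated file
STUB_TEMPLATE = '''from pyiron_workflow import as_function_node


@as_function_node("{output}")
def {name}({args}):
    raise NotImplementedError("fill in {name}")
'''


def render_stub(name, inputs, output):
    """Return the source text of an empty node for the user to fill in."""
    return STUB_TEMPLATE.format(name=name, args=", ".join(inputs), output=output)


stubs = {
    name: render_stub(name, inputs, output)
    for name, (inputs, output) in NODE_IO.items()
}
print(stubs["get_time"])

# Check that the generated text is at least syntactically valid Python
for source in stubs.values():
    compile(source, "<stub>", "exec")
```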

> If reading the python code is really so much worse than reading DOT, we should take time to seriously consider how we can make the python (or drawn graph) more readable (THIS IS THE IMPORTANT POINT)

The point here is that reading python code is much worse than reading a graph. The fundamental problem with python code, or any programming code is that there's no intrinsic connection between different components and therefore you have to retain a lot of information. In your example, the actual workflow part is in fact super easy to read (and especially you can immediately export a graph, so it satisfies my requirements):

wf = Workflow("our_collab")
wf.p1 = DoPartOne(42, 3.13)
wf.p2 = DoPartTwo("foobar", intermediate=wf.p1)
wf()

The problem is, there's no way to represent a workflow with only this part, as you did in the snippet. You will have to write functions above, which tend to require far more lines than the actual workflow part. That makes it extremely difficult to read.

I don't think there's something fundamental we should change in pyiron_workflow. It's just the difference between a human brain and how computers work: Computers require actual numbers and well defined objects like functions, in order to compose a workflow, while we humans want a workflow based on an abstract graph, without having to think about what the actual components should look like. So from my point of view, it is totally fine that pyiron_workflow requires the function definition, but I would like to explore the possibility of making a workflow with abstract nodes. And I think DOT -> pyiron_workflow is one way to do it. As you already stated, there are also problems with type hinting and auto-complete, so I'm open to other suggestions when it comes to an actual tool to be employed, but I would still love to see a tool based on the text or graph representation of a workflow that can be converted into a pyiron_workflow workflow automatically.

liamhuber commented 3 months ago

@samwaseda, you've persuaded me this is a nice idea. At a high level I still have some minor concern about this giving a false sense of security, where people get started with this, then need to actually start writing new nodes, and suddenly hit a hard wall because they need to know something about the framework. I hope this can be managed by good communication coming with the tool.

Just for shared language, I've been thinking of this as a workflow mock-up tool and will use the term "mock" in this context a bunch. (Or, "workflow mock-up developer" -- WMD (unless it's still too soon? 😝 😬 ))

Before you start writing the tool itself, I think it would be helpful to have answers to some of the following:

Here's some pseudocode for doing it entirely within python (and indeed pyiron_workflow), but with no concept of type hinting:

class MockNode:
    def __init__(self, *input_names, output_names=()):
        # Mock up IO based on args
        # Just like real nodes, can reference the entire node as long as there's
        # only one output, otherwise we need to give output_names
        # No type hinting in this idea so far
        # Maybe the output label gets automatically updated based on node label
        ...  # pseudocode: body intentionally left open

# MockNode is a proposed class, not an existing pyiron_workflow export
from pyiron_workflow import Workflow, MockNode as Mock
# Or just use Workflow.create.MockNode

wf = Workflow("brainstorming")
wf.vehicle = Mock("vehicle")
wf.distance = Mock("start", "end", wf.vehicle)
wf.speed = Mock(wf.vehicle)
wf.time = Mock(wf.distance, wf.speed)

wf.draw()  # Just like usual
wf.compile()  # Maybe taking optional args for what goes in which file?

This comes out as 6 lines (7 including import) compared to DOT's 8, because the inputs (strings) and outputs (unconnected node(s)) get to exist implicitly instead of being explicitly defined, at the 1-line cost of defining them all inside the scope of a workflow. That line advantage grows with the number of inputs though.

Maybe we need a mock workflow class too and not just node, so that the core workflow class doesn't need to concern itself with knowing anything about the mocking logic.
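For illustration, a pure-Python sketch of the mock node plus mock workflow idea, with no dependency on pyiron_workflow at all. `MockNode` and `MockWorkflow` here are stand-ins written for this example, not existing classes; the workflow labels nodes on attribute assignment and can dump the resulting graph as DOT, while type hints and multiple outputs are left out.

```python
class MockNode:
    """Placeholder node: named inputs, a single output, no behaviour."""

    def __init__(self, *inputs):
        self.label = None  # set when the node is attached to a MockWorkflow
        self.inputs = inputs  # strings (free inputs) or other MockNodes


class MockWorkflow:
    """Hypothetical container that labels nodes on attribute assignment."""

    def __init__(self, label):
        # Bypass our own __setattr__ for the bookkeeping attributes
        object.__setattr__(self, "label", label)
        object.__setattr__(self, "nodes", {})

    def __setattr__(self, name, value):
        if isinstance(value, MockNode):
            value.label = name
            self.nodes[name] = value
        object.__setattr__(self, name, value)

    def to_dot(self):
        """Render the mocked graph as DOT text."""
        lines = [f"digraph {self.label} {{"]
        for node in self.nodes.values():
            for source in node.inputs:
                if isinstance(source, MockNode):
                    lines.append(f"  {source.label} -> {node.label};")
                else:
                    lines.append(f"  input -> {node.label} [label={source}];")
        lines.append("}")
        return "\n".join(lines)


wf = MockWorkflow("brainstorming")
wf.vehicle = MockNode("vehicle")
wf.distance = MockNode("start", "end", wf.vehicle)
wf.speed = MockNode(wf.vehicle)
wf.time = MockNode(wf.distance, wf.speed)
print(wf.to_dot())
```

Keeping the mocking logic in its own workflow class, as suggested above, means the real workflow class never has to know about it, and the DOT output could feed the same drawing or skeleton-generation step as a hand-written graph.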

samwaseda commented 3 months ago

I guess yours is fairly close to @mbruns91’s function idea. My concern is that it is dangerously similar to the actual pyiron_workflow, and people might not understand whether we are making a mock workflow or composing a functioning workflow. Maybe we can reach the point that the distinction doesn’t matter (although I’m a bit doubtful given the input/output tag requirement in the mock workflow), but otherwise I would prefer something that looks different enough.

I did indeed give a strong impression that I support DOT, but in reality neither I nor anyone from PMD has talked about it before. I just see it as a good candidate, because it’s a text representation of a graph, and there are existing parsers.

As long as this repository is not part of pyiron, I think it’s safe to try out different possibilities, so I’m gonna make a suggestion today or maybe this week.