
Neat features to be added #3

Open mbruns91 opened 4 weeks ago

mbruns91 commented 4 weeks ago

Here I'll just list the things I would like to add as functionality. This list is likely to grow over the next hours and days.

mbruns91 commented 3 weeks ago

BTW: if you have any cool ideas, @samwaseda or @liamhuber, I'd suggest this is the place to tell me!

One thing I would really like to do in the future is to not rely on GitHub Actions for this at all and make it a more standalone solution. I just haven't found the right tool or platform yet.

samwaseda commented 3 weeks ago

> Allow suppression of triggering the CI on a per-push basis

I'm not really sure if I understand this correctly. You mean like the label solution that you implemented now?

samwaseda commented 3 weeks ago

I don't know how to solve the problem, but one thing I noticed while creating the empty workflow is the fact that there's a lot of copy and paste with no particular benefit. In the current example, I had to write the same function three times: 1. in the workflow; 2. in the import statement; 3. in the python file. Ideally 2 and 3 should be automatically done once the workflow is written, but I don't have a clear idea how it could be done.

mbruns91 commented 3 weeks ago

> > Allow suppression of triggering the CI on a per-push basis
>
> I'm not really sure if I understand this correctly. You mean like the label solution that you implemented now?

I mean more that you could specify a flag or something when doing `git push` so that the CI is not triggered. This way you don't have to remove the label first (I would 100% forget to do that).

samwaseda commented 3 weeks ago

> I mean more that you could specify a flag or something when doing `git push` so that the CI is not triggered. This way you don't have to remove the label first (I would 100% forget to do that).

Ah, I see. Is there a possibility to kill jobs? That might also be useful when someone pushes by mistake.

liamhuber commented 3 weeks ago

> I don't know how to solve the problem, but one thing I noticed while creating the empty workflow is the fact that there's a lot of copy and paste with no particular benefit. In the current example, I had to write the same function three times: 1. in the workflow; 2. in the import statement; 3. in the python file. Ideally 2 and 3 should be automatically done once the workflow is written, but I don't have a clear idea how it could be done.

You mean that you had to write the name of the function three times? By adding the submodules to the parent `__init__.py` we can cut one of these out, so the final invocation is a bit longer and looks like `nodes.speed.get_speed` or whatever.

This concern seems pretty fundamental to the programming language, though? If we define something in one module and use it in another, there is a floor on how few times we can write the thing's name.

At a high level, I would encourage use of this repo to minimize the repetition that you're concerned about -- the workflow at the notebook level should be composed of as few nodes as possible while still making sense to distribute across the participants. In any realistic situation, the current workflow nodes are almost certainly all going to be macros, and it's reasonable that these macros themselves contain macros (and so on to whatever depth is useful for the problem).

Basically, good macro composition should reduce re-writing the same node name all over the place, but any reuse of a node will require a bare minimum of two statements (definition and use); outside the definition module, whether we want a third statement for a direct import, or to rely on more indirect pathing that clumps a bunch of imports together, is totally situational.
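
To make the `__init__.py` re-export concrete, here is a minimal sketch (module and function names borrowed from the travel-time example, so purely illustrative, not graphathon's actual layout):

```python
# nodes/speed.py -- the definition; the first place the name appears
def get_speed(vehicle: str) -> float:
    """Toy lookup standing in for a real node function."""
    return {"bike": 20.0, "car": 100.0}.get(vehicle, 5.0)


# nodes/__init__.py -- re-export the submodule so one import covers everything
from . import speed


# notebook -- the second (and last) place the name appears; no per-function import
import nodes

print(nodes.speed.get_speed("bike"))  # -> 20.0
```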

liamhuber commented 3 weeks ago

> > I mean more that you could specify a flag or something when doing `git push` so that the CI is not triggered. This way you don't have to remove the label first (I would 100% forget to do that).
>
> Ah, I see. Is there a possibility to kill jobs? That might also be useful when someone pushes by mistake.

If it's a GitHub workflow, I'm pretty sure it can be manually cancelled from the Actions menu. It might be neat to be able to do it via labels too, although this would require waiting for the label workflow to trigger and rerun, etc., so depending on how fast/slow the notebook is it may or may not be effective.

More generally, a bunch of this stuff could maybe be triggered by bot commands? Like:

- `@graphathonbot run workflow`
- `@graphathonbot stop workflow`
- `@graphathonbot render latest workflow` (show it in a comment)
- etc.
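
Purely hypothetical, but the dispatch side of such a bot could be as simple as this sketch (the bot handle and all command names above are invented; nothing here exists yet):

```python
# Hypothetical sketch of the bot's command dispatch; @graphathonbot and the
# commands are made up for illustration.
BOT = "@graphathonbot"

COMMANDS = {
    "run workflow": lambda: print("triggering a CI run..."),
    "stop workflow": lambda: print("cancelling the active run..."),
    "render latest workflow": lambda: print("posting the rendered graph..."),
}


def handle_comment(body: str) -> None:
    """Run every bot command mentioned in an issue-comment body."""
    for command, action in COMMANDS.items():
        if f"{BOT} {command}" in body:
            action()


handle_comment(f"{BOT} run workflow")  # -> triggering a CI run...
```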

samwaseda commented 3 weeks ago

Yeah, a bot might be the right solution. I know that @editorialbot from JOSS makes a .pdf from .md and posts it in a comment. I'm sure something similar can be done here.

samwaseda commented 3 weeks ago

I just had a super inspiring discussion with @jan-janssen, but before I start talking about it, I want to suggest keeping this issue open for brainstorming; as soon as some ideas become more solid, we should open separate issues to make the objectives clearer.

samwaseda commented 3 weeks ago

Coming back to the discussion with @jan-janssen: On GitHub we obviously have the problem that we may or may not be able to run the workflow, because some of the nodes could take extremely long. In order to avoid this problem, we should make use of pack and unpack, so that each user has the possibility to upload the nodes and the results obtained in their local environment. Here, it is crucial to be able to do the following:

- The data contain only the results; ideally only inputs and outputs, but potentially just the hdf5 file
- Uploading should be done automatically from the notebook, meaning we need the python interface of Git

With this, the nodes and their inputs and outputs will be provided by individual users, and GitHub is responsible for making sure that the workflow runs, even if there's a possibility that the inputs and outputs are not compatible across different nodes (because the results are obtained by different users).

mbruns91 commented 3 weeks ago

> > I mean more that you could specify a flag or something when doing `git push` so that the CI is not triggered. This way you don't have to remove the label first (I would 100% forget to do that).
>
> Ah, I see. Is there a possibility to kill jobs? That might also be useful when someone pushes by mistake.

Yes, already-running jobs can be killed in the GitHub UI 👍

liamhuber commented 3 weeks ago

@samwaseda, this is a really good point and I think we're already very close to this. Already the workflows (and other nodes) will check to see if they have a commensurate save file (using tinybase h5), and if so they load themselves. Alternatively -- and I plan for this to be the main way until a universal h5 solution exists -- you can just pickle the workflow/node and manually unpickle it later. In both cases, if you run the notebook, save it in either format, and push that to GitHub, then when I pull it down to my branch and load it, pickle/h5 should instantiate the most up-to-date versions of my classes for my nodes while still getting all the IO data from your latest run.
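
For the pickle route, the round trip is just a minimal sketch like the following (file name hypothetical, and assuming the workflow object pickles cleanly as described):

```python
import pickle

from pyiron_workflow import Workflow

# Your side: build and run the workflow, then dump it next to the notebook
wf = Workflow("calculate_travel_time")
# ... add nodes, run wf() ...
with open("calculate_travel_time.pkl", "wb") as f:
    pickle.dump(wf, f)

# My side, after pulling your branch: unpickling rebuilds the graph from my
# current node classes while carrying the IO data from your latest run
with open("calculate_travel_time.pkl", "rb") as f:
    wf = pickle.load(f)
```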

The missing thing to make this work is an already-planned feature: the ability to hash input and return existing output when `run` is called and the current input hash matches the last-used hash. The plan is for this to be opt-in, but it would allow GitHub to jump over any nodes we're flagging as heavy.
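
Just to spell out the logic (this is not the planned API, only a sketch of the caching behaviour):

```python
import hashlib
import pickle


def input_hash(inputs: dict) -> str:
    """Deterministically digest the input values (sketch; real impl TBD)."""
    return hashlib.sha256(pickle.dumps(sorted(inputs.items()))).hexdigest()


class HashCachingNode:
    """Opt-in wrapper: skip re-running when inputs match the last-used hash."""

    def __init__(self, func):
        self.func = func
        self._last_hash = None
        self._last_output = None

    def run(self, **inputs):
        h = input_hash(inputs)
        if h == self._last_hash:
            return self._last_output  # heavy node skipped, e.g. on GitHub
        self._last_output = self.func(**inputs)
        self._last_hash = h
        return self._last_output
```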

The final catch is that when I load your saved workflow, if I want to test my new node, I'll probably need to reset its input hash to make sure it actually runs! I guess this can be made easy, like:

```python
wf = Workflow("calculate_travel_time")
wf.speed.clear_hash()  # force the (planned) hash check to miss so this node reruns
wf()
wf.save()
```

The catch is that I'd want to be playing around like that in some gitignored notebooks, while the shared notebook needs to look something like:

```python
wf = Workflow("calculate_travel_time")
if len(wf) == 0:  # nothing loaded
    wf.vehicle = get_vehicle(vehicle)
    wf.distance = get_distance(start, end, wf.vehicle)
    wf.speed = get_speed(wf.vehicle)
    wf.time = get_time(wf.distance, wf.speed)
wf()
```

or similar.

> - The data contain only the results; ideally only inputs and outputs, but potentially just the hdf5 file

I guess the data we want to push is the workflow savefile, whether that is h5 or pickle, or whatever

> - Uploading should be done automatically from the notebook, meaning we need the python interface of Git

Possibly, but maybe having the workflow savefile added to the git tracking will be sufficient. The only big pain I see is merge conflicts and having to force-accept the remote version of the save file.
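
If the automatic-upload route does turn out to be wanted, GitPython would be the obvious python interface to git; a sketch (savefile name hypothetical):

```python
from git import Repo  # GitPython

repo = Repo(".")  # the clone the notebook is running in
repo.index.add(["calculate_travel_time.pkl"])  # hypothetical savefile name
repo.index.commit("Update workflow savefile")
repo.remote(name="origin").push()
```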

Again, graphathon "knowing" might be as simple as adding a single savefile to the git tracking! To avoid this setup step we could even already have an empty save file of the correct name in the repo, so it's already tracked when people use the template/clone the downstream repo.

The only big catch I see here, and I think it applies to basically any formulation, is that if a node is too slow for GitHub to run, there is some chance it also outputs too much data for GitHub to hold. This will become an issue sooner for the more straightforward "just upload the workflow savefile" approach (which stores IO down in the subgraphs) than it would for a more sophisticated approach where we explicitly push/pull only the IO of the most expensive nodes -- but even that just buys us more freedom, and does not fundamentally solve the concern unless you're very careful that the highest-level nodes have super simple output.

The input-hashing-to-avoid-rerun is planned soon -- next week if I get to it, otherwise August. Black/whitelisting what data gets stored when you `wf.save()` is already an open issue over on pyiron_workflow, but unlike the hashing I have no concrete timeline for getting to it -- for the foreseeable future the plan is to dump all the data on save.