networkx / nx-parallel

A networkx backend that uses joblib to run graph algorithms in parallel.
BSD 3-Clause "New" or "Revised" License

Possible approaches for automating the update of the `get_info` function #62

Closed Schefflera-Arboricola closed 3 weeks ago

Schefflera-Arboricola commented 4 months ago

Objective

Devising an efficient way to automatically update the get_info function in the _nx_parallel/__init__.py file for every PR and/or commit (to the main branch) that adds or modifies any documentation.

Current approach

Manually run script.sh to update the __init__.py and then make a PR/commit to the main branch.

Approach 1

Adding a pre-commit hook that runs script.sh and updates the __init__.py file (ref. PR https://github.com/networkx/nx-parallel/pull/55)

Issue with this approach:

If we merge a PR that updates __init__.py, then all previously opened PRs (doc-related or not) will have merge conflicts on the _nx_parallel/__init__.py file (ref. https://github.com/networkx/nx-parallel/pull/55#issuecomment-2026804506).

Approach 2

Trigger a GitHub workflow when the maintainer initiates the "merge" action on a PR. The workflow would run script.sh, check whether the __init__.py file has changed, and if so push a commit at the end of the PR before the other tests run and the PR is merged.

Issue with this approach:

There is no "merge" trigger available for GitHub workflows (ref. https://stackoverflow.com/questions/60710209/trigger-github-actions-only-when-pr-is-merged). In the referenced question, one of the answers suggests using:

on:
  pull_request:
    types:
      - closed

jobs:
  if_merged:
    # runs only when the PR was actually merged, not merely closed
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4  # added here: the repo must be checked out for script.sh to exist
      - run: |
          sh _nx_parallel/script.sh

But then where would that "Update __init__.py" commit go after script.sh is run? To the main branch? The PR is already closed (merged), and pushing to main would overpopulate the repository's commit history with a lot of "Update __init__.py" commits.

Approach 3

At the time of approval, script.sh is run and the "Update __init__.py" commit is added to the PR.

Issue with this approach:

  1. I tried to implement this here: https://github.com/Schefflera-Arboricola/nx-parallel/pull/2. The workflow is triggered only at the time of approval, so the maintainer(s) would have to make sure to approve the PR right before merging it, in case there were any doc-change suggestions in an earlier review.
  2. Adding this additional commit at the end of the PR could be done either from a maintainer's account or through a GitHub bot account. I'd prefer a bot account, since we could reuse it in the future, e.g. for maintaining the timings folder by automatically running the heatmap on the VM and adding it to that folder. Or should we update all the heatmaps along with the benchmarks? I think this approach might be good, but I wanted your thoughts before I start looking into how GitHub bots are set up.

Other approaches

Another approach could be to trigger a workflow every time a PR gets merged and add an update commit to the main branch, but that would probably overpopulate the repository's commit history with a lot of "Update __init__.py" commits. Alternatively, we could have a scheduled workflow that runs script.sh and updates get_info, like a nightly wheel, or trigger it manually for specific PRs only.

Let me know what you think of all this and what will be the best thing to do here.

@dschult @MridulS

Thank you :)

dschult commented 4 months ago

Nice summary! Thanks --

Each time I look at this problem I feel like the solutions are "tricky" (using tricks to get around fundamental issues). I find myself wishing there were a simpler way. This time I connected those thoughts to other recent discussions like "caching". I think the goal of this effort, updating the get_info function, could be achieved without storing the info in the get_info function. We could import nx-parallel and step through each function, collecting the info we need. Instead, we are trying to gather that information and "cache" it in Python code within the get_info function.

Why are we building get_info to report stored literal string info rather than collecting that info directly by importing the package? This is similar to asking "why are we storing the object rather than constructing it directly?" or "why are we caching?"

The advantages of caching are speed (you don't have to construct/collect anything) and sometimes memory. The disadvantage is keeping the cached version up to date with the primary object. In this case we have two "sources of truth": the get_info function itself and the functions in the package. How do we keep them synchronized without the associated pain: merge conflicts, knowing when it is time to update the get_info function, creating and updating a bot script?

All this makes me want to take a step back and see if there is another approach. Here are some thoughts. Do you have others?
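
To make that first idea concrete, here is a minimal sketch of a get_info that collects everything at call time by importing the package rather than returning a stored literal. It is only an illustration: the lazy import, the use of inspect, and the key names are assumptions, not the current nx-parallel code.

```python
import inspect


def get_info():
    # Import lazily so the cost is only paid when the info is actually requested.
    import nx_parallel

    functions = {}
    for name, func in inspect.getmembers(nx_parallel, inspect.isfunction):
        doc = inspect.getdoc(func) or ""
        # Keep only the first line as a short description; a real version
        # would pull out the backend-specific parts of the docstring.
        functions[name] = {"summary": doc.split("\n", 1)[0]}

    return {
        "backend_name": "parallel",
        "project": "nx-parallel",
        "package": "nx_parallel",
        "functions": functions,
    }
```

With something like this there is a single source of truth (the docstrings themselves) and nothing to regenerate, at the cost of importing the backend whenever the info is requested.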

eriknw commented 4 months ago

How do nx-cugraph and graphblas-algorithms handle their get_info functions?

nx-cugraph automates this:

This has worked well for us, and setting up the automation has been well worth the effort. Merge conflicts resulting from multiple PRs that touch the file have been trivial to handle: simply check out the version of the file that has the generated data, then rerun the script.

CC @rlratzel who may be interested in this discussion

graphblas-algorithms does not yet automate this and would be interested in using a shared "recommended" approach.

Schefflera-Arboricola commented 4 months ago

I'm not sure how caching could help here, because there is no input to the get_info function. We import the nx_parallel package, extract the docstrings of all the algorithms in it, and do some string manipulation to build the functions dictionary.

  • can we just have the get_info import the library and collect the info directly in real-time? What are the disadvantages of this approach?

The reason the nx-parallel docs were not appearing in the additional-backends section of a function's docs page was that get_info was extracting docstrings in real time, and that was causing a circular import error. To resolve the issue, all I did was copy-paste the output of get_info, make it an independent function, and move it from the nx_parallel folder into _nx_parallel. I think @eriknw can explain better why entry points can't do real-time extraction of docstrings. Is it just so that we don't have to import the whole backend, or is there something else too?
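
Just to illustrate the shape of that workaround (this is not the exact content of _nx_parallel/__init__.py; the keys and the single entry shown are paraphrased): script.sh regenerates a purely static function, so nothing from nx_parallel is imported when the entry point is loaded, which is what avoids the circular import.

```python
# Illustrative shape of the static get_info that the script writes out.
# Returning hard-coded strings means loading the backend-info entry point
# never triggers an import of nx_parallel itself.
def get_info():
    return {
        "backend_name": "parallel",
        "project": "nx-parallel",
        "package": "nx_parallel",
        "functions": {
            "betweenness_centrality": {
                "extra_docstring": "The parallel computation is implemented "
                "by dividing the nodes into chunks and computing betweenness "
                "centrality for each chunk concurrently.",
            },
            # ...one entry per algorithm, regenerated by script.sh
        },
    }
```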

  • How do nx-cugraph and graphblas-algorithms handle their get_info functions?

Thank you very much @eriknw for answering this! I'll look into how nx-cugraph does it and ask if I have any questions :)

Thanks!

dschult commented 4 months ago

Thanks @eriknw for that great summary -- and for the links to the code. I had found much of that but not all of it. I had not explored which parts are done in pre-commit at all.

It seems like having it in pre-commit is a convenience, but if someone makes a PR without pre-commit installed, the hook will not run. So don't we also need a CI check that the get_info function is accurate?
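
Such a check could be as simple as regenerating the info and comparing it to what is committed. A rough sketch, where generate_info is a hypothetical stand-in for whatever script.sh uses to rebuild the dictionary (it is not an existing nx-parallel helper):

```python
# ci_check_get_info.py: fail CI when the committed get_info is stale.
import sys

from _nx_parallel import get_info                       # committed, static version
from _nx_parallel.update_get_info import generate_info  # hypothetical generator


def main() -> int:
    if get_info() != generate_info():
        print("get_info is out of date; run _nx_parallel/script.sh and commit the result.")
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```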

dschult commented 4 months ago

@Schefflera-Arboricola, there seems to be a difference between importing the library and getting the info using Python's inspect tools vs. walking the repo and pulling the info out of the Python code/docstrings. I guess the inspect tools get the info as Python sees it after importing the function, while walking the repo gets it as it is specified in the code. Those should normally be the same, but they might differ if some mutation of the info occurs during the import process.
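
For a single function the two approaches look roughly like this (a sketch only; the function and file names are examples and may not match the repo layout exactly):

```python
import ast
import inspect
from pathlib import Path

# (1) Via import: the docstring as Python sees it after the module is loaded,
# including anything decorators or import-time code may have changed.
import nx_parallel

live_doc = inspect.getdoc(nx_parallel.betweenness_centrality)

# (2) Via the repo: parse the source file without importing anything.
source = Path("nx_parallel/algorithms/centrality/betweenness.py").read_text()
static_doc = next(
    ast.get_docstring(node)
    for node in ast.walk(ast.parse(source))
    if isinstance(node, ast.FunctionDef) and node.name == "betweenness_centrality"
)

# Normally live_doc == static_doc; they differ only if the docstring is
# mutated during import, which is the distinction described above.
```

The repo-walking version also works without importing the package at all, which matters for the entry-point/circular-import constraint mentioned earlier.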

Do you have any intuition about which approach is better, or a reason to choose one over the other?