networkx / nx-parallel

A networkx backend that uses joblib to run graph algorithms in parallel.
BSD 3-Clause "New" or "Revised" License
23 stars 20 forks source link

Adding a `pre-commit` hook to update `get_info` #55

Closed Schefflera-Arboricola closed 3 weeks ago

Schefflera-Arboricola commented 5 months ago

What does this PR currently do?

Adds a pre-commit hook that updates the dictionary returned by the get_info function.

What this PR initially did?

Solves the issue of nx-parallel's info not appearing in the "Additional backend implementation" box(ref. ss):

Screenshot 2024-03-22 at 1 15 56 AM

WIP : running pre-commit updates the get_info but it also fails every time

dschult commented 5 months ago

Yay for getting the doc_strings to show in the NX documentation!

Does it fail because it changes the files? If so, after running it once the files should be changed -- so git add and then try to commit again should work. I guess that is an extra step -- but does it work?

Schefflera-Arboricola commented 5 months ago

the hook updates the __init__.py every time the pre-commit is run but we get into the loop of “git add .” and “pre-commit”, and the ruff-format and update function info hooks fail alternatively, as shown below:

(base) aditi@Aditis-MacBook-Air nx-parallel % pre-commit
Update function info.....................................................Passed
blacken-docs.............................................................Passed
prettier.................................................................Passed
ruff.....................................................................Passed
ruff-format..............................................................Failed
- hook id: ruff-format
- files were modified by this hook
1 file reformatted
(base) aditi@Aditis-MacBook-Air nx-parallel % git add . 
(base) aditi@Aditis-MacBook-Air nx-parallel % pre-commit
Update function info.....................................................Failed
- hook id: update-get_info
- files were modified by this hook
blacken-docs.........................................(no files to check)Skipped
prettier.................................................................Passed
ruff.................................................(no files to check)Skipped
ruff-format..........................................(no files to check)Skipped
(base) aditi@Aditis-MacBook-Air nx-parallel % 

And running black on init.py just after the update in the same hook, didn't seem to resolve this. A solution to this might be to have a bash script that only updates the get_info when there’s any actual difference.

dschult commented 5 months ago

Yes, you have a loop: 1) update-get_info writes code that is not formatted nicely. 2) ruff formats that code (though it still is not formatted with triple quotes and line-breaks to make it more readable)

Each makes the file init.py look different. So one of the two will not be satisfied.

Looks like you could make update-get_info create a well formatted file. Or you could tell ruff to ignore that file (but then the file isn't human readable).

A kind of a cheat is to have update-get_info run ruff against the __init__.py file after writing to it. I have push that up to this branch. I don't know a good way to use python to write python code that is formatted well. We would need a way to run ruff from within python on a string instead of running it on an existing file.

Schefflera-Arboricola commented 5 months ago

I added a script that would update the __init__.py only if there are any updates. However, there is a problem that repo: local in the pre-commit hook will refer to the main branch of the main nx-parallel repository when the lint workflow is being executed in gh workflows and not the PR's branch of the forked repo. So, if someone makes some changes to the docs and then runs their pre-commit locally and sees it's passing for all hooks, and then they will submit a PR and their style/format test would fail here on GitHub(under the PR) because pre-commit is not running on their PR's branch(it's running on nx-parallel's main). One way to resolve this might be to clone/download the forked repo in lint.yml and then run pre-commit on that. But, to clone the contributor's fork every time lint workflow tests are run doesn't seem very efficient to me.

Also, I think if we would want to merge a non-upto-date branch(that doesn't even update any docs) in the future then we might face unnecessary merge conflicts in the __init__.py file. So, I think I need to think about this a bit more, like, if we should even put update-get-info in pre-commit. For now, I've created a separate PR that adds the script to update the get_info and for other changes. (@dschult )And the ideal flow for merging/reviewing them would be:

PR https://github.com/networkx/nx-parallel/pull/56 ------(rerun the update script in PR https://github.com/networkx/nx-parallel/pull/57 )------> PR https://github.com/networkx/nx-parallel/pull/57 ------> PR https://github.com/networkx/nx-parallel/pull/55

dschult commented 4 months ago

I've been thinking more about how to make this simple -- and yet still work. After some playing around I was able to get your script to create python code that ruff accepts without changes. The approach created indented dict structures in text form. So rather than convert the whole func dict in one str function, we can construct a string inside get_funcs_info with human-readable indentation and use that to create the get_info function.

That would short-cut the cycle of update_get_info -- ruff -- update_get_info...

Your comment about having lots of merge conflicts is certainly true. Though there might not be as many as you think because any new functions or doc_string changes would occur in a different place in the file. So only if the changes happened to be adjacent would there be a merge conflict.

I'm think I will go ahead and reopen this PR and try the changes as follows: Replace return funcs with

    indent = "\n" + " " * 12
    out = "{"
    for func, finfo in funcs.items():
        out += indent + f'"{func}": {{' + indent + f'    "url": "{finfo["url"]}",'
        out += indent + f'    "additional_docs": "{finfo["additional_docs"]}",'
        out += (
            indent
            + f'    "additional_parameters": """{finfo["additional_parameters"]}""",'
        )
        out += indent + "},"
    out += "\n        },\n    }\n"
    return out

And then at the end of the file, take out the close bracket, and no str needed:

    f.write(string + get_funcs_info())
Schefflera-Arboricola commented 4 months ago

What have I done?

1. testing

Added a blah blah blah to docstring for testing. And I kept adding/deleting this in further commits for testing simultaneously.

2. rm script.sh

removed the script because it only had 2 lines and I joined the 2 commands with && and directly passed in the entry parameter

entry: python _nx_parallel/update_get_info.py && git add _nx_parallel/__init__.py

but this was failing the style error(ref. https://github.com/networkx/nx-parallel/actions/runs/9014305540/job/24766731863?pr=55). I think maybe because after running the first command a file was getting modified, so the pre-commit hook failed before running the git add command.

3. added the updated script and rm string manipulation

I changed the entry point to the script.sh again and removed the string manipulation used to add indentation and instead formatted the __init__.py file with black and ruff format before doing git add. I was trying to see if we can approach this without manually doing all the styling and if we could directly use ruff and black. Because then we won't have to change our code if the black or ruff gets updated. The script(at this stage):

python _nx_parallel/update_get_info.py

# Check if the __init__.py file has been modified
if [[ $(git status --porcelain _nx_parallel/__init__.py) ]]; then
    black _nx_parallel/__init__.py
    ruff format _nx_parallel/__init__.py
    git add _nx_parallel/__init__.py
fi

4. styling

this line(in the docstring extraction code) was removed for some reason and i thought maybe that was failing the style test now on gt wf, so i added it back. Line: par_docs = par_docs.replace("\n", " "). but it was still failing so I decided to remove the if condition in the next commit...

5. updated script.sh

..but this didn't work either.

6. rm git add from script

I wasn't sure if we should do git add in the pre-commit. because my understanding was that it is done after we run pre-commit. so I removed it but then I later added it(in commit/point 9) to get rid of the style test failure about having a modified file.

7. added pip install to script

added pip install black ruff in the script.sh because getting error something like black not found, ruff not found.

8. update update_get_info.py

fixed the indentation of the line I added in commit(point) 4.

9. added git add to script

added git add _nx_parallel/__init__.py as The hook must exit nonzero on failure or modify files.(ref. doc

10. rm blah

removed the testing blah

11. testing

added blah in the docstring for testing and didn't run pre-commit locally so the __init__.py file wasn't updated and here in gh wf the style test is passing when it should be failing(so we should not do git add _nx_parallel/__init__.py at the end, but when we don't we fail the pre-commit hook as soon the file gets updated(before we can style it).). so my next thought was to add back your(@dschult ) indentation code and remove the git add and then test it. which I did in the next 2 commits.

12. added the indentation code back and rm git add

… but removing git add was failing the style test(ref. https://github.com/networkx/nx-parallel/actions/runs/9016157272/job/24772188632?pr=55)

13. rm blah - testing - no pre-commit locally

self-explanatory; this is failing but not for the reason it should fail.

Thank you!

dschult commented 4 months ago

but this was failing the style error(ref. https://github.com/networkx/nx-parallel/actions/runs/9014305540/job/24766731863?pr=55). I think maybe because after running the first command a file was getting modified, so the pre-commit hook failed before running the git add command.

Maybe connecting the two commands with || instead of && would work then? The second will be run only when the first is not successful; which I think means it is only run when the files change. Would this work?

I wasn't sure if we should do git add in the pre-commit. because my understanding was that it is done after we run pre-commit. so I removed it but then I later added it(in commit/point 9) to get rid of the style test failure about having a modified file.

I think the pre-commit is run when git commit is run -- so it is OK to use git add within a pre-commit entry. Also, I don't think pre-commit is run by github, but I could be wrong. The github actions are run, but they don't use pre-commit. So changing something in pre-commit and then not running pre-commit locally and pushing it up shouldn't trigger a pre-commit run. Again, I could be wrong about this. Does it align with what you are seeing?

Schefflera-Arboricola commented 4 months ago

I think the pre-commit is run when git commit is run

No, and the style gihub wf runs pre-commit, ref. https://github.com/networkx/nx-parallel/blob/b07b89a427f8f955b4aa3fc864137ae8106505c3/.github/workflows/lint.yml#L30

dschult commented 4 months ago

Got it -- that makes me feel better about the whole pre-commit dependence. We are checking it after it arrives at github too. :) Thanks for pointing me to that check in the style CI -- I should have noticed it already. :)

dschult commented 1 month ago

I'm looking to review this PR about using pre-commit to update get_info. The PR diff looks very simple. There doesn't seem to be any change in the script file. The only substantial change I can see is that the get_funcs_info now sorts by the function name.

Can you describe what this PR does and how it helps update the pre-commit while avoiding the circular conflicts we were having before?

Schefflera-Arboricola commented 1 month ago

Recently, when I was revisiting this PR, I observed that the style test was failing on Github(but passing locally) because, for some reason, the ordering of the functions in the get_info's returned dictionary was not the same for my local machine and for the GitHub CI here. I thought it was because of some version inconsistency, or the ordering of the pre-commit hooks, or any recent updates in pre-commit, ruff, black or python. But none of these were the case, as indicated by the commits above. However, I ended up adding the versions explicitly for all the libraries. 

To figure this out, I dug a layer deeper and tried to see if there was any inconsistency in the update_get_info.py script. And I quickly realised that the ordering of the functions was differing because the order in which my machine and the order in which GitHub CI were extracting the files and functions from the nx_parallel module were different. So, I first tried sorting the files extracted and then the functions in those files, but still it was failing because some files have the same names, like connectivity.py, which exists in both the approximation and the connectivity modules. So, I decided to sort the returned function dictionary instead.

Please LMK if you have any further questions. Thank you :)

Schefflera-Arboricola commented 3 weeks ago

@dschult Please LMK if it would be possible to get this PR also reviewed and merged before GSoC ends(on August 26th) based on your schedule, so that I can accordingly mention it in my GSoC final report. I'm open to scheduling a meeting to discuss this PR if that would help you review this better and faster.

Thank you :)

dschult commented 3 weeks ago

Yes -- let's get this PR finished up and merged this week.

It looks like it works now (with the function names being sorted). And it appears to work on CI too. Is there more to do, or is it ready to merge as far as you are concerned?

Some questions: Did you consider using Python's inspect features rather than walking through the source tree? It has functions like inspect.getdoc and tools like for mod in inspect.getmembers(nx_parallel, inspect.ismodule):. I'm not sure of the tradeoffs between walking the source tree and walking the imported module. Just wondering if you had tried it and decided the source tree is better.

:)

Schefflera-Arboricola commented 3 weeks ago

It looks like it works now (with the function names being sorted). And it appears to work on CI too. Is there more to do, or is it ready to merge as far as you are concerned?

Looks ready to me :)

Some questions: Did you consider using Python's inspect features rather than walking through the source tree? It has functions like inspect.getdoc and tools like for mod in inspect.getmembers(nx_parallel, inspect.ismodule):. I'm not sure of the tradeoffs between walking the source tree and walking the imported module. Just wondering if you had tried it and decided the source tree is better.

I don't think I tried using inspect or anything else when I added the update_get_info.py script in (I think) PR https://github.com/networkx/nx-parallel/pull/53 . So, I'm also not sure which one is a better choice but I don't think this is a blocker here and we should probably handle this in a separate PR. But, it's good to have this noted somewhere. LMK what you think.

Schefflera-Arboricola commented 3 weeks ago

We probably don't want the version numbers of pre-commit and ruff to be pinned, right? And do we want to use python3 instead of python in script.sh?

All of this was made explicit in the process of trying to figure out what exactly was causing the inconsistency in the order of the functions in get_info. And, I decided to keep them like this because explicit is better than implicit 😅