Closed eriknw closed 1 year ago
Hi @eriknw!
We're very glad to see you all have gone ahead with a full submission, after discussion in https://github.com/pyOpenSci/python-package-guide/issues/21#issuecomment-1368046000 as you linked above.
I just want to welcome you and let you know we are working on this.
I will finish the initial editor checks by the end of this week. @lwasser is traveling and we will need to co-ordinate about editors, but I expect to get back to you about that by the middle of next week at the latest.
Thank you for providing all the detail in the submission. The context you've provided will be helpful for the review (and will definitely help me with editor checks!). It sounds like you've anticipated some points @lwasser brought up when we discussed in Slack. Looking forward to helping you improve the docs and giving you some blog post material! :grin:
These are the basic checks that the package needs to pass to begin review. Please check our Python packaging guide for more information on the elements below.
- pip install python-graphblas, then import package-name, in a Python 3.10 venv on Linux
- README.md file with clear explanation of what the package does, instructions on how to install it, and a link to development instructions.
- CONTRIBUTING.md file that details how to install and contribute to the package.
- Code of Conduct file.
- YAML header of the issue (located at the top of the issue template).

This package passes all checks; we can begin review.
It's obvious the developers and maintainers have done a ton of work.
Our goal here should be to help make sure everyone can appreciate how much work they have done, and how much functionality is packed into python-graphblas.
Along those lines, some notes for the review (none of these need to be addressed before we start):
@eriknw @jim22k, @SultanOrazbayev the tl;dr is that python-graphblas passed editor checks 🙂 🎉
Like I said above, I'll need to co-ordinate with @lwasser who is traveling about an editor for this review, but would expect us to reply back here by middle of next week at the latest
Hooray! 🎉
Thanks @NickleDave. We appreciate the attention y'all are giving us, and thanks for telling us what (and when) to expect next. We're in no particular rush--it's more important to give the right people the right amount of time to do things right :)
it's more important to give the right people the right amount of time to do things right
🙌
finding the right people now! 🙂
Hi again @eriknw, @jim22k, @SultanOrazbayev -- just letting you know that I did have a chance to talk with @lwasser now that she has returned, and we are in the process of finding an editor
When you have a chance could you please (all) fill out the pre-review survey?
It's here: https://forms.gle/F9mou7S3jhe8DMJ16
We appreciate each maintainer of the package filling out this survey individually. 🙌 Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌
(I know it's easy to miss in the template)
Hi @eriknw, @jim22k, @SultanOrazbayev, brief update:
very happy to inform you that @tomalrussell will be guest editor for this review! 🎉 NetworkX contributor, developer of spatial & network tools like snkit.
I will let @tomalrussell take it from here!
Hi @eriknw, @jim22k, @SultanOrazbayev, and thanks to @NickleDave for the introduction.
I've reached out to potential reviewers, and incidentally look forward to taking a closer look at python-graphblas myself. I'll update here as soon as I can, definitely within a week.
:wave: Hi @sneakers-the-rat and @szhorvat! Thank you for volunteering to review for pyOpenSci!
The following resources will help you complete your review:
Please get in touch with any questions or concerns! Your review is due in three weeks: 29th March 2023
Hello everyone 👋
I'm excited to do this review and learn more about the GraphBLAS approach in general.
I plan to do the review gradually, and through continuous communication with the authors. I will make it clear when I consider the review to be complete. Feel free to respond to anything I might bring up before then. The same applies to the review checklist: I will post it below today, and will check off boxes gradually.
Any issues I open for python-graphblas will have a title prefixed with [pyos] and a link back here. You'll probably see me in the discussion forum as well, as I will likely need a bit of help while trying to solve a few toy problems with the library.
Expect comments mostly on mathematical aspects, correctness, docs, and usability from me. Hopefully other reviewers will cover the more technical aspects of Python.
I will aim to complete the review by April 2nd. Since the authors will have the opportunity to address concerns before then, I hope this little delay over the 3 week deadline will be fine. I will not be available during the week of the 20th.
Let me know if you'd like any changes to this arrangement—I can be flexible.
For transparency, I should note that I am involved with the igraph project (https://igraph.org/). igraph is not a competitor to python-graphblas, but it does have aims similar to NetworkX and, by extension, to graphblas-algorithms. I think this review will be a good opportunity for us to learn from each other.
Review is now complete.
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
pyproject.toml file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is tall. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.
The package contains a paper.md matching JOSS's requirements with:
@tomalrussell Could you please clarify if the authors are expected to follow all the above checklist items to the letter? (E.g., all mentioned badges mandatory, links to all "vignettes" from the README, all functions have examples of use, etc.)
@eriknw :
includes documentation with examples for all functions.
- We're working on this! The C API and SuiteSparse:GraphBLAS C library are both well documented. We have a very large API surface area to cover, so "documentation with examples for all functions" is a really, really high bar, but one I hope we achieve someday :)
Indeed, the documentation of most Python projects won't give usage examples for every single function, and doing so would definitely be a lot of work. But just to show that it is possible, and often tremendously useful to users, I wanted to point to Mathematica's documentation where each function has not one, but many examples. See e.g. LinearSolve. I tried to follow the same with my IGraph/M package for Mathematica, but it's still a work in progress. R packages also often have at least one example for each function.
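To make the "examples for every function" idea concrete in Python terms: docstring examples can double as tests via the standard library's doctest module, which keeps examples from drifting out of date. A minimal sketch (the degree function here is hypothetical, purely for illustration, and not part of python-graphblas):

```python
def degree(adjacency):
    """Return the degree of each node in an undirected graph.

    The graph is given as a dict mapping each node to a set of neighbors.

    Examples
    --------
    >>> degree({0: {1, 2}, 1: {0}, 2: {0}})
    {0: 2, 1: 1, 2: 1}
    """
    return {node: len(neighbors) for node, neighbors in adjacency.items()}


if __name__ == "__main__":
    import doctest

    # Runs every >>> example in this module's docstrings as a test;
    # silent when they all pass.
    doctest.testmod()
```

This style is cheap to adopt incrementally, one docstring at a time, which fits the "really high bar, but one I hope we achieve someday" sentiment above.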
Indeed, the documentation of most Python projects won't give usage examples for every single function, and doing so would definitely be a lot of work.
From what I recall we had a conversation at some point about allowing the author to define what is intended as the public interface of the package and what isn't? but ya for packages that wrap another library it seems like a lot of extra work if, eg. there are examples from the main library that are trivially different (ie. could be inferred) from the wrapper's API.
I'll also be doing this JOSS-style, leaving this here and editing/raising issues as I go. I like @szhorvat's idea of using an issue tag, so i'll also prefix mine with [pyos].
I'm happy to focus on more of the python implementation side of things, glad to have someone who's more adept with the math :).
I don't have any conflicts to declare, except that I'm going to be writing some triplet store code soon, but that doesn't really relate or create a conflict imo.
I don't have an expected completion date, but i have this in my calendar as a daily todo item and welcome being relentlessly pinged if i am the one holding us up :)
Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide
The package includes all the following forms of documentation:
pyproject.toml file or elsewhere.

Readme file requirements: The package meets the readme requirements below:
The README should include, from top to bottom:
NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is tall. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)
notebooks folder in the repo root) in the README.

CITATION file: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:
In trying to summarize what the functional claims of the software were, I think it basically boils down to
[x] Wraps GraphBLAS - the software does indeed wrap the graphBLAS API, so mission accomplished there as far as I can tell. I did not do an exhaustive 1:1 check of graphblas API functionality and whether it's wrapped by the package, but don't think that's necessarily the most important thing to do. Not exactly sure how the version of the API is handled, so opened an issue: https://github.com/python-graphblas/python-graphblas/issues/424 and another on the structure of the code: https://github.com/python-graphblas/python-graphblas/issues/425
[x] Collections - https://python-graphblas.readthedocs.io/en/stable/user_guide/collections.html - the tests (test_matrix, test_vector, test_scalar) test the functionality I would expect here.
[x] dtypes - https://python-graphblas.readthedocs.io/en/stable/user_guide/types.html - dtypes are all mapped to numpy, numba, and ctypes. tests are also good here.
[x] operators - after https://github.com/python-graphblas/python-graphblas/issues/429 I understand the implementation better and think the developers know how to improve the documentation to make this easier for future ppl to figure out. I like the descriptions of the different operators in the docs; they make it pretty clear what's going on. I would recommend linking out to examples here. That would be easiest to do by putting the examples in the operator docstrings so they can be defined in one place - "see the class documentation for this operator" - which you can then be confident has examples/etc. without needing to point to a bunch of different places for different kinds of documentation. I like the care that is taken with the infix operators and the op namespace.
[x] operations - some issues with examples in the documentation: https://github.com/python-graphblas/python-graphblas/issues/434 Otherwise looks good
[x] graphblas initialization - works as expected
[x] i/o - networkx, scipy.sparse, pydata sparse, numpy all work as expected. awkward is undocumented (https://github.com/python-graphblas/python-graphblas/issues/436). All of these could really benefit from type annotations since they're traversing types (https://github.com/python-graphblas/python-graphblas/issues/435). Otherwise looks good.
[x] user defined functions - seems to work. raised an issue about documenting params (https://github.com/python-graphblas/python-graphblas/issues/437)
[x] recorder - this is sick as hell, nice work y'all. I don't know enough about graphBLAS to evaluate the results of the recorder, but it works and is a great addition to the package.
[x] Performance: Any performance claims of the software have been confirmed. - the performance claims are very minimal, eg in the FAQ, and seem quite reasonable.
[x] Automated tests: - nicely done on these.
[x] Continuous Integration: Has continuous integration setup (We suggest using Github actions but any CI platform is acceptable for review)
[x] Packaging guidelines: The package conforms to the pyOpenSci packaging guidelines. A few notable highlights to look at:
N/A
Estimated hours spent reviewing: (final estimate) ~14h
python-graphblas is an exemplary package among scientific python packages, with excellent docs, tests, code quality, examples, and contributor engagement. The package is an opinionated wrapper and interface to GraphBLAS, which is well justified and differentiated from prior wrappers. Throughout the review process, the authors have been receptive to feedback and I have all faith that they will address any continuing suggestions I have in future development work. I have no hesitation recommending this package to anyone who wants to use GraphBLAS from Python.
My outstanding recommendations in the remaining open issues are all future suggestions that the authors can take or leave, none are mission-critical. I want to compliment the authors on this excellent work, I'm glad to have had a reason to have read it. I would be happy to respond to any questions the authors have about this review and otherwise continue to engage on open issues.
Library wrappers have their own sets of challenges and idioms, and python-graphblas handles them well, with some room for future improvement. Wrapping code needs to make decisions about how to map the underlying API between languages and interact with the C library. The strategy used to abstract and expose the underlying library API can have large impacts on the maintainability and readability of the code, as can strategies for keeping the version of the wrapping package in sync with the changing API of the wrapped library.
python-graphblas further complicates these choices by admirably taking on the challenge of having multiple backend implementations (numpy and SuiteSparse:GraphBLAS) and multiple export formats (NetworkX, SciPy sparse, PyData Sparse, numpy, SuiteSparse, Matrix Market).
The python-graphblas team has chosen a bifurcated abstraction style which fits the design of their wrapper. Rather than a transparent wrapper, the package introduces its own calling syntax to logically separate the i/o parts of a GraphBLAS call from the computation, which is well described and justified in the documentation.
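The "separate i/o from computation" pattern can be illustrated with a toy delayed-expression object. This is a generic sketch of the pattern only, with hypothetical class names, not python-graphblas's actual classes or API:

```python
class Expr:
    """A delayed computation: described now, evaluated on assignment."""

    def __init__(self, func, *args):
        self.func, self.args = func, args

    def evaluate(self):
        return self.func(*self.args)


class Vec:
    """Toy vector whose __lshift__ (the << operator) triggers evaluation."""

    def __init__(self, data):
        self.data = list(data)

    def ewise_add(self, other):
        # Returns a description of the work, not the result.
        return Expr(lambda a, b: [x + y for x, y in zip(a, b)],
                    self.data, other.data)

    def __lshift__(self, expr):
        # `out << expr` is where the computation actually runs and
        # the output location is known.
        self.data = expr.evaluate()


v = Vec([1, 2, 3])
w = Vec([10, 20, 30])
out = Vec([0, 0, 0])
out << v.ewise_add(w)   # build the expression, then assign into `out`
print(out.data)         # [11, 22, 33]
```

Splitting "what to compute" from "where to put it" is what lets a GraphBLAS-style API attach masks, accumulators, and output types at assignment time.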
At the time of reviewing:

- classes and methods are implemented in the core module (eg core.vector.Vector)
- operators are exposed in top-level namespaces (eg graphblas.unary)
- these are populated by the core.operator module, which then makes the symbols available in the top-level module namespace as a side effect of importing (eg graphblas.unary.ss)
- the core module also implements the other syntax elements, like infix operators, that are used throughout the package.

This structure makes for a nice user-facing interface, being able to refer to classes and methods in a natural way (eg gb.Vector, gb.unary.abs), obscuring the underlying code structure.
The tradeoff is the significant amount of indirection and implicitness in the code, which presents a relatively high barrier to entry for new contributors. The mappings between the SuiteSparse GraphBLAS implementation and Python are programmed in several places, eg. within a method and then again in the tests, which the authors describe as being useful as a consistency check. Having some parts of the library depend on import-time side effects is less than optimal and makes the code more difficult to reason about, but is certainly not fatal to the usability of the package. Again, the nature of a wrapping package requires making decisions about abstraction, so the only concern I have about the current code structure is the impact on maintainability. The current maintainers seem to have no trouble reasoning about the package, and I believe they are aware of these challenges and are actively working on them (they refactored a formerly massive operator file during the course of the review: https://github.com/python-graphblas/python-graphblas/pull/420). The maintainers can do themselves and future contributors a favor by writing some additional developer docs that explain the structure of the library in greater detail, and I believe they will!
The other question I had was about the relationship between versioning in the API implementation in Python and in the underlying C library. GraphBLAS is a relatively mature API and seems to be only rarely changed, with care for backwards compatibility, so this is less of a concern than for a wrapper around a more actively evolving API. The authors have chosen to formally support a specific version of GraphBLAS ( https://github.com/python-graphblas/python-graphblas/pull/441 ) rather than build version-compatibility infrastructure within the package, which seems like an appropriate decision to me, given their other comments on how refactoring their backend system is on their development roadmap.
I want to also emphasize several of the "nice to have" features in the package, including a very impressive Recorder class that can keep a record of all the GraphBLAS calls made for debugging and reproducibility, automatic compilation of user-supplied functions including lambdas using numba, and the excellent i/o module. These indicate the authors are actively invested in user experience above and beyond the tidy API they expose.
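The essence of a call recorder like the one praised above can be approximated in a few lines: wrap each API function so that every call is appended to a log before being executed. The sketch below is a toy illustration of the idea only, not python-graphblas's Recorder implementation:

```python
import functools


class Recorder:
    """Collects a human-readable log of calls made through wrapped functions."""

    def __init__(self):
        self.log = []

    def record(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Log the call before executing it, for debugging/replay.
            self.log.append(f"{func.__name__}{args}")
            return func(*args, **kwargs)
        return wrapper


rec = Recorder()


@rec.record
def mxv(matrix_name, vector_name):
    # A hypothetical stand-in for a matrix-vector multiply call.
    return f"{matrix_name} @ {vector_name}"


mxv("A", "v")
mxv("A", "w")
print(rec.log)  # ["mxv('A', 'v')", "mxv('A', 'w')"]
```

A real recorder additionally has to capture the low-level C calls rather than the Python-level ones, which is what makes python-graphblas's version genuinely useful for reproducibility.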
Aside from the above comments, the fundamentals of the package design are strong: modern packaging with setuptools, automated code quality checks, CI, and coverage reports. The decisions made in code structure seem principled, responsive to the constraints of the problem, and result in a very usable user interface - my compliments to the Authors.
The package is well documented from an introduction to the problem that GraphBLAS attempts to solve through package design decisions and practical examples. It is suitable for a general audience with some exposure to graph operations and programming in python, which is impressive given the highly specialized nature of the library.
Future suggestions for the authors include embedding their example notebooks in the documentation, and improving the API documentation. Currently, I assume due to some of the abstraction decisions made in the rest of the package, there is not comprehensive documentation of every operation available to the user that might otherwise be accomplished by calling autodoc over a class or a module, but some common operators are listed here, and the list of available operators is present in the GraphBLAS documentation as well as through object inspection within an interactive Python session. Given the nature of wrapping code, where the underlying operations are well documented elsewhere, this is less of a problem than it would be in other packages.
Altogether the docs are excellent with several clear points of improvement, but far above average in the landscape of scientific python packages.
Who among us can claim 100% test coverage? https://coveralls.io/github/python-graphblas/python-graphblas
The tests are well organized and comprehensive, and I was able to find a corresponding test for every package feature I looked at easily. In more security-sensitive contexts one would want to do more adversarial input fuzzing, but I don't think that's all that relevant here since I've never seen graph data analysis libraries used as a malware vector. I have no notes on the tests, this is good work.
Thanks both, great start 💯 - I'll aim to check in occasionally or as needed.
Could you please clarify if the authors are expected to follow all the above checklist items to the letter? (E.g., all mentioned badges mandatory, links to all "vignettes" from the README, all functions have examples of use, etc.)
I would comment on anything you notice and let the authors respond, we can always exercise judgment if "to the letter" seems unhelpful.
hey y'all. We normally prefer that reviews happen "all at once", in the sense that we'd prefer the text of the review NOT to change once submitted, and the conversation to happen after. I have reached out to JOSS about how they implement their reviews, but I don't want to change our policy on how reviews happen (submitted all at once) until i've spoken to JOSS. @sneakers-the-rat @szhorvat it's fine if you want to leave the review text, check things off, and open issues as you go, but i prefer that the review text you add be added all at once, to avoid any confusion regarding when your review is complete and what the maintainer of the package should focus on. Many thanks for understanding. I will update once i hear back from JOSS, but i don't want to modify our process on the fly until we've thought things through more completely.
many thanks for your time y'all!
one other note - i think it's great to open issues as you go but one other element that is important is documentation of what changed and why so there is a full record of the review so please if you open issues be sure to reference them in the text of the review in the context of why you opened them. that will allow the editor (and us as an organization) to keep track of the review in one place. again many thanks! we are learning as an organization in this process
fair enough @lwasser :) typically the way it works at JOSS is that you'll be opening issues on the repo and then linking to the review issue (this one) so that basically the review issue serves as a timeline of changes and discussion for the review (all edits to the review checklist comment are also logged). Happy to also do the inverse (link back to opened issues) as well.
So - for this review, will only post text of review when checklist finished, is that what you had in mind?
Some of the review checklist items feel a bit R-ish and I wonder if these are intentional, or they just haven't been updated yet after borrowing from rOpenSci. Is the term "vignette" used in the context of Python? Examples for each function are not typical for Python either (though I have to say that in my personal opinion it wouldn't be bad to borrow this habit from R 😉 )
I wonder if these are intentional.
Yes. If you prefer the term "tutorial", then feel free to use that.
though I have to say that in my personal opinion it wouldn't be bad to borrow this habit from R
Exactly.
For example see discussion of vignettes here on PyGMT: https://github.com/pyOpenSci/software-submission/issues/43#issuecomment-994023562
Or discussion of examples here on the jointly review: https://github.com/pyOpenSci/software-submission/issues/45#issuecomment-1001172229
Happy to also do the inverse (link back to opened issues) as well
@sneakers-the-rat if you can also link in comments that just helps make sure we don't miss it
So - for this review, will only post text of review when checklist finished, is that what you had in mind?
yes please
@szhorvat @sneakers-the-rat the checklists and workflow we have are fairly well established and up to date. They're designed in part so that @lwasser can parse the reviews in an automated way.
We're of course happy to learn from other communities-- @lwasser frequently talks with Arfon from JOSS, for example--but please let's try to follow the workflow we have in place for now.
I realize from how this review has started we probably can do more to make the workflow clearer. We're happy to hear feedback after the review on how we can improve the workflow and make it clearer--we can discuss in Slack or in the forum.
Thank you!
@NickleDave Although I did not respond in words (just a 👍 ), I noted the comment from @lwasser and I am happy to follow (I believe I am following) the process. If I missed anything, please point it out explicitly, and I'll do my best to correct it.
Thank you @szhorvat maybe I shouldn't comment on GitHub issues when I wake up grumpy :flushed:
I do hope the previous reviews linked will help though--we can probably add links to those in reviewer docs.
thank you all! @sneakers-the-rat @szhorvat we are super happy to discuss this more (maybe we could do this in discourse as an open spot to discuss?). As @NickleDave said, we want to learn from other communities; however, it's also important that our review approach is consistent. As such we want to discuss as a community before making changes to our processes for review. I hope that makes sense. I am looking forward to chatting more!! And appreciate everyone's time here on this review!
@eriknw - clarification q on your intended audience statement above for the sake of providing feedback on the docs: are you expecting people to have prior experience with graphBLAS? or is the intention to be able to bring someone with ~reasonable understanding of sparse data or graphs up to speed as well? Would be happy to be the latter kind of reader and let you know what parts are clear/unclear if that's the intention, but if the expectation is for someone to already understand graphBLAS i'll spare u.
edit: eg. this - https://python-graphblas.readthedocs.io/en/stable/getting_started/primer.html and this - https://python-graphblas.readthedocs.io/en/stable/user_guide/fundamentals.html#c-to-python-mapping are a little uneven as far as expected expertise imo and i'd be happy to help smooth that out (but i appreciate the work y'all have done on the docs here)
FWIW - i think that it could be nice if the overview provides enough context for a user less experienced with graphBLAS to follow. it might welcome more users. for instance i was unfamiliar with the term "sparse matrix". then i looked it up and quickly realized i've used that kind of data structure before. so i was familiar with the data type but not the word. of course @eriknw knows best what is feasible here in terms of helping inform users of the package's goals.
my two cents :)
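For readers in the same position: the "sparse matrix" idea mentioned above just means storing only the entries that exist, for example as a dict keyed by (row, col), instead of a full grid that is mostly zeros. A minimal sketch (not any particular library's format):

```python
# A 1000x1000 matrix with only three stored values: instead of a million
# cells, keep just the entries that exist.
sparse = {(0, 0): 1.0, (2, 999): 5.0, (500, 3): -2.5}


def get(matrix, row, col):
    """Entries not stored are implicitly zero (GraphBLAS treats them as
    'missing', which is subtly different, but the storage idea is the same)."""
    return matrix.get((row, col), 0.0)


print(get(sparse, 2, 999))  # 5.0
print(get(sparse, 7, 7))    # 0.0
```

Real formats (CSR, COO, etc.) use flat arrays rather than a dict for speed, but the core trade is the same: memory proportional to the number of nonzeros, not to the full dimensions.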
are you expecting people to have prior experience with graphBLAS? or is the intention to be able to bring someone with ~reasonable understanding of sparse data or graphs up to speed as well?
Yeah, I think we definitely want to support both, and I know there's work to do on the docs, so all feedback is appreciated. It would be great if we had documentation at an even higher level too such as example applications that data scientists can use without even needing to know much about graphs or sparse.
My personal priority right now is to improve maintenance documentation so that maintenance can continue smoothly if I disappeared for a time (for any reason). My second documentation priority is to establish best practices for how to add examples to docstrings. We have other maintainers and contributors who have improved and plan to continue improving documentation.
of course @eriknw knows best what is feasible here in terms of helping inform users of the package's goals.
I think you're giving me (and engineers in general) too much credit! I/we are sometimes much too close to engineering problems to remember the bigger picture of "who cares and why?". I would like for us to pivot to a "product-led growth" mindset/orientation this year now that much of the hard engineering is done.
@eriknw thank you for this feedback. so it sounds like it might be useful for @sneakers-the-rat @szhorvat to provide feedback from their perspective about what is confusing in the documentation. (such great questions you both asked erik!!) i wonder if there is some way for us to also help you with revisions if you need that support, given your focus on maintenance docs (which i agree is profoundly important for long-term maintenance). it's something that i've been thinking about, but we don't have a huge team to support that just yet...
Hi all - just checking in with @sneakers-the-rat and @szhorvat - thanks for the engagement, discussion and issues opened so far. Are you both on track to submit your initial reviews by the 29th?
(For context - the pyopensci process aims for 3 weeks for review, 2 weeks for subsequent changes, and 1 week for reviewer approval of changes.)
I arrived back on Sunday and immediately came down with a bad flu, so I won't be able to finish by tomorrow. If I recover in time, I still hope to finish by Sunday.
Hope you feel better soon @szhorvat! No pressure from my end, just helping to set expectations.
sorry the 29th has come and gone, i've been in a bit of a writing hole, just about to finish a draft of something this week and then was planning on finishing review next week!!!!
sorry again, still writing, just pinging to say i'm still here and that i'm going to put some time in on this right now!
Alright I made it through the checklist, there are some straggling thoughts in the remaining open issues. We can hash those out there and then I'll write my final review :)
Apologies about the long disappearance. I have not forgotten about the review. A series of unexpected events prevented me from completing it. I will be able to get back to it on May 1st Monday and hopefully complete it that week.
I'd like to share most of the PRs that the pyOpenSci reviews have strongly influenced:
Since the review began, we have also:
The reviewers have also indicated where we can/should refactor code (operators!) to make it more understandable and maintainable, and specific suggestions for how to further improve documentation. I expect implementing these changes will keep us busy for most of the year.
@sneakers-the-rat @szhorvat if there are any specific issues you would like us to address before you finish your reviews (or as conditional for acceptance), please let us know and please be specific. There are so many issues, checklists, and discussions for us to read that it's hard to know exactly where things stand (but the reviews have been great--thank you!). For example, I think a specific enhancement that y'all probably really, really want is for the operator objects to be listed in the API reference of the documentation (definitely a good idea). Should we add this "quick and ugly" right now, then redo it later in the year after we refactor operators to be more declarative, or just do it later after the refactor?
checking in to say sorry I have been AWOL, have not been feeling well the past several weeks, but am back at work and plan on finishing my review tomorrow or Monday -- I have no outstanding requested changes that need to be immediate (and if it isn't/wasn't clear, I don't believe in a review process where it is possible to "fail" the review except by nonparticipation, just to take that idea off the table to the degree that it even applies here)
For example, I think a specific enhancement that y'all probably really, really want is for the operator objects to be listed in the API reference of the documentation (definitely a good idea). Should we add this "quick and ugly" right now, then redo it later in the year after we refactor operators to be more declarative, or just do it later after the refactor?
specifically on this - I would not want to encourage you to do anything that is quick and dirty or otherwise not part of your normal planned development process. I trust y'all at this point to make recommended changes as you are able to, and given that there are several ways to list those functions/methods/classes as is, and the docs are to some degree intertwined with the graphviz docs, I don't think their immediate absence is disqualifying to the "all user-facing functions documented" check.
Thanks for checking in @sneakers-the-rat, and I hope you (and everyone) are feeling better.
I realize the timing on this review has been a bit... flexible, that March 29 (https://github.com/pyOpenSci/software-submission/issues/81#issuecomment-1458089302) has come and gone, and we're pretty far off the six week schedule that pyopensci aims for (https://github.com/pyOpenSci/software-submission/issues/81#issuecomment-1484897108).
If possible, can we refocus and try to get this review finished-finished before the SciPy conference, July 10th?
We've gotten a lot of value from this review already, and I appreciate all the time and effort the volunteers have spent. Thanks.
Hi @eriknw thanks for your patience, I'm glad you've found value in the process so far, and thanks for the call to action!
I'm going to suggest a refocused timeline, given the process has already had some time for review-in-progress and response:
SORRY YES i am still here i am just trying to catch up at work. i can do the 21st for sure. i will shoot for tomorrow.
TODAY is the 21st and i am completing it TODAY
Here are my final review comments, I'll also add them into my review above. Thanks again to the authors for their work here, it was a pleasure reading the package <3
`python-graphblas` is an exemplary package among scientific python packages, with excellent docs, tests, code quality, examples, and contributor engagement. The package is an opinionated wrapper and interface to GraphBLAS, which is well justified and differentiated from prior wrappers. Throughout the review process, the authors have been receptive to feedback, and I have all faith that they will address any continuing suggestions I have in future development work. I have no hesitation recommending this package to anyone who wants to use GraphBLAS from Python.
My outstanding recommendations in the remaining open issues are all future suggestions that the authors can take or leave, none are mission-critical. I want to compliment the authors on this excellent work, I'm glad to have had a reason to have read it. I would be happy to respond to any questions the authors have about this review and otherwise continue to engage on open issues.
Library wrappers have their own sets of challenges and idioms, and `python-graphblas` handles them well, with some room for future improvement. Wrapping code needs to make decisions about how to map the underlying API between languages and interact with the C library. The strategy used to abstract and expose the underlying library API can have large impacts on the maintainability and readability of the code, as can strategies for keeping the version of the wrapping package in sync with the changing API of the wrapped library.
`python-graphblas` further complicates these choices by admirably taking on the challenge of having multiple backend implementations (numpy and SuiteSparse:GraphBLAS), and multiple export formats (NetworkX, Scipy.Sparse, PyData.Sparse, numpy, SuiteSparse, Matrix Market).
The `python-graphblas` team has chosen a bifurcated abstraction style which fits with the design of their wrapper. Rather than a transparent wrapper, the package introduces its own calling syntax to logically separate the i/o parts of a GraphBLAS call from the computation, which is well described and justified in the documentation.
At the time of reviewing:

- Classes are implemented in the `core` module (eg `core.vector.Vector`).
- Operators live in their own namespaces (eg `graphblas.unary`).
- Operators are defined in the `core.operator` module, which then makes the symbols available in the top-level module namespace as a side effect of importing (eg `graphblas.unary.ss`).
- The `core` module also implements the other syntax elements, like infix operators, that are used throughout the package.

This structure makes for a nice user-facing interface, being able to refer to classes and methods in a natural way (eg `gb.Vector`, `gb.unary.abs`) while obscuring the underlying code structure.
The tradeoff is the significant amount of indirection and implicitness in the code, which presents a relatively high barrier to entry for new contributors. The mappings between the suitesparse GraphBLAS implementation and Python are programmed in several places, eg. within a method and then again in the tests, which the authors describe as being useful as a consistency check. Having some parts of the library dependent on import-time side effects is less than optimal and makes the code more difficult to reason about, but is certainly not fatal to the usability of the package. Again, the nature of a wrapping package requires making decisions about abstraction, so the only concern I have for the current code structure is the impact on maintainability. The current maintainers seem to have no trouble reasoning about the package, and I believe they are aware of these challenges and are actively working on them (they refactored a formerly massive `operator` file during the course of the review (https://github.com/python-graphblas/python-graphblas/pull/420)). The maintainers can do themselves and future contributors a favor by writing some additional developer docs that explain the structure of the library in greater detail, and I believe they will!
The other question I had was about the relationship between versioning of the API implementation in Python and the underlying C library. GraphBLAS is a relatively mature API and seems to change only rarely, with care for backwards compatibility, so this is less of a concern than for a wrapper around a more actively evolving API. The authors have chosen to formally support a specific version of GraphBLAS (https://github.com/python-graphblas/python-graphblas/pull/441) rather than build version-compatibility infrastructure within the package, which seems like an appropriate decision to me, given their other comments on how refactoring their backend system is on their development roadmap.
I want to also emphasize several of the "nice to have" features in the package, including a very impressive Recorder class that can keep a record of all the GraphBLAS calls made for debugging and reproducibility, automatic compilation of user-supplied functions including lambdas using numba, and the excellent i/o module. These indicate the authors are actively invested in user experience above and beyond the tidy API they expose.
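The Recorder idea can be sketched generically. The class below is not python-graphblas's actual Recorder API, just an illustration of the underlying pattern: a context manager that captures each backend call as text so a session can be inspected or replayed.

```python
import contextlib

class CallRecorder(contextlib.AbstractContextManager):
    """Hypothetical recorder: collects call descriptions for later review."""

    def __init__(self):
        self.calls = []

    def record(self, name, *args):
        # Store a readable rendering of the call, e.g. for a bug report.
        self.calls.append(f"{name}({', '.join(map(repr, args))})")

    def __exit__(self, *exc_info):
        return None  # do not suppress exceptions

with CallRecorder() as rec:
    # A real wrapper would record its own C-library calls automatically;
    # here we record manually for illustration.
    rec.record("GrB_Matrix_new", "A", "FP64", 3, 3)
    rec.record("GrB_mxm", "C", "A", "B")

print(rec.calls)
```

The value for debugging is that the recorded session is a linear, replayable transcript of what the library actually did, independent of the high-level syntax that produced it.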
Aside from the above comments, the fundamentals of the package design are strong: modern packaging with setuptools, automated code quality checks, CI, and coverage reports. The decisions made in code structure seem principled, responsive to the constraints of the problem, and result in a very usable interface. My compliments to the authors.
The package is well documented from an introduction to the problem that GraphBLAS attempts to solve through package design decisions and practical examples. It is suitable for a general audience with some exposure to graph operations and programming in python, which is impressive given the highly specialized nature of the library.
Future suggestions for the authors include embedding their example notebooks in the documentation and improving the API documentation. Currently (I assume due to some of the abstraction decisions made in the rest of the package) there is not comprehensive documentation of every operation available to the user, as might otherwise be accomplished by running `autodoc` over a class or a module. Some common operators are listed here, and the list of available operators is present in the GraphBLAS documentation as well as through object inspection within an interactive Python session. Given the nature of wrapping code, where the underlying operations are well documented elsewhere, this is less of a problem than it would be in other packages.
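For the API listing, one conventional route is a Sphinx autodoc stub per namespace, roughly like the sketch below (the module path and options are assumptions for illustration, not the package's actual docs configuration; operators registered dynamically at import time may additionally need `:imported-members:` or explicit listings):

```rst
Operators
---------

.. automodule:: graphblas.unary
   :members:
   :undoc-members:
```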
Altogether the docs are excellent with several clear points of improvement, but far above average in the landscape of scientific python packages.
Who among us can claim 100% test coverage? https://coveralls.io/github/python-graphblas/python-graphblas
The tests are well organized and comprehensive, and I was able to find a corresponding test for every package feature I looked at easily. In more security-sensitive contexts one would want to do more adversarial input fuzzing, but I don't think that's all that relevant here since I've never seen graph data analysis libraries used as a malware vector. I have no notes on the tests, this is good work.
Thanks for the patience everyone.
tl;dr This is a very nice package, technically sound, with a well-thought-out Pythonic interface. My perception was that in order to bring GraphBLAS to Python users (and thus fully realize its promise of implementing graph algorithms in terms of high-level building blocks), what we need most is Python-centric documentation and training materials.
First of all, let me say that python-graphblas is a well-thought-out and technically solid package. It provides a Pythonic interface to the C-based GraphBLAS API, currently supporting SuiteSparse:GraphBLAS (which is the only usable GraphBLAS implementation available at the moment).
The review should concern python-graphblas, but it's impossible to talk about it without also discussing GraphBLAS itself. And here I must make it clear that I am a newcomer to GraphBLAS. GraphBLAS is a standardized C API aiming to provide high-level yet general building blocks for graph algorithms, similarly to how BLAS and LAPACK provide building blocks for linear algebra. It does this through some very elegant math, employing the language of linear algebra and abstract algebra, generalizing the usual `(+, *)` operations to a wider class of semirings. For example, with the usual `(+, *)` matrix product, the `k`th power of a graph's adjacency matrix gives the number of walks between any two vertices; replacing `+` by `min` and `*` by `+` gives shortest path lengths up to `k` steps instead. What is not yet clear to me is where the limits of this approach are: Are there common graph concepts that cannot be expressed in this framework? Are there some which can be expressed, but whose most efficient algorithms can't be expressed with GraphBLAS?
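To make the semiring idea concrete, here is a small sketch in plain NumPy (deliberately not python-graphblas syntax; `min_plus` is just an illustrative helper): with zeros on the diagonal, one min-plus product of a weighted adjacency matrix with itself gives shortest path lengths over at most two hops.

```python
import numpy as np

INF = np.inf

def min_plus(A, B):
    """Matrix 'product' over the (min, +) semiring:
    C[i, j] = min over k of (A[i, k] + B[k, j])."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

# Directed graph on 3 nodes: 0 -> 1 (weight 1), 1 -> 2 (weight 2).
# Zeros on the diagonal allow "staying put", so paths of fewer hops count too.
A = np.array([
    [0.0, 1.0, INF],
    [INF, 0.0, 2.0],
    [INF, INF, 0.0],
])

# One min-plus "squaring" finds shortest paths of up to 2 hops:
D = min_plus(A, A)
print(D[0, 2])  # → 3.0, the 0 -> 1 -> 2 path (1 + 2)
```

Iterating this product (min-plus "powers") extends the bound on path length, which is exactly the `k`th-power statement above.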
python-graphblas is realizing one of the major advantages of the GraphBLAS approach: it is now possible to implement graph algorithms in a convenient, high-level language like Python without the need for explicit loops, and thus without a significant loss of performance. Indeed, the python-graphblas authors have also created the graphblas-algorithms package, which achieves performance comparable to C/C++ libraries while being written in pure Python. (To be fair, it's not clear to me how much of the performance comes from parallelization, but the benchmarks are impressive.)
This brings me to a point about how python-graphblas is presented. This is the first paragraph in the docs:
> python-graphblas is a Pythonic interface to the highly performant SuiteSparse:GraphBLAS library for performing graph analytics in the language of linear algebra.
This gives the impression that python-graphblas is just an interface to a C library, perhaps even aimed at people who already understand that library. I think this is backwards: if one of the main GraphBLAS advantages is usability from high-level languages, then the target audience should be users of such languages. It seems to me that a project like this should spend at least as much effort on good documentation and training materials as on technical bindings. And this is precisely the weak point of python-graphblas. It is impossible to learn the system without referring to external material that describes the C API. In fact, the competing package pygraphblas seems to be doing a little better on this front, though still not well enough for users to avoid the C-based documentation.
My main recommendation (and I realize that this is a long-term project) is to make significant improvements to the documentation with the pure-Python user in mind.
In the intro:
More in depth:
In addition to the technically sound and well thought out GraphBLAS interface, the package contains functionality for interoperability with graph theory / network analysis tools, as well as some visualization tools. The quality of these is in need of some improvement:
I opened some visualization issues, https://github.com/python-graphblas/python-graphblas/issues/473 https://github.com/python-graphblas/python-graphblas/issues/474 https://github.com/python-graphblas/python-graphblas/issues/475
All this said, interoperability and visualization are not core functionality. As I see it, they are for convenience, and the package should not be judged based on these.
Thanks for the above, @sneakers-the-rat and @szhorvat !
@eriknw @jim22k @SultanOrazbayev - recognising that you've been engaged with the review already, and that some things may be pushed to a longer timeframe, are you happy to respond and make any priority changes by the end of the week (30th June)?
I applaud the reviewers 👏. I never expected such thorough and honest reviews. Thank you for the positive feedback and the criticisms. I agree with all of it--I think you nailed both the strengths and the weaknesses. The reviews are valuable and I suspect will help shape our vision and effort for the next couple of years.
In particular, there are two main areas we need to give more attention:
Heh, this is probably true for many projects. My main focus in the near and medium term will be maintainability.
Now if I may ramble on for a bit...
It's interesting that reviews occur at a specific moment in time. I know the history of `python-graphblas`, and, trust me, it has evolved significantly in every 6-month period of its 3.5-year lifetime. If we had been writing pristine user-facing documentation from the beginning, it would have needed to be rewritten and revised endlessly. Functionality would probably be 1.5-2 years behind where we are today. If you're curious, go back and look at versions around `1.3.8` to `1.3.14`. It's recognizable, but so different and missing so, so much!
Anyway. I want to highlight this comment:
> if one of the main GraphBLAS advantages is usability from high-level languages, then the target audience should be users of such languages
Absolutely! I agree 100%. We aspire to this. It will take time.
From a product perspective, I/we wanted to get the syntax and functionality stable enough to begin writing `graphblas-algorithms` in earnest. We are targeting networkx users and are adding dispatching to networkx.
Oh, and if you're curious why our test coverage is so high, it's for multiple reasons:
- `cygraphblas`, and thorough coverage would give me more confidence
Wrapping up... I think we have replied to all open issues from the review. They will keep us busy, that's for sure.
@jim22k @SultanOrazbayev want to say anything else? I don't think it's necessary for you to comment here.
Thanks again all, hope to see you around ❤️ !
Thanks @eriknw (also for the lovely tone and for the background and bit of history)!
@szhorvat and @sneakers-the-rat for complete clarity, can you confirm you're happy with responses?
Yes. The suggestions I made are mostly for the long term.
Same, full approval from me :)
Thanks all, time and effort very much appreciated. The review process is done!
All that's left is to wrap up, publish the version of record and acknowledge all your contributions.
🎉 `python-graphblas` has been approved by pyOpenSci! Thank you @eriknw for submitting `python-graphblas` and many thanks to @szhorvat and @sneakers-the-rat for reviewing this package! 😸
There are just a few things left to do to wrap up this submission, @eriknw, @jim22k, @SultanOrazbayev:

- Add the pyOpenSci badge to the README of `python-graphblas`. The badge should be `[![pyOpenSci](https://tinyurl.com/y22nb8up)](https://github.com/pyOpenSci/software-review/issues/81)`.
- Add `python-graphblas` to the pyOpenSci website. Please open a PR to update this file: to add your package and name to the list of contributors.

Both reviewers and maintainers (@sneakers-the-rat, @szhorvat too):
Please complete the final steps to wrap up this review. @tomalrussell, please do the following:

- Add the `6/pyOS-approved6 🚀🚀🚀` label.

If you have any feedback for us about the review process, please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.
Submitting Author: Erik Welch (@eriknw)
All current maintainers: @eriknw, @jim22k, @SultanOrazbayev
Package Name: Python-graphblas
One-Line Description of Package: Python library for GraphBLAS: high-performance sparse linear algebra for scalable graph analytics
Repository Link: https://github.com/python-graphblas/python-graphblas
Version submitted: 2023.1.0
Editor: @tomalrussell
Reviewer 1: @sneakers-the-rat
Reviewer 2: @szhorvat
Archive:
JOSS DOI: N/A
Version accepted: 2023.7.0
Date accepted (month/day/year): 07/14/2023
Description
Python-graphblas is like a faster, more capable `scipy.sparse` that can implement NetworkX. It is a Python library for GraphBLAS: high-performance sparse linear algebra for scalable graph analytics. Python-graphblas mimics the math notation, making it the most natural way to learn, use, and think about GraphBLAS. In contrast to other high-level GraphBLAS bindings, Python-graphblas can fully and cleanly support any implementation of the GraphBLAS C API specification, thereby allowing us to be vendor-agnostic.

Scope
For all submissions, explain how and why the package falls under the categories you indicated above. In your explanation, please address the following points (briefly, 1-2 sentences for each):
- Audience: anybody who works with sparse data or graphs. We are also implementing a backend to NetworkX (which supports dispatching in version 3.0) written in Python-graphblas called `graphblas-algorithms`, so we are quite literally targeting NetworkX users!
- Python-graphblas provides a faster, easier, more flexible, and more scalable way to operate on sparse data, including for graph algorithms. There are too many scientific applications to list, ranging from neuroscience, genomics, biology, etc. It may be useful wherever `scipy.sparse` or NetworkX are used. Although GraphBLAS was designed to build graph algorithms, it is flexible enough to be used in other applications. Anecdotally, most of our current users that I know about are from research groups in universities and laboratories.
- We are also targeting applications that need very large distributed graphs. We have experimented with Dask-ifying `python-graphblas` here, and we get regular interest from people who want e.g. distributed PageRank or connected components.
- The most similar package is `pygraphblas`, which hasn't been updated in more than 16 months. There are many differences in syntax, functionality, philosophy, architecture, and (I would argue) robustness and maturity. `python-graphblas` syntax targets the math syntax, whereas `pygraphblas` is much closer to C. `python-graphblas` handles dtypes much more robustly, has efficient conversions to/from numpy and other formats, is architected to handle additional GraphBLAS implementations (more are on the way!), has exceptional error messages, has many more tests and functionality, supports Windows, and much, much more. We have also been growing our team, because sustainability is very important to us.
- Although we have/had irreconcilable differences (which is why we decided to create `python-graphblas`), the authors have always been cordial. We all believe strongly in the ethos of open source, and I would describe our relationship as having "radical generosity". For example, we have an outstanding agreement that each library is welcome to "borrow" from the other (with credit). We may "borrow" some of their documentation :)
- We also worked together to create and maintain the C binding to SuiteSparse:GraphBLAS: https://github.com/GraphBLAS/python-suitesparse-graphblas/ We could use help automatically generating wheels for this library on major platforms via cibuildwheel.
@tag the editor you contacted: Limited prior discussion in this issue: https://github.com/pyOpenSci/python-package-guide/issues/21#issuecomment-1368046000
Technical checks
For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:
Publication options
JOSS Checks
- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?
This option will allow reviewers to open smaller issues that can then be linked to PR's rather than submitting a more dense text based review. It will also allow you to demonstrate addressing the issue via PR links.
Code of conduct
Other comments (manually added)
Given a product mindset, we believe that Python-graphblas is a great product, but I think our go-to-market strategy has been lacking. We have been very engineering-heavy, and even our goal of targeting NetworkX users is engineering-heavy via creating `graphblas-algorithms`. I hope this peer-review process can help us prioritize our efforts (such as a plan to improve documentation) as well as give us a place to write a blog post or two.

Please fill out our survey
P.S. Have feedback/comments about our review process? Leave a comment here.
Editor and Review Templates
The editor template can be found here.
The review template can be found here.