pyOpenSci / software-submission

Submit your package for review by pyOpenSci here! If you have questions please post them here: https://pyopensci.discourse.group/

Python-graphblas: high-performance sparse linear algebra for scalable graph analytics #81

Closed eriknw closed 1 year ago

eriknw commented 1 year ago

- Submitting Author: Erik Welch (@eriknw)
- All current maintainers: @eriknw, @jim22k, @SultanOrazbayev
- Package Name: Python-graphblas
- One-Line Description of Package: Python library for GraphBLAS: high-performance sparse linear algebra for scalable graph analytics
- Repository Link: https://github.com/python-graphblas/python-graphblas
- Version submitted: 2023.1.0
- Editor: @tomalrussell
- Reviewer 1: @sneakers-the-rat
- Reviewer 2: @szhorvat
- Archive: DOI
- JOSS DOI: N/A
- Version accepted: 2023.7.0
- Date accepted (month/day/year): 07/14/2023


Description

Python-graphblas is like a faster, more capable scipy.sparse that can be used to implement NetworkX. It is a Python library for GraphBLAS: high-performance sparse linear algebra for scalable graph analytics. Python-graphblas mimics the math notation, making it the most natural way to learn, use, and think about GraphBLAS. In contrast to other high-level GraphBLAS bindings, Python-graphblas can fully and cleanly support any implementation of the GraphBLAS C API specification, allowing us to remain vendor-agnostic.
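
A minimal sketch of what the notation looks like in practice (one step of a GraphBLAS-style BFS; constructor and operator names follow recent python-graphblas releases, so treat exact spellings such as `from_coo` as assumptions):

```python
import graphblas as gb

# Small directed graph as a sparse Boolean adjacency matrix
A = gb.Matrix.from_coo(
    [0, 0, 1, 2, 3],    # source nodes (rows)
    [1, 2, 2, 3, 0],    # destination nodes (columns)
    [True] * 5,
    nrows=4, ncols=4,
)

# Frontier: start a traversal at node 0
q = gb.Vector.from_coo([0], [True], size=4)
visited = q.dup()

# One BFS step, written the way the math reads:
# next frontier = neighbors of q that are not yet visited
q(mask=~visited.S, replace=True) << q.vxm(A, gb.semiring.lor_land)
visited(accum=gb.binary.lor) << q   # fold the new frontier into visited
```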

Scope

Please fill out a pre-submission inquiry before submitting a data visualization package. For more info, see notes on categories of our guidebook.

Audience: anybody who works with sparse data or graphs. We are also implementing a backend to NetworkX (which supports dispatching in version 3.0) written in Python-graphblas called graphblas-algorithms, so we are quite literally targeting NetworkX users!
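
As a rough sketch of the intended usage pattern (exact class and function names such as `Graph.from_networkx` and `pagerank` may differ from the released graphblas-algorithms API):

```python
import networkx as nx
import graphblas_algorithms as ga

G = nx.erdos_renyi_graph(1000, 0.005, seed=42)

# Convert once, then call the same algorithm API backed by GraphBLAS.
G2 = ga.Graph.from_networkx(G)
pr = ga.pagerank(G2)

# Reference result from NetworkX itself, for comparison.
pr_nx = nx.pagerank(G)
```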

Python-graphblas provides a faster, easier, more flexible, and more scalable way to operate on sparse data, including for graph algorithms. There are too many scientific applications to list, ranging from neuroscience to genomics to biology and beyond. It may be useful wherever scipy.sparse or NetworkX are used. Although GraphBLAS was designed for building graph algorithms, it is flexible enough to be used in other applications. Anecdotally, most of the current users I know about are from research groups in universities and laboratories.

We are also targeting applications that need very large distributed graphs. We have experimented with Dask-ifying python-graphblas here, and we get regular interest from people who want e.g. distributed PageRank or connected components.

The closest comparable package is pygraphblas, which hasn't been updated in more than 16 months. There are many differences in syntax, functionality, philosophy, architecture, and (I would argue) robustness and maturity. python-graphblas syntax follows the math notation, whereas pygraphblas is much closer to C. python-graphblas handles dtypes much more robustly, has efficient conversions to/from numpy and other formats, is architected to handle additional GraphBLAS implementations (more are on the way!), has exceptional error messages, has many more tests and much more functionality, supports Windows, and much, much more. We have also been growing our team, because sustainability is very important to us.

Although we have/had irreconcilable differences (which is why we decided to create python-graphblas), the authors have always been cordial. We all believe strongly in the ethos of open source, and I would describe our relationship as having "radical generosity". For example, we have an outstanding agreement that each library is welcome to "borrow" from the other (with credit). We may "borrow" some of their documentation :)

We also worked together to create and maintain the C binding to SuiteSparse:GraphBLAS: https://github.com/GraphBLAS/python-suitesparse-graphblas/. We could use help automatically generating wheels for this library on major platforms via cibuildwheel.

Limited prior discussion in this issue: https://github.com/pyOpenSci/python-package-guide/issues/21#issuecomment-1368046000

Technical checks

For details about the pyOpenSci packaging requirements, see our packaging guide. Confirm each of the following by checking the box. This package:

Publication options

JOSS Checks

- [ ] The package has an **obvious research application** according to JOSS's definition in their [submission requirements][JossSubmissionRequirements]. Be aware that completing the pyOpenSci review process **does not** guarantee acceptance to JOSS. Be sure to read their submission requirements (linked above) if you are interested in submitting to JOSS.
- [ ] The package is not a "minor utility" as defined by JOSS's [submission requirements][JossSubmissionRequirements]: "Minor ‘utility’ packages, including ‘thin’ API clients, are not acceptable." pyOpenSci welcomes these packages under "Data Retrieval", but JOSS has slightly different criteria.
- [ ] The package contains a `paper.md` matching [JOSS's requirements][JossPaperRequirements] with a high-level description in the package root or in `inst/`.
- [ ] The package is deposited in a long-term repository with the DOI:

*Note: Do not submit your package separately to JOSS*

Are you OK with Reviewers Submitting Issues and/or pull requests to your Repo Directly?

This option will allow reviewers to open smaller issues that can then be linked to PRs rather than submitting a denser, text-based review. It will also allow you to demonstrate addressing the issues via PR links.

Code of conduct

Other comments (manually added)

Given a product mindset, we believe that Python-graphblas is a great product, but I think our go-to-market strategy has been lacking. We have been very engineering-heavy, and even our goal of targeting NetworkX users is engineering-heavy via creating graphblas-algorithms. I hope this peer-review process can help us prioritize our efforts (such as a plan to improve documentation) as well as give us material for a blog post or two.

Please fill out our survey

P.S. Have feedback/comments about our review process? Leave a comment here

Editor and Review Templates

The editor template can be found here.

The review template can be found here.

NickleDave commented 1 year ago

Hi @eriknw!

We're very glad to see you all have gone ahead with a full submission, after discussion in https://github.com/pyOpenSci/python-package-guide/issues/21#issuecomment-1368046000 as you linked above.

I just want to welcome you and let you know we are working on this.

I will finish the initial editor checks by the end of this week. @lwasser is traveling and we will need to co-ordinate about editors, but I expect to get back to you about that by the middle of next week at the latest.

Thank you for providing all the detail in the submission. The context you've provided will be helpful for the review (and will definitely help me with editor checks!). It sounds like you've anticipated some points @lwasser brought up when we discussed in Slack. Looking forward to helping you improve the docs and giving you some blog post material! :grin:

NickleDave commented 1 year ago

Editor in Chief checks

These are the basic checks that the package needs to pass to begin review. Please check our Python packaging guide for more information on the elements below.



Editor comments

This package passes all checks; we can begin review.

It's obvious the developers and maintainers have done a ton of work.
Our goal here should be to help make sure everyone can appreciate how much work they have done, and how much functionality is packed into python-graphblas.

Along those lines, some notes for the review (none of these need to be addressed before we start):

NickleDave commented 1 year ago

@eriknw @jim22k, @SultanOrazbayev the tl;dr is that python-graphblas passed editor checks 🙂 🎉

Like I said above, I'll need to co-ordinate with @lwasser who is traveling about an editor for this review, but would expect us to reply back here by middle of next week at the latest

eriknw commented 1 year ago

Hooray! 🎉

Thanks @NickleDave. We appreciate the attention y'all are giving us, and thanks for telling us what (and when) to expect next. We're in no particular rush--it's more important to give the right people the right amount of time to do things right :)

NickleDave commented 1 year ago

it's more important to give the right people the right amount of time to do things right

🙌

finding the right people now! 🙂

NickleDave commented 1 year ago

Hi again @eriknw, @jim22k, @SultanOrazbayev -- just letting you know that I did have a chance to talk with @lwasser now that she has returned, and we are in the process of finding an editor

When you have a chance could you please (all) fill out the pre-review survey?
It's here: https://forms.gle/F9mou7S3jhe8DMJ16

We appreciate each maintainer of the package filling out this survey individually. 🙌 Thank you authors in advance for setting aside five to ten minutes to do this. It truly helps our organization. 🙌

(I know it's easy to miss in the template)

NickleDave commented 1 year ago

Hi @eriknw, @jim22k, @SultanOrazbayev, brief update:
very happy to inform you that @tomalrussell will be guest editor for this review! 🎉 NetworkX contributor, developer of spatial & network tools like snkit.
I will let @tomalrussell take it from here!

tomalrussell commented 1 year ago

Hi @eriknw, @jim22k, @SultanOrazbayev, and thanks to @NickleDave for the introduction.

I've reached out to potential reviewers, and incidentally look forward to taking a closer look at python-graphblas myself. I'll update here as soon as I can, definitely within a week.

tomalrussell commented 1 year ago

:wave: Hi @sneakers-the-rat and @szhorvat! Thank you for volunteering to review for pyOpenSci!

The following resources will help you complete your review:

  1. Here is the reviewers guide. This guide contains all of the steps and information needed to complete your review.
  2. Here is the review template that you will need to fill out and submit here as a comment, once your review is complete.

Please get in touch with any questions or concerns! Your review is due in three weeks: 29th March 2023

szhorvat commented 1 year ago

Hello everyone 👋

I'm excited to do this review and learn more about the GraphBLAS approach in general.

I plan to do the review gradually, and through continuous communication with the authors. I will make it clear when I consider the review to be completed. Feel free to respond to anything I might bring up before then. The same applies to the review checklist: I will post it below today, and will check off boxes gradually.

Any issues I open for python-graphblas will have a title prefixed with [pyos] and a link back here. You'll probably see me in the discussion forum as well, as I will likely need a bit of help while trying to solve a few toy problems with the library.

Expect comments mostly on mathematical aspects, correctness, docs, and usability from me. Hopefully other reviewers will cover the more technical aspects of Python.

I will aim to complete the review by April 2nd. Since the authors will have the opportunity to address concerns before then, I hope this little delay over the 3 week deadline will be fine. I will not be available during the week of the 20th.

Let me know if you'd like any changes to this arrangement—I can be flexible.

For transparency, I should note that I am involved with the igraph project (https://igraph.org/). igraph is not a competitor to python-graphblas, but it does have similar aims to NetworkX and, by extension, to graphblas-algorithms. I think this review will be a good opportunity for us to learn from each other.

szhorvat commented 1 year ago

Review is now complete.


Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

For packages also submitting to JOSS

Note: Be sure to check this carefully, as JOSS's submission requirements and scope differ from pyOpenSci's in terms of what types of packages are accepted.

The package contains a paper.md matching JOSS's requirements with:

Final approval (post-review)


Review Comments

szhorvat commented 1 year ago

@tomalrussell Could you please clarify if the authors are expected to follow all the above checklist items to the letter? (E.g., all mentioned badges mandatory, links to all "vignettes" from the README, all functions have examples of use, etc.)


@eriknw :

includes documentation with examples for all functions.

  • We're working on this! The C API and SuiteSparse:GraphBLAS C library are both well documented. We have a very large API surface area to cover, so "documentation with examples for all functions" is a really, really high bar, but one I hope we achieve someday :)

Indeed, the documentation of most Python projects won't give usage examples for every single function, and doing so would definitely be a lot of work. But just to show that it is possible, and often tremendously useful to users, I wanted to point to Mathematica's documentation where each function has not one, but many examples. See e.g. LinearSolve. I tried to follow the same with my IGraph/M package for Mathematica, but it's still a work in progress. R packages also often have at least one example for each function.

sneakers-the-rat commented 1 year ago

Indeed, the documentation of most Python projects won't give usage examples for every single function, and doing so would definitely be a lot of work.

From what I recall we had a conversation at some point about allowing the author to define what is intended as the public interface of the package and what isn't? but ya for packages that wrap another library it seems like a lot of extra work if, eg. there are examples from the main library that are trivially different (ie. could be inferred) from the wrapper's API.

sneakers-the-rat commented 1 year ago

I'll also be doing this JOSS-style, leaving this here and editing/raising issues as I go. I like @szhorvat 's idea of using an issue tag, so i'll also prefix mine with [pyos].

I'm happy to focus on more of the python implementation side of things, glad to have someone who's more adept with the math :).

I don't have any conflicts to declare, except that I'm going to be writing some triplet store code soon, but that doesn't really relate or create a conflict imo.

I don't have an expected completion date, but i have this in my calendar as a daily todo item and welcome being relentlessly pinged if i am the one holding us up :)

Review status

Package Review

Please check off boxes as applicable, and elaborate in comments below. Your review is not limited to these topics, as described in the reviewer guide

Documentation

The package includes all the following forms of documentation:

Readme file requirements The package meets the readme requirements below:

The README should include, from top to bottom:

NOTE: If the README has many more badges, you might want to consider using a table for badges: see this example. Such a table should be wider than it is high. (Note that a badge for pyOpenSci peer-review will be provided upon acceptance.)

Usability

Reviewers are encouraged to submit suggestions (or pull requests) that will improve the usability of the package as a whole. Package structure should follow general community best-practices. In general please consider whether:

Functionality

In trying to summarize what the functional claims of the software were, I think it basically boils down to

For packages also submitting to JOSS

N/A

Final approval (post-review)

Estimated hours spent reviewing: (final estimate) ~14h


Review

General Comments

python-graphblas is an exemplary package among scientific python packages, with excellent docs, tests, code quality, examples, and contributor engagement. The package is an opinionated wrapper and interface to GraphBLAS, which is well justified and differentiated from prior wrappers. Throughout the review process, the authors have been receptive to feedback and I have all faith that they will address any continuing suggestions I have in future development work. I have no hesitation recommending this package to anyone who wants to use GraphBLAS from Python.

My outstanding recommendations in the remaining open issues are all future suggestions that the authors can take or leave, none are mission-critical. I want to compliment the authors on this excellent work, I'm glad to have had a reason to have read it. I would be happy to respond to any questions the authors have about this review and otherwise continue to engage on open issues.

Code Quality

Library wrappers have their own sets of challenges and idioms, and python-graphblas handles them well with some room for future improvement. Wrapping code needs to make decisions about how to map the underlying API between languages and interact with the C library. The strategy used to abstract and expose the underlying library API can have large impacts on the maintainability and readability of the code, as can strategies for keeping the version of the wrapping package in sync with the changing API of the wrapped library.

python-graphblas further complicates these choices by admirably taking on the challenge of having multiple backend implementations (numpy and SuiteSparse:GraphBlas), and multiple export formats (NetworkX, Scipy.Sparse, PyData.Sparse, numpy, SuiteSparse, Matrix Market).

The python-graphblas team has chosen a bifurcated abstraction style which fits with the design of their wrapper. Rather than a transparent wrapper, the package introduces its own calling syntax to logically separate the i/o parts of a GraphBLAS call from the computation, which is well described and justified in the documentation.
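
For readers unfamiliar with the style, a minimal sketch of what this separation looks like in practice (constructor and operator names are illustrative and may not match the package exactly):

```python
import graphblas as gb

A = gb.Matrix.from_coo([0, 1], [1, 0], [1.0, 2.0], nrows=2, ncols=2)
B = gb.Matrix.from_coo([0, 1], [0, 1], [3.0, 4.0], nrows=2, ncols=2)
C = gb.Matrix(float, nrows=2, ncols=2)

# Step 1: describe the computation; this builds a lightweight expression
# object and performs no work yet.
expr = A.mxm(B, gb.semiring.plus_times)

# Step 2: describe the output side (target, accumulator, optional mask) and
# trigger a single GraphBLAS call by assigning with <<.
C(accum=gb.binary.plus) << expr

# Alternatively, materialize into a new object without an explicit target.
D = expr.new()
```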

At the time of reviewing,

This structure makes for a nice user-facing interface, being able to refer to classes and methods in a natural way (eg. gb.Vector, gb.unary.abs), obscuring the underlying code structure.

The tradeoff is the significant amount of indirection and implicitness in the code which presents a relatively high barrier to entry for new contributors. The mappings between the suitesparse GraphBLAS implementation and Python are programmed in several places, eg. within a method and then again in the tests, which the authors describe as being useful as a consistency check. Having some parts of the library dependent on import time side-effects is less than optimal and makes the code more difficult to reason about, but is certainly not fatal to the usability of the package. Again, the nature of a wrapping package requires making decisions about abstraction, so the only concern I have for the current code structure is the impact on maintainability. The current maintainers seem to have no trouble reasoning about the package, and I believe they are aware of these challenges and are actively working on them (they refactored a formerly massive operator file during the course of the review (https://github.com/python-graphblas/python-graphblas/pull/420)). The maintainers can do themselves and future contributors a favor by writing some additional developer docs that explain the structure of the library in greater detail, and I believe they will!

The other question I had was about the relationship between versioning in the API implementation in Python and in the underlying C library. GraphBLAS is a relatively mature API and seems to be only rarely changed, with care for backwards compatibility, so this is less of a concern than for a wrapper around a more actively evolving API. The authors have chosen to formally support a specific version of GraphBLAS ( https://github.com/python-graphblas/python-graphblas/pull/441 ) rather than build version-compatibility infrastructure within the package, which seems like an appropriate decision to me, given their other comments on how refactoring their backend system is on their development roadmap.

I want to also emphasize several of the "nice to have" features in the package, including a very impressive Recorder class that can keep a record of all the GraphBLAS calls made for debugging and reproducibility, automatic compilation of user-supplied functions including lambdas using numba, and the excellent i/o module. These indicate the authors are actively invested in user experience above and beyond the tidy API they expose.
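
A brief sketch of the Recorder in use (the exact Recorder interface and io function names here are my assumptions from the docs):

```python
import graphblas as gb

v = gb.Vector.from_coo([0, 1, 3], [1, 2, 3], size=5)

# Record the raw GraphBLAS calls issued while evaluating an expression.
with gb.Recorder() as rec:
    w = v.apply(gb.unary.ainv).new()   # elementwise additive inverse
print(rec)   # shows the sequence of GrB_* calls, useful for debugging/reproducibility

# The io module converts to and from common ecosystem formats, along the lines of
# (names assumed from the docs):
#   s = gb.io.to_scipy_sparse(M, "csr")
#   M = gb.io.from_networkx(nx_graph)
```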

Aside from the above comments, the fundamentals of the package design are strong: modern packaging with setuptools, automated code quality checks, CI, and coverage reports. The decisions made in code structure seem principled, responsive to the constraints of the problem, and result in a very usable user interface - my compliments to the Authors.

Docs

The package is well documented from an introduction to the problem that GraphBLAS attempts to solve through package design decisions and practical examples. It is suitable for a general audience with some exposure to graph operations and programming in python, which is impressive given the highly specialized nature of the library.

Future suggestions for the authors include embedding their example notebooks in the documentation and improving the API documentation. Currently, I assume due to some of the abstraction decisions made in the rest of the package, there is not comprehensive documentation of every operation available to the user of the kind that might otherwise be produced by calling autodoc over a class or a module, but some common operators are listed here, and the list of available operators is present in the GraphBLAS documentation as well as through object inspection within an interactive Python session. Given the nature of wrapping code, where the underlying operations are well documented elsewhere, this is less of a problem than it would be in other packages.

Altogether the docs are excellent with several clear points of improvement, but far above average in the landscape of scientific python packages.

Tests

Who among us can claim 100% test coverage? https://coveralls.io/github/python-graphblas/python-graphblas

The tests are well organized and comprehensive, and I was able to find a corresponding test for every package feature I looked at easily. In more security-sensitive contexts one would want to do more adversarial input fuzzing, but I don't think that's all that relevant here since I've never seen graph data analysis libraries used as a malware vector. I have no notes on the tests, this is good work.

Issues Opened:

tomalrussell commented 1 year ago

Thanks both, great start 💯 - I'll aim to check in occasionally or as needed.

Could you please clarify if the authors are expected to follow all the above checklist items to the letter? (E.g., all mentioned badges mandatory, links to all "vignettes" from the README, all functions have examples of use, etc.)

I would comment on anything you notice and let the authors respond, we can always exercise judgment if "to the letter" seems unhelpful.

lwasser commented 1 year ago

hey y'all. We normally prefer that reviews happen "all at once", in the sense that we'd prefer the text of the review NOT to change once submitted and the conversation to happen after. I have reached out to JOSS about how they implement their reviews, but I don't want to change our policy on how reviews happen (submitted all at once) until i've spoken to JOSS. @sneakers-the-rat @szhorvat it's fine if you want to leave the review text and check things off and open issues as you go, but i prefer that the text of the review be added all at once to avoid any confusion regarding when your review is complete and what the maintainer of the package should focus on. Many thanks for understanding. I will update once i hear back from JOSS but i don't want to modify our process on the fly until we've thought things through more completely.

many thanks for your time y'all!

lwasser commented 1 year ago

one other note - i think it's great to open issues as you go but one other element that is important is documentation of what changed and why so there is a full record of the review so please if you open issues be sure to reference them in the text of the review in the context of why you opened them. that will allow the editor (and us as an organization) to keep track of the review in one place. again many thanks! we are learning as an organization in this process

sneakers-the-rat commented 1 year ago

fair enough @lwasser :) typically the way it works at JOSS is that you'll be opening issues on the repo and then linking to the review issue (this one) so that basically the review issue serves as a timeline of changes and discussion for the review (all edits to the review checklist comment are also logged). Happy to also do the inverse (link back to opened issues) as well.

So - for this review, will only post text of review when checklist finished, is that what you had in mind?

szhorvat commented 1 year ago

Some of the review checklist items feel a bit R-ish and I wonder if these are intentional, or they just haven't been updated yet after borrowing from rOpenSci. Is the term "vignette" used in the context of Python? Examples for each function are not typical for Python either (though I have to say that in my personal opinion it wouldn't be bad to borrow this habit from R 😉 )

NickleDave commented 1 year ago

I wonder if these are intentional.

Yes. If you prefer the term "tutorial", then feel free to use that.

though I have to say that in my personal opinion it wouldn't be bad to borrow this habit from R

Exactly.

For example see discussion of vignettes here on PyGMT: https://github.com/pyOpenSci/software-submission/issues/43#issuecomment-994023562

Or discussion of examples here on the jointly review: https://github.com/pyOpenSci/software-submission/issues/45#issuecomment-1001172229

NickleDave commented 1 year ago

Happy to also do the inverse (link back to opened issues) as well

@sneakers-the-rat if you can also link in comments that just helps make sure we don't miss it

So - for this review, will only post text of review when checklist finished, is that what you had in mind?

yes please

@szhorvat @sneakers-the-rat the checklists and workflow we have are fairly well established and up to date. They're designed in part so that @lwasser can parse the reviews in an automated way.

We're of course happy to learn from other communities-- @lwasser frequently talks with Arfon from JOSS, for example--but please let's try to follow the workflow we have in place for now.

I realize from how this review has started we probably can do more to make the workflow clearer. We're happy to hear feedback after the review on how we can improve the workflow and make it clearer--we can discuss in Slack or in the forum.

Thank you!

szhorvat commented 1 year ago

@NickleDave Although I did not respond in words (just a 👍 ), I noted the comment from @lwasser and I am happy to follow (I believe I am following) the process. If I missed anything, please point it out explicitly, and I'll do my best to correct it.

NickleDave commented 1 year ago

Thank you @szhorvat maybe I shouldn't comment on GitHub issues when I wake up grumpy :flushed:
I do hope the previous reviews linked will help though--we can probably add links to those in reviewer docs.

lwasser commented 1 year ago

thank you all! @sneakers-the-rat @szhorvat we are super happy to discuss this more (maybe we could do this in discourse as an open spot to discuss?). As @NickleDave said, we want to learn from other communities, however it's also important that our review approach is consistent. As such we want to discuss as a community before making changes to our processes for review. I hope that makes sense. I am looking forward to chatting more!! And appreciate everyone's time here on this review!

sneakers-the-rat commented 1 year ago

@eriknw - clarification q on your intended audience statement above for the sake of providing feedback on the docs: are you expecting people to have prior experience with graphBLAS? or is the intention to be able to bring someone with ~reasonable understanding of sparse data or graphs up to speed as well? Would be happy to be the latter kind of reader and let you know what parts are clear/unclear if that's the intention, but if the expectation is for someone to already understand graphBLAS i'll spare u.

edit: eg. this - https://python-graphblas.readthedocs.io/en/stable/getting_started/primer.html and this - https://python-graphblas.readthedocs.io/en/stable/user_guide/fundamentals.html#c-to-python-mapping are a little uneven as far as expected expertise imo and i'd be happy to help smooth that out (but i appreciate the work y'all have done on the docs here)

lwasser commented 1 year ago

FWIW - i think that it could be nice if the overview provides enough context for a user less experienced with graphBLAS to follow. it might welcome more users. for instance i was unfamiliar with the term "sparse matrix". then i looked it up and quickly realized i've used that kind of data structure before. so i was familiar with the data type but not the word. of course @eriknw knows best what is feasible here in terms of helping inform users of the package's goals.

my two cents :)

eriknw commented 1 year ago

are you expecting people to have prior experience with graphBLAS? or is the intention to be able to bring someone with ~reasonable understanding of sparse data or graphs up to speed as well?

Yeah, I think we definitely want to support both, and I know there's work to do on the docs, so all feedback is appreciated. It would be great if we had documentation at an even higher level too, such as example applications that data scientists can use without needing to know much about graphs or sparse data.

My personal priority right now is to improve maintenance documentation so that maintenance can continue smoothly if I disappeared for a time (for any reason). My second documentation priority is to establish best practices for how to add examples to docstrings. We have other maintainers and contributors who have improved and plan to continue improving documentation.
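
For concreteness, the kind of numpydoc-style docstring pattern we might standardize on could look like the sketch below (the `scale` helper is purely hypothetical, shown only to illustrate the style):

```python
import graphblas as gb


def scale(v, factor):
    """Return a new Vector with every stored value multiplied by ``factor``.

    Examples
    --------
    >>> import graphblas as gb
    >>> v = gb.Vector.from_coo([0, 2], [1, 3], size=4)
    >>> scale(v, 2).to_coo()  # doctest: +SKIP
    (array([0, 2]), array([2, 6]))
    """
    # Bind the scalar to the binary op and materialize the result.
    return v.apply(gb.binary.times, right=factor).new()
```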

of course @eriknw knows best what is feasible here in terms of helping inform users of the package's goals.

I think you're giving me (and engineers in general) too much credit! I/we are sometimes much too close to engineering problems to remember the bigger picture of "who cares and why?". I would like for us to pivot to a "product-led growth" mindset/orientation this year now that much of the hard engineering is done.

lwasser commented 1 year ago

@eriknw thank you for this feedback. so it sounds like it might be useful for @sneakers-the-rat @szhorvat to provide feedback from their perspective about what is confusing in the documentation. (such great questions you both asked erik!!) i wonder if there is some way for us to also help you with revisions if you need that support given your focus on maintenance docs (which i agree is profoundly important for long term maintenance). it's something that i've been thinking about, but we don't have a huge team to support that just yet...

tomalrussell commented 1 year ago

Hi all - just checking in with @sneakers-the-rat and @szhorvat - thanks for the engagement, discussion and issues opened so far. Are you both on track to submit your initial reviews by the 29th?

(For context - the pyopensci process aims for 3 weeks for review, 2 weeks for subsequent changes, and 1 week for reviewer approval of changes.)

szhorvat commented 1 year ago

I arrived back on Sunday and immediately came down with a bad flu, so I won't be able to finish by tomorrow. If I recover in time, I still hope to finish by Sunday.

tomalrussell commented 1 year ago

Hope you feel better soon @szhorvat! No pressure from my end, just helping to set expectations.

sneakers-the-rat commented 1 year ago

sorry the 29th has come and gone, i've been in a bit of a writing hole, just about to finish a draft of something this week and then was planning on finishing review next week!!!!

sneakers-the-rat commented 1 year ago

sorry again, still writing, just pinging to say i'm still here and that i'm going to put some time in on this right now!

sneakers-the-rat commented 1 year ago

Alright I made it through the checklist, there are some straggling thoughts in the remaining open issues. We can hash those out there and then I'll write my final review :)

szhorvat commented 1 year ago

Apologies about the long disappearance. I have not forgotten about the review. A series of unexpected events prevented me from completing it. I will be able to get back to it on May 1st Monday and hopefully complete it that week.

eriknw commented 1 year ago

I'd like to share most of the PRs that the pyOpenSci reviews have strongly influenced:

Since the review began, we have also:

The reviewers have also indicated where we can/should refactor code (operators!) to make it more understandable and maintainable, and specific suggestions for how to further improve documentation. I expect implementing these changes will keep us busy for most of the year.

@sneakers-the-rat @szhorvat if there are any specific issues you would like us to address before you finish your reviews (or as conditions for acceptance), please let us know and please be specific. There are so many issues, checklists, and discussions for us to read that it's hard to know exactly where things stand (but the reviews have been great--thank you!). For example, I think a specific enhancement that y'all probably really, really want is for the operator objects to be listed in the API reference of the documentation (definitely a good idea). Should we add this "quick and ugly" right now, then redo it later in the year after we refactor operators to be more declarative, or just do it later after the refactor?

sneakers-the-rat commented 1 year ago

checking in to say sorry I have been AWOL, have not been feeling well the past several weeks, but am back and work and plan on finishing my review tomorrow or Monday -- I have no outstanding requested changes that need to be immediate (and if it isn't/wasn't clear I don't believe in a review process where it is possible to "fail" the review except by nonparticipation, just to take that idea off the table to the degree that it even applies here)

sneakers-the-rat commented 1 year ago

For example, I think a specific enhancement that y'all probably really, really want is for the operator objects to be listed in the API reference of the documentation (definitely a good idea). Should we add this "quick and ugly" right now, then redo it later in the year after we refactor operators to be more declarative, or just do it later after the refactor?

specifically on this - I would not want to encourage you to do anything that is quick and dirty or otherwise not part of your normal planned development process. I trust y'all at this point to make recommended changes as you are able to, and given that there are several ways to list those functions/methods/classes as is, and the docs are to some degree intertwined with the graphviz docs, I don't think their immediate absence is disqualifying to the "all user-facing functions documented" check.

eriknw commented 1 year ago

Thanks for checking in @sneakers-the-rat, and I hope you (and everyone) are feeling better.

I realize the timing on this review has been a bit... flexible, that March 29 (https://github.com/pyOpenSci/software-submission/issues/81#issuecomment-1458089302) has come and gone, and we're pretty far off the six week schedule that pyopensci aims for (https://github.com/pyOpenSci/software-submission/issues/81#issuecomment-1484897108).

If possible, can we refocus and try to get this review finished-finished before the SciPy conference, July 10th?

We've gotten a lot of value from this review already, and I appreciate all the time and effort the volunteers have spent. Thanks.

tomalrussell commented 1 year ago

Hi @eriknw thanks for your patience, I'm glad you've found value in the process so far, and thanks for the call to action!

I'm going to suggest a refocussed timeline, given the process has already had some time for review-in-progress and response:

sneakers-the-rat commented 1 year ago

SORRY YES i am still here i am just trying to catch up at work. i can do the 21st for sure. i will shoot for tomorrow.

sneakers-the-rat commented 1 year ago

TODAY is the 21st and i am completing it TODAY

sneakers-the-rat commented 1 year ago

Here are my final review comments, I'll also add them into my review above. Thanks again to the authors for their work here, it was a pleasure reading the package <3

Review

General Comments

python-graphblas is an exemplary package among scientific python packages, with excellent docs, tests, code quality, examples, and contributor engagement. The package is an opinionated wrapper and interface to GraphBLAS, which is well justified and differentiated from prior wrappers. Throughout the review process, the authors have been receptive to feedback and I have all faith that they will address any continuing suggestions I have in future development work. I have no hesitation recommending this package to anyone who wants to use GraphBLAS from Python.

My outstanding recommendations in the remaining open issues are all future suggestions that the authors can take or leave, none are mission-critical. I want to compliment the authors on this excellent work, I'm glad to have had a reason to have read it. I would be happy to respond to any questions the authors have about this review and otherwise continue to engage on open issues.

Code Quality

Library wrappers have their own sets of challenges and idioms, and python-graphblas handles them well with some room for future improvement. Wrapping code needs to make decisions about how to map the underlying API between languages and interact with the C library. The strategy used to abstract and expose the underlying library API can have large impacts on the maintainability and readability of the code, as can strategies for keeping the version of the wrapping package in sync with the changing API of the wrapped library.

python-graphblas further complicates these choices by admirably taking on the challenge of having multiple backend implementations (numpy and SuiteSparse:GraphBlas), and multiple export formats (NetworkX, Scipy.Sparse, PyData.Sparse, numpy, SuiteSparse, Matrix Market).

The python-graphblas team has chosen a bifurcated abstraction style which fits with the design of their wrapper. Rather than a transparent wrapper, the package introduces its own calling syntax to logically separate the i/o parts of a GraphBLAS call from the computation, which is well described and justified in the documentation.

At the time of reviewing,

This structure makes for a nice user-facing interface, being able to refer to classes and methods in a natural way (eg. gb.Vector, gb.unary.abs), obscuring the underlying code structure.

The tradeoff is the significant amount of indirection and implicitness in the code which presents a relatively high barrier to entry for new contributors. The mappings between the suitesparse GraphBLAS implementation and Python are programmed in several places, eg. within a method and then again in the tests, which the authors describe as being useful as a consistency check. Having some parts of the library dependent on import time side-effects is less than optimal and makes the code more difficult to reason about, but is certainly not fatal to the usability of the package. Again, the nature of a wrapping package requires making decisions about abstraction, so the only concern I have for the current code structure is the impact on maintainability. The current maintainers seem to have no trouble reasoning about the package, and I believe they are aware of these challenges and are actively working on them (they refactored a formerly massive operator file during the course of the review (https://github.com/python-graphblas/python-graphblas/pull/420)). The maintainers can do themselves and future contributors a favor by writing some additional developer docs that explain the structure of the library in greater detail, and I believe they will!

The other question I had was about the relationship between versioning in the API implementation in Python and in the underlying C library. GraphBLAS is a relatively mature API and seems to be only rarely changed, with care for backwards compatibility, so this is less of a concern than for a wrapper around a more actively evolving API. The authors have chosen to formally support a specific version of GraphBLAS ( https://github.com/python-graphblas/python-graphblas/pull/441 ) rather than build version-compatibility infrastructure within the package, which seems like an appropriate decision to me, given their other comments on how refactoring their backend system is on their development roadmap.

I want to also emphasize several of the "nice to have" features in the package, including a very impressive Recorder class that can keep a record of all the GraphBLAS calls made for debugging and reproducibility, automatic compilation of user-supplied functions including lambdas using numba, and the excellent i/o module. These indicate the authors are actively invested in user experience above and beyond the tidy API they expose.

Aside from the above comments, the fundamentals of the package design are strong: modern packaging with setuptools, automated code quality checks, CI, and coverage reports. The decisions made in code structure seem principled, responsive to the constraints of the problem, and result in a very usable user interface - my compliments to the Authors.

Docs

The package is well documented from an introduction to the problem that GraphBLAS attempts to solve through package design decisions and practical examples. It is suitable for a general audience with some exposure to graph operations and programming in python, which is impressive given the highly specialized nature of the library.

Future suggestions for the authors include embedding their example notebooks in the documentation and improving the API documentation. Currently, I assume due to some of the abstraction decisions made in the rest of the package, there is not comprehensive documentation of every operation available to the user of the kind that might otherwise be produced by calling autodoc over a class or a module, but some common operators are listed here, and the list of available operators is present in the GraphBLAS documentation as well as through object inspection within an interactive Python session. Given the nature of wrapping code, where the underlying operations are well documented elsewhere, this is less of a problem than it would be in other packages.

Altogether the docs are excellent with several clear points of improvement, but far above average in the landscape of scientific python packages.

Tests

Who among us can claim 100% test coverage? https://coveralls.io/github/python-graphblas/python-graphblas

The tests are well organized and comprehensive, and I was able to find a corresponding test for every package feature I looked at easily. In more security-sensitive contexts one would want to do more adversarial input fuzzing, but I don't think that's all that relevant here since I've never seen graph data analysis libraries used as a malware vector. I have no notes on the tests, this is good work.

szhorvat commented 1 year ago

Thanks for the patience everyone.

tl;dr This is a very nice package, technically sound, with a well-thought-out Pythonic interface. My perception was that in order to bring GraphBLAS to Python users (and thus fully realize its promise of implementing graph algorithms in terms of high-level building blocks), what we need most is Python-centric documentation and training materials.

Introduction

First of all, let me say that python-graphblas is a well-thought-out and technically solid package. It provides a Pythonic interface to the C-based GraphBLAS API, currently supporting SuiteSparse:GraphBLAS (which is the only usable GraphBLAS implementation available at the moment).

The review should concern python-graphblas, but it's impossible to talk about it without also discussing GraphBLAS itself. And here I must make it clear that I am a newcomer to GraphBLAS. GraphBLAS is a standardized C API aiming to provide high-level yet general building blocks for graph algorithms, similarly to how BLAS and LAPACK provide building blocks for linear algebra. It does this through some very elegant math, employing the language of linear algebra and abstract algebra, generalizing the usual (+, *) operations to a wider class of semirings. For example, with the usual (+, *) matrix product, the kth power of a graph's adjacency matrix gives the number of length-k walks between any two vertices. Replacing + by min and * by + gives shortest path lengths up to k instead. What is not yet clear to me is where the limits of this approach are: Are there common graph concepts that cannot be expressed in this framework? Are there some which can be expressed, but whose most efficient algorithms can't be expressed with GraphBLAS?
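
(To make this concrete, here is roughly how the two products read in python-graphblas; I am taking the constructor and semiring names from the docs, so treat the exact spellings as assumptions.)

```python
import graphblas as gb

# Toy weighted digraph: 0 -> 1 (1.0), 1 -> 2 (2.0), 0 -> 2 (5.0)
A = gb.Matrix.from_coo([0, 1, 0], [1, 2, 2], [1.0, 2.0, 5.0], nrows=3, ncols=3)

# Usual (+, *) product: entries of A*A combine the weights along all
# length-2 walks (with unit weights this counts the walks).
walks2 = A.mxm(A, gb.semiring.plus_times).new()

# (min, +) product: entries give the shortest path length using exactly
# two edges, e.g. dist2[0, 2] == 3.0 via the 0 -> 1 -> 2 path.
# (To allow paths of up to k edges, add explicit 0s on the diagonal.)
dist2 = A.mxm(A, gb.semiring.min_plus).new()
```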

Documentation and target audience

python-graphblas realizes one of the major advantages of the GraphBLAS approach: it is now possible to implement graph algorithms in a convenient and high-level language like Python without the need for explicit loops, and thus without a significant loss of performance. Indeed, the python-graphblas authors have also created the graphblas-algorithms package, which achieves performance comparable to C/C++ libraries while being written in pure Python. (To be fair, it's not clear to me how much of the performance comes from parallelization, but the benchmarks are impressive.)

This brings me to a point about how python-graphblas is presented. This is the first paragraph in the docs:

python-graphblas is a Pythonic interface to the highly performant SuiteSparse:GraphBLAS library for performing graph analytics in the language of linear algebra.

This gives the impression that python-graphblas is just an interface to a C library, perhaps even one aimed at people who already understand that library. I think this is backwards: if one of the main GraphBLAS advantages is usability from high-level languages, then the target audience should be users of such languages. It seems to me that a project like this should spend at least as much effort on good documentation and training materials as on technical bindings. And this is precisely the weak point of python-graphblas. It is impossible to learn the system without referring to external material that describes the C API. In fact, the competing package pygraphblas seems to be doing a little bit better on this front, though still not well enough for users to be able to avoid C-based documentation.

My main recommendation---and I realize that this is a long-term project---is making significant improvements to the documentation with the pure-Python user in mind.

In the intro:

More in depth:

Interoperability

In addition to the technically sound and well thought out GraphBLAS interface, the package contains functionality for interoperability with graph theory / network analysis tools, as well as some visualization tools. The quality of these is in need of some improvement:

I opened some visualization issues: https://github.com/python-graphblas/python-graphblas/issues/473, https://github.com/python-graphblas/python-graphblas/issues/474, https://github.com/python-graphblas/python-graphblas/issues/475

All this said, interoperability and visualization are not core functionality. As I see it, they are there for convenience, and the package should not be judged based on them.

tomalrussell commented 1 year ago

Thanks for the above, @sneakers-the-rat and @szhorvat !

@eriknw @jim22k @SultanOrazbayev - recognising that you've been engaged with the review already, and that some things may be pushed to a longer timeframe, are you happy to respond and make any priority changes by the end of the week (30th June)?

eriknw commented 1 year ago

I applaud the reviewers 👏 . I never expected such thorough and honest reviews. Thank you for the positive feedback and the criticisms. I agree with all of it--I think you nailed both the strengths and the weaknesses. The reviews are valuable, and I suspect they will help shape our vision and effort for the next couple of years.

In particular, there are two main areas we need to give more attention:

Heh, this is probably true for many projects. My main focus in the near and medium term will be maintainability.

Now if I may ramble on for a bit...

It's interesting that reviews occur at a specific moment in time. I know the history of python-graphblas, and, trust me, it has evolved significantly in every 6 month period of its 3.5 year lifetime. If we had been writing pristine user-facing documentation from the beginning, it would have needed to be rewritten and revised endlessly. Functionality would probably be 1.5-2 years behind where we are today. If you're curious, go back and look at versions around 1.3.8 to 1.3.14. It's recognizable, but so different and missing so, so much!

Anyway. I want to highlight this comment:

if one of the main GraphBLAS advantages is usability from high-level languages, then the target audience should be users of such languages

Absolutely! I agree 100%. We aspire to this. It will take time.

From a product perspective, I/we wanted to get the syntax and functionality stable enough to begin writing graphblas-algorithms in earnest. We are targeting networkx users and are adding dispatching to networkx.

Oh, and if you're curious why our test coverage is so high, it's for multiple reasons:

Wrapping up... I think we have replied to all open issues from the review. They will keep us busy, that's for sure.

@jim22k @SultanOrazbayev want to say anything else? I don't think it's necessary for you to comment here.

Thanks again all, hope to see you around ❤️ !

tomalrussell commented 1 year ago

Thanks @eriknw (also for the lovely tone and for the background and bit of history)!

@szhorvat and @sneakers-the-rat for complete clarity, can you confirm you're happy with responses?

szhorvat commented 1 year ago

Yes. The suggestions I made are mostly for the long term.

sneakers-the-rat commented 1 year ago

Same, full approval from me :)

tomalrussell commented 1 year ago

Thanks all, time and effort very much appreciated. The review process is done!

All that's left is to wrap up, publish the version of record and acknowledge all your contributions.


🎉 python-graphblas has been approved by pyOpenSci! Thank you @eriknw for submitting python-graphblas and many thanks to @szhorvat and @sneakers-the-rat for reviewing this package! 😸

Author and Reviewer Wrap-Up Tasks

There are just a few things left to do to wrap up this submission, @eriknw, @jim22k, @SultanOrazbayev:

Both reviewers and maintainers (@sneakers-the-rat, @szhorvat too):

Editor Final Checks

Please complete the final steps to wrap up this review. @tomalrussell, please do the following:


If you have any feedback for us about the review process please feel free to share it here. We are always looking to improve our process and documentation in the peer-review-guide.