openjournals / joss-reviews

Reviews for the Journal of Open Source Software

[REVIEW]: retworkx: A High-Performance Graph Library for Python #3968

Closed whedon closed 1 year ago

whedon commented 2 years ago

Submitting author: @IvanIsCoding (Ivan Carvalho)
Repository: https://github.com/Qiskit/rustworkx
Branch with paper.md (empty if default branch): joss-paper
Version: 0.12.0
Editor: @drvinceknight
Reviewers: @szhorvat, @inakleinbottle
Archive: 10.5281/zenodo.7158473

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/d88f8b00bb1c5dba0310a8e41a651732"><img src="https://joss.theoj.org/papers/d88f8b00bb1c5dba0310a8e41a651732/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/d88f8b00bb1c5dba0310a8e41a651732/status.svg)](https://joss.theoj.org/papers/d88f8b00bb1c5dba0310a8e41a651732)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@szhorvat & @inakleinbottle, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @drvinceknight know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest.

Review checklist for @szhorvat

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @inakleinbottle

✨ Important: Please do not use the Convert to issue functionality when working through this checklist, instead, please open any new issues associated with your review in the software repository associated with the submission. ✨

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 2 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @szhorvat, @inakleinbottle it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that with GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 2 years ago

PDF failed to compile for issue #3968 with the following error:

 Can't find any papers to compile :-(
whedon commented 2 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.33 s (820.7 files/s, 140199.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                         104           2744           2774          16207
Rust                            49           1184           6003          12766
YAML                            86            189              9           1827
reStructuredText                 7            220            340            684
Markdown                         8            151              0            505
HTML                             3             42              5            281
CSS                              3             20              8            206
INI                              1              9              0             74
TOML                             4             12              0             71
Bourne Shell                     4             15             28             40
JSON                             1              0              0             34
Jinja Template                   1              0              0              8
TeX                              1              0              0              8
-------------------------------------------------------------------------------
SUM:                           272           4586           9167          32711
-------------------------------------------------------------------------------

Statistical information for the repository 'c54d7fecb53240c22d5bf27c' was
gathered on 2021/12/01.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Bhaavan Goel                     3           329              0            1.10
Hoss Ajallooiean                 1           112              0            0.37
Ivan Carvalho                    9          4090           2211           21.06
Jielun (Chris) Chen              2          1198              2            4.01
John Lapeyre                     1           247              0            0.83
Lauren Capelluto                 1            73              0            0.24
Matthew Treinish               182         13707            707           48.18
MoAllabbad                       2           107              0            0.36
Nahum Rosa Cruz Sa               6          2655            809           11.58
Salvador de la Puent             1            22              2            0.08
Seaton Ullberg                   1            10              0            0.03
Soham Pal                        1            79              0            0.26
dependabot[bot]                  1             3              3            0.02
gadial                           2            67              2            0.23
georgios-ts                     20          3123            361           11.64

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Bhaavan Goel                303           92.1         12.8               10.89
Hoss Ajallooiean             88           78.6         12.6               12.50
Ivan Carvalho              4072           99.6          6.4                1.87
Jielun (Chris) Chen        1256          104.8          6.4                3.66
John Lapeyre                234           94.7          3.2                9.83
Lauren Capelluto             44           60.3         21.6               34.09
Matthew Treinish          10897           79.5          9.6                8.66
MoAllabbad                  106           99.1         13.4                9.43
Nahum Rosa Cruz Sa         1809           68.1          5.1                5.53
Salvador de la Puent          8           36.4          9.4                0.00
Seaton Ullberg                5           50.0         13.3                0.00
Soham Pal                    79          100.0          1.6               13.92
gadial                       23           34.3         19.3               13.04
georgios-ts                2801           89.7          4.1                5.89
szhorvat commented 2 years ago

Since JOSS reviews are so different from the typical peer review, I'd like to let you know in advance how I plan to proceed, just to make sure we are all on the same page. Editors, authors, please let me know if I should change this approach.

I will prefix the titles of issues I open in the retworkx repository with [joss]. The fact that I opened an issue doesn't mean that it must be fixed before I would recommend the paper for publication. It will sometimes be meant as a discussion starter, to help me understand the library and its limitations. I understand that not every suggestion I might make would be reasonable to implement before publication. I would however appreciate responses to all the issues I open.

I will post things that concern only the paper (and not the library) directly here. The same with things that I believe concern the editors.

I will make it clear here when I consider the review completed. I won't be able to do it in a single go. Instead, I expect to have an ongoing conversation between the reviewers and authors for a couple of weeks. I will sometimes wait for answers to my comments/questions before proceeding. This is a large library and it will take time to get to know it.

This is the first time I have written such an introduction. I decided to do it because it has happened in the past that people seemed to have differing ideas about how reviews are carried out, and communication wasn't as effective as it could have been.

szhorvat commented 2 years ago

And thus we arrive at my first comment. This checklist item isn't fulfilled at the moment:

  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Can you please let me know how I can seek support? I expect I will have lots of questions which you may not want to see in the issue tracker (not bug reports or feature suggestions). Please also make sure that the proper support channels are clearly communicated to users.

mtreinish commented 2 years ago

@whedon generate pdf from branch joss-paper

whedon commented 2 years ago
Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...
whedon commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

mtreinish commented 2 years ago

And thus we arrive at my first comment. This checklist item isn't fulfilled at the moment:

  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Can you please let me know how I can seek support? I expect I will have lots of questions which you may not want to see in the issue tracker (not bug reports or feature suggestions). Please also make sure that the proper support channels are clearly communicated to users.

We have a public slack channel as well as an IRC channel on oftc (the irc channel is empty except for me right now). I'll push a PR that adds a section to the bottom of the readme shortly that mentions these.

mtreinish commented 2 years ago

I just remembered we do briefly mention the slack channel in our contributing documentation already: https://github.com/Qiskit/retworkx/blob/main/CONTRIBUTING.md#module-directories-when-a-single-file-is-not-enough (which also gets put in our published documentation: https://qiskit.org/documentation/retworkx/CONTRIBUTING.html?highlight=slack#module-directories-when-a-single-file-is-not-enough) but I think that should still probably be mentioned in the README as well.

mtreinish commented 2 years ago

I opened the PR for this here: https://github.com/Qiskit/retworkx/pull/501

szhorvat commented 2 years ago

@mtreinish Thanks for the response. I will join the Slack space, but I would prefer to also have the option of using some asynchronous communication channel that's public and keeps a record of the conversation (for the sake of the review). Is there a place where I can ask my questions that way? Should I just use the issue tracker? Have you considered GitHub Discussions? It's just an idea.

I just remembered we do briefly mention the slack channel in our contributing documentation already

The support channels should be communicated clearly to normal users, not just would-be contributors. The README is a much better place for this. The PR looks good!

kthyng commented 2 years ago

@szhorvat If I follow your comments correctly, I think you should feel free to use the issue tracker for your review comments. (I'm the rotating associate editor in chief for the week and trying to answer questions.)

szhorvat commented 2 years ago

For reference, the paper is here on arXiv: https://arxiv.org/abs/2110.15221

szhorvat commented 2 years ago

Regarding the name retworkx: it obviously references networkx, which is by far the most popular graph library in Python. People are going to assume some kind of relationship, but I don't see what it is. Can you please clarify?

The API is different (not a replacement), the projects appear unrelated, and retworkx doesn't appear to be closer to networkx than any of the other Python network libraries. I am wondering if you have contacted the networkx project regarding the naming, and if they are happy with it.

I am not sure if this is within the scope of a JOSS review, but I thought it important to mention.

szhorvat commented 2 years ago

I have now read the software paper, and I believe it needs a bit of improvement. In my view, the purpose of a JOSS paper is to let readers get answers to questions such as:

As the paper currently stands, these questions are not fully answered.

Statement of need

Here, the authors state their own need, when working on the Qiskit project (networkx was used originally, but it was too slow, so retworkx was created to replace it). I would expect to read about the target audience's need. The broader target audience was set in the abstract: retworkx is "a flexible graph and network analysis library". I believe that the majority of people who are interested in such libraries will not have heard of Qiskit, thus talking about the needs of Qiskit will not be meaningful to them.

In short: What so-far unmet need does retworkx aim to fulfil in the field of graph and network analysis?

Related work

In this section, the authors list a few (but not all) alternative libraries, and simply state that they were not suitable for the needs of Qiskit. Some of the issues are:

I should note that the landscape of graph libraries is very crowded in Python. It might seem like it is hard to stand out. Yet, as a co-developer of igraph, it is quite clear to me that it is nearly impossible to create a truly all-purpose graph library. There will always be tradeoffs, and between the many library choices there are gaps to fill in. But which of the gaps does retworkx fit into? I cannot tell after reading the paper.

To give some examples of what could be discussed, some libraries are targeted at graph theory researchers (mathematics). SageMath should be mentioned here. Some others (including most of those mentioned in this section) are mainly for complex network analysis, a very different field. In both of these fields, users are likely to work with static graphs and use existing algorithms rather than implement new ones. In contrast, some libraries strive to provide a data structure for implementing efficient graph algorithms. There are many possible representations of graphs on a computer, all coming with their own tradeoffs: e.g. efficient modification is in conflict with efficient graph traversal. Some libraries even provide multiple underlying representations in an attempt to deal with this.

Please clarify these questions in this section. SageMath and NetworKit should be mentioned. I would like to bring the authors' attention to LEMON graph as well, which stands apart from most other libraries with its focus on optimization and enabling the implementation of efficient algorithms. LEMON is C++ but has a (now abandoned?) Python interface. The decision to write everything from scratch, as retworkx seems to do, is a major one. It will take a very long time to catch up with other libraries, functionality-wise. It would be good to discuss why this decision was made instead of building on top of an existing solid library such as LEMON.

Graph data structures

This is a good section because it aims to explain some of the conceptual foundations. It could be improved by adding some examples (e.g. to clarify what "callbacks" are). I suggest taking the opportunity to also explain the core capabilities of the library: libraries may support simple graphs, multigraphs with self-loops, directed/undirected or even mixed graphs, weighted graphs, and vertex and edge attributes. Which of these features are or aren't supported?
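To make the reviewer's point concrete, here is a minimal, hypothetical sketch (not retworkx's actual implementation) of the "callback" pattern the section alludes to: edge payloads are arbitrary Python objects, and an algorithm takes a callable that extracts a numeric weight from each payload.

```python
import heapq

# Toy graph where edge payloads are arbitrary Python objects (dicts here).
class TinyGraph:
    def __init__(self):
        self.adj = {}  # node -> list of (neighbor, edge_payload)

    def add_node(self, node):
        self.adj.setdefault(node, [])

    def add_edge(self, u, v, payload):
        self.add_node(u)
        self.add_node(v)
        self.adj[u].append((v, payload))
        self.adj[v].append((u, payload))

def dijkstra(graph, source, edge_cost_fn):
    """Single-source shortest path lengths; weights come from the callback."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, payload in graph.adj[u]:
            nd = d + edge_cost_fn(payload)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = TinyGraph()
g.add_edge("A", "B", {"km": 2.0})
g.add_edge("B", "C", {"km": 3.5})
g.add_edge("A", "C", {"km": 10.0})
print(dijkstra(g, "A", lambda e: e["km"]))  # {'A': 0.0, 'B': 2.0, 'C': 5.5}
```

A short worked example like this in the paper would remove any ambiguity about what "callbacks" means in the retworkx context.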

It would be nice to touch on the tradeoffs I mentioned in the previous section, such as tradeoffs between efficient modification and traversal. How is the graph represented internally?
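The modification-vs-traversal tradeoff mentioned above can be illustrated with a toy sketch (an assumption-free stand-in, not any library's internals): an edge list is cheap to append to, while an adjacency list makes neighbor traversal cheap, and converting between the two has a cost.

```python
# An edge list supports cheap O(1) appends but slow neighbor lookup.
edge_list = []
for u, v in [(0, 1), (0, 2), (1, 2)]:
    edge_list.append((u, v))

# One O(E) pass converts it to an adjacency list, the fast-traversal form.
adjacency = {}
for u, v in edge_list:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

print(adjacency[0])  # [1, 2]
```

Stating which of these forms (or what hybrid) retworkx uses internally would let readers predict its performance profile.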

Use Cases

This section is in fact about benchmarks that compare with igraph, graph-tool and networkx. It would be good to also discuss some actual potential use cases, and mention examples of what tasks one may solve with the library.

The benchmarks are hosted in a different repository under a different user: https://github.com/mtreinish/retworkx-comparison-benchmarks If the paper gets accepted, the current contents of the main repository will be archived along with the paper. With this organization, it looks like the benchmarks might be left out, which will be a problem for future reproducibility. This should be remedied. (@drvinceknight, correct me if I'm wrong about this.)

In fact, some of the benchmarks are flawed (I will open issues for this later—but in which repository should I do this?). If a reader would only see the benchmark results, without an opportunity to look at the benchmark code as well, they will come away with an incorrect impression ... all the more reason to bundle the code.

It's a bit unclear what is being compared between libraries. How fast they accomplish a task (with whatever algorithm they have implemented)? Or how fast the implementation of a specific algorithm is? Sometimes different algorithms are used (e.g. VF2 vs VF2++).

The benchmark intention should be made clearer. The benchmark tasks should be described in more detail, e.g. with shortest paths: Weighted or unweighted? Directed or undirected? These do make a difference in some libraries. Single-source shortest path (implying all other vertices are targets) or the path between one source and one target (the search may be terminated as soon as the target is reached)?
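The single-source vs single-pair distinction raised above matters for timing. As a hedged sketch (not any library's actual code), a single-pair query can terminate as soon as the target is settled, which can be far cheaper on large graphs:

```python
import heapq

def dijkstra(adj, source, target=None):
    """adj maps node -> list of (neighbor, weight)."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:          # early termination for one-to-one queries
            return {target: d}
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist                  # single-source: distances to all reachable nodes

adj = {"A": [("B", 1.0), ("C", 4.0)], "B": [("C", 1.0)], "C": []}
print(dijkstra(adj, "A"))              # {'A': 0.0, 'B': 1.0, 'C': 2.0}
print(dijkstra(adj, "A", target="B"))  # {'B': 1.0}
```

A benchmark that times one library doing the full single-source computation against another doing an early-terminating single-pair query is not comparing like with like.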

There are only three benchmark tasks. What do we learn about the libraries from these? Why were these specific tasks chosen, and do you expect the performance findings to generalize to other tasks? In the case of subgraph isomorphism, I would say no. For shortest paths, they might, as many typical calculations on graphs rely on the same kind of graph traversal as in shortest path finding. However, the shortest path benchmark might be backed up with some others that exercise the library in similar ways (e.g., betweenness).

SNAP is omitted from the comparison "because its Python wrapper did not contain the required functions" (see footnote). Does this refer to VF2? If so, I do not think this is a fair reason to exclude it. Subgraph isomorphism is rarely needed for practical network analysis, and the fact that it's not implemented shouldn't immediately disqualify a library even from the other benchmarks.

szhorvat commented 2 years ago

Graph Creation benchmark

This benchmark is severely flawed.

The igraph "graph creation" benchmark actually first builds a networkx graph then converts that to an igraph object. This doesn't actually measure igraph's performance, and it's no wonder that igraph ends up looking slower than networkx ... https://github.com/mtreinish/retworkx-comparison-benchmarks/blob/main/igraph_bench/gr_parser.py#L65

With igraph, the correct way to create a large graph like this is to first build an edge list as a list of tuples of vertex indices, then call igraph.Graph(edges, n=vertex_count) in one go. Do not call .add_edge() repeatedly, as this operation takes time proportional to the size of the graph. igraph is designed to analyse large static networks efficiently, but it does not currently support efficient modification.

Note that building the graph this way may take less time than reading the entire huge gzipped file line by line using Python. Thus, it makes sense to separate the timings for reading the file vs building the graph. This brings us to the question: what does it even mean to time "creating a graph from an edge list"? The performance will depend on how that edge list was originally stored, and whether it first had to be converted to a suitable input format for a given library.
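The separation of timings suggested above can be sketched as follows (the in-memory file and adjacency builder are stand-ins, not the actual benchmark code):

```python
import io
import time

def read_edge_list(fobj):
    """I/O + parsing phase: turn text lines into integer pairs."""
    return [tuple(map(int, line.split())) for line in fobj]

def build_adjacency(edges):
    """Construction phase: build a traversal-friendly structure."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

data = io.StringIO("0 1\n1 2\n2 0\n")  # stand-in for a (gzipped) edge file

t0 = time.perf_counter()
edges = read_edge_list(data)
t1 = time.perf_counter()
adj = build_adjacency(edges)
t2 = time.perf_counter()

print(f"read: {t1 - t0:.6f}s  build: {t2 - t1:.6f}s")
```

Reporting the two phases separately makes it clear whether a benchmark is measuring the library or the Python file-parsing loop around it.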

szhorvat commented 2 years ago

@drvinceknight: This is my final report.


In the past few days, I have been thinking about how to go forward with the review. My opinion is that the submission to JOSS was premature. This package certainly shows promise, and I think it would make sense to resubmit it in the future, maybe in a year's time. However, trying to work out all the issues within the timeframe of the review does not seem realistic. It would put too much pressure on the authors, as well as on the reviewers.

Software paper

Please refer to my comments above. In short, as currently written, the paper does not provide enough information to readers to decide whether the library would be useful for them.

I leave the checklist items for "Software Paper" unchecked for now (quality of writing and references would need to be re-checked anyway once the paper is updated).

Benchmarks

The library appears to be fast, but there are two issues with the benchmarks, as presented:

More comments are above. I left the Performance item unchecked on the list for now.

The library

According to the submission, retworkx aims to be a new general-purpose Python library for working with graphs. In my comments about the paper, I discussed how different use cases could be targeted within this broader goal. Possibilities are:

It seems to me that at this time, the library is not practically usable for any of these three tasks. There are too many functionality gaps. For example, even decomposing into connected components is not implemented for undirected graphs (only for directed ones). I quickly found bugs in fundamental functionality, such as computing degrees (now fixed). The core API misses features critical for implementing graph algorithms (e.g. getting incident edges). Other examples are in the issues I opened.

I could not check off the Functionality item at this point.

The documentation is somewhat haphazardly written, with many typos and sometimes unintelligible descriptions. The same applies to the naming used in the API, e.g. node_indexes vs edge_indices, a continual conflation of the terms "weight", "payload", and "edge" (all three of which sometimes refer to the very same thing), etc. The impression one gets is that even fundamental parts of the library were put together in a bit of a rush, in response to an immediate need for a specific application within Qiskit.

I could not check off several of the Documentation items.

All that said, the library is built on solid foundations (the petgraph Rust library), and has good potential to grow and progress.

Advice to the authors

While the landscape of Python graph libraries is crowded, there is definitely space for newcomers. However, new libraries should have a vision that distinguishes them. Which gap are you trying to fill? Think about what you are trying to achieve, and instead of letting the library grow organically (one might say haphazardly), think about a coherent design that fits the proposed use case.

It seems to me that retworkx's biggest problem is that it has a single driver: being a component in Qiskit. If it does just that—serve very specific needs within a single project that does not even deal with networks—it is unlikely to become useful to a broader audience, and would not be publishable. I elaborated on this a bit in my comments about the software paper. Let actual general use cases be the driver behind development. Would you like it to be useful for analysing network data? Then you, the developers, should complete a number of network analysis projects. That will show you what is missing (e.g. can you do the first step, importing data from common formats?). Talk to people who work in network science. Do you want it to be useful for prototyping graph algorithms? Then do that: implement algorithms not in Rust, but in pure Python using retworkx. Graph theory? Read Math.StackExchange or even MathOverflow, find computational tasks, and see if you can solve them.

Finally, you have the advantage of starting from scratch. You are not carrying all the heavy historical baggage of other graph libraries like igraph or networkx. First, understand the other libraries' limitations, understand the mistakes they made, learn from them. Come up with a better design. I see functionality being piled on, but very little design. Why are you foregoing this unique opportunity? To be frank, the software paper gave me the impression that retworkx was born out of a "not invented here" mindset, not from a detailed understanding of what is missing in existing Python graph libraries.

mtreinish commented 2 years ago

First, I want to thank you for your detailed look at retworkx so far. I was hoping the paper review would serve as a chance for people with different sets of experience and viewpoints to review the library and point out gaps, and all the issues you've opened will help us make a better library. One of my personal goals with retworkx recently is to expand its usefulness to more domains, including more traditional network analysis, and getting your perspective on this has been quite valuable; you've identified a lot of places we can make improvements. But I think, because your critique here is coming from a completely different viewpoint, you're inadvertently missing the use cases for which retworkx was developed and the value it's providing. What you're attributing to a lack of good design are just different priorities and design decisions we've made in the development of retworkx. This doesn't preclude making the changes you're looking for, but it also doesn't invalidate what has been done so far or how retworkx is used today.

In the development of retworkx so far we've been more focused on applications that use graph or network analysis as a component or tool for a specific larger goal or application. This is why we've designed retworkx to be dynamic and flexible in its use. While retworkx was originally designed to meet Qiskit's needs from a graph library, Qiskit is not a single application or exclusively what retworkx can be used for. More specifically, Qiskit is a constellation of different software projects, each working on a different aspect of quantum computing, including applications of quantum computing. The original use case for retworkx was internal use by the compiler, where a directed acyclic graph is used to represent a quantum circuit (i.e. a program for a quantum computer). The compiler analyzes that graph and performs transformations on it to fit and optimize the input program for the target QPU. This is where we hit scaling limits, which prompted the development of retworkx. In this application, dynamically building and modifying large graphs with thousands of nodes quickly is very important, as it's the bottleneck for compilation. Another application of graphs in Qiskit's compiler uses a directed graph to model the connectivity of qubits on a QPU, which is one of the hardware constraints the compiler needs to adapt an input program to. One example we recently developed applies subgraph isomorphism to this connectivity graph to attempt to find a perfect mapping of the qubits used in a quantum circuit to the physical qubits of a QPU. But this is just the base layer of Qiskit (the interface between software and hardware), and there are higher-level applications of quantum computing which are using retworkx in different ways. For example, the Qiskit Nature project, which is concerned with applications of quantum computing for the natural sciences, recently started using retworkx to create lattice models for physics simulations.

Looking outside of Qiskit there are additional applications currently using retworkx such as qtcodes, which is a downstream user of Qiskit and retworkx directly. qtcodes is using retworkx to research new techniques for quantum error correction (you can see some examples of this here: https://github.com/yaleqc/qtcodes/blob/master/tutorials/rep/2-fitters.ipynb but I'm not an expert in QEC and won't be able to provide an adequate explanation of how it works). Another research project using retworkx is atompack (who were a very early user of retworkx) where they use retworkx to create graph models of atomic structures. Both of these (which we cited in the paper) are using retworkx to generate graphs dynamically as part of a larger application.

As for things you highlight like inconsistencies in the API, bugs in certain functions, or issues in the documentation, many of those are valid things to fix, and highlight why getting multiple sets of eyes on software is so useful, and why I value this review process. But, for some of the issues you've outlined I think this is another place that highlights where we have a difference in priorities. One of the things we value strongly in retworkx is backwards compatibility. While the project may have only existed for a couple of years we already have a large user base both via Qiskit but also outside of it and we can't simply change an API to make naming consistent or change documented behavior without potentially breaking people's applications. For example, one of the quirks you've highlighted, the difference between node_indexes() and edge_indices(), is a known issue and dates back to one of the earliest retworkx releases. We've settled on the name indices in the API and documentation and that's what we've been using whenever we add new functionality that consumes or returns multiple identifiers. But, the node_indexes() method existed prior to that decision and simply renaming that function would break current users which isn't acceptable to us. We have to make any change like that gradually to not disrupt existing working applications and enabling people to migrate to a new node_indices() method. But, making that transition hasn't been a priority as it has real cost for users and most people can accept this and understand what both methods do.
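The gradual-migration approach described here is a standard pattern. A hypothetical sketch (not retworkx's actual code; the class and data are made up for illustration) of keeping the old name alive as a deprecated alias:

```python
import warnings

class PyGraphLike:
    """Toy stand-in for a graph class mid-rename."""

    def __init__(self, indices):
        self._indices = list(indices)

    def node_indices(self):
        """New, consistently named accessor."""
        return list(self._indices)

    def node_indexes(self):
        """Old name, kept as a deprecated alias so existing callers keep working."""
        warnings.warn(
            "node_indexes() is deprecated; use node_indices() instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.node_indices()

g = PyGraphLike([0, 1, 2])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = g.node_indexes()
print(old, len(caught))  # [0, 1, 2] 1
```

This lets users migrate at their own pace across one or more releases before the old name is removed.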

For the paper, part of why we didn't go into too much detail there is the length limits. Reading the expectations for a JOSS paper, we were cautious to avoid explaining all the internal details of the library, saving those for the documentation. The other aspect was that we were trying to avoid putting any project in a negative light. We felt it was important to showcase how retworkx's performance compares to other popular graph libraries for Python, because we felt the performance characteristics of the library are one of the key reasons you would use retworkx. But we weren't out to paint any particular library in a negative light, as they're all useful for different applications, and we tried to avoid going into too deep an analysis or comparison because the purpose of the paper is not to be a survey comparing different Python graph libraries in detail. It is also difficult to explain all these nuances in just ~1000 words. We'll try to tweak the paper based on your suggestions and update the review here with an updated draft when it's ready.

For the specific point about the creation benchmark, you're right that we shouldn't have used the networkx converter there with igraph. This was done mainly because creating a graph with igraph in the same manner as the other benchmarked libraries was exceedingly slow (taking hours to create the graph). As we're not experts in igraph, we referred to the documentation and couldn't find any mention of the performance characteristics of the add_edge() method that you describe. We used networkx and converted to igraph because we wanted numbers in the same ballpark (instead of showing it as orders of magnitude slower), but you're right that this isn't a good demonstration of how igraph performs. As for the validity of the benchmark itself, I have to disagree: it is a valid use case and one we value heavily in Qiskit's use of retworkx. Being able to iteratively create a graph quickly is important in some applications, especially those that use networks as part of a larger computation where the graph is dynamic. Not every application can statically create a graph all at once from a static edge list and then perform analysis. Looping over an input file like that and creating the graph one edge at a time showcases the time it takes to build a graph iteratively. But I agree we didn't do the best job of explaining this in the paper; we'll update the wording to make it clearer what we're trying to show with that benchmark. As for igraph, we'll update the graphs to not include it and mention that it's not suitable for applications that need to dynamically create large graphs iteratively.
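To make the distinction concrete, the two construction patterns being compared look roughly like this. `TinyGraph` is a hypothetical pure-Python stand-in for any graph library's graph class, not retworkx's or igraph's actual API:

```python
from collections import defaultdict

class TinyGraph:
    """Hypothetical stand-in for a graph library's graph class."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        # Incremental insertion: called once per edge as data arrives.
        self.adj[u].add(v)
        self.adj[v].add(u)

    @classmethod
    def from_edge_list(cls, edges):
        # Bulk construction: the full edge list is known up front, so a
        # library can preallocate and build the structure in one pass.
        g = cls()
        for u, v in edges:
            g.add_edge(u, v)
        return g

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]

# Pattern the benchmark measures: the graph is built one edge at a time,
# e.g. while streaming edges from a file or a larger dynamic computation.
g = TinyGraph()
for u, v in edges:
    g.add_edge(u, v)

# Pattern that only works when the edge list is static and complete.
g2 = TinyGraph.from_edge_list(edges)

assert g.adj == g2.adj
```

The benchmark's point is that some workloads only ever see the first pattern, so per-call `add_edge` cost matters even when a library offers a fast bulk constructor.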

szhorvat commented 2 years ago

Let me sum up my recommendation, to avoid misunderstandings.

There is no doubt that retworkx shows a lot of promise, but it does not look mature enough. When trying to use it for various tasks, its limits are reached very quickly and the functionality gaps become apparent. It was easy to find bugs, which again suggests that it has not been thoroughly tested through real-world usage not related to Qiskit and its dependents. I think resubmission at a more mature stage would make sense.

At that time, it would help to (1) state the scope clearly and (2) demonstrate use cases within that scope. As a comparison, another library I reviewed recently has several worked examples in its documentation, which show that it can be used to solve new tasks within its scope. The paper word limit should not be an issue, since many things are better placed in the documentation anyway.

All the issues I brought up are fixable, and I hope to see retworkx succeed and eventually published in JOSS. However, in my opinion, this would take much more time than what is reasonable during a review. I do not think that it is in the authors' interest to try to address all concerns under time pressure, especially since parts of the library already make the impression that they were created in a rush. Also consider that retworkx, like most graph libraries (and perhaps unlike most JOSS submissions), has a very large scope, thus it simply takes more time and effort to bring it to a reasonable level of completion.

I will address some of the specifics in @mtreinish's response at a later time.

inakleinbottle commented 2 years ago

@drvinceknight I don't seem to have the ability to update the checklist in the initial issue post.

I have completed my initial review of the software and paper.

General comments

I'm glad to see Rust being used as a safe and fast back-end for serious computation, and I have no doubt that this fulfills the needs of the Qiskit project. However, I too share some of the other reviewer's concerns about the utility of the library for a more general audience and about the benchmark methodology.

A cursory look at the reference documentation suggests that retworkx has some support for other graph algorithms, such as cutting and colouring, but these are not mentioned at all in your benchmarks, and the support looks relatively limited compared to networkx, igraph, or graph-tool. To be clear, I would not expect feature parity with these very mature libraries, and this is not a criticism I put much weight behind, but I would like to see benchmarks for all of the functionality you have (and that is stable) against these other libraries, to show some direct benefit of retworkx over other tools. The reason I mention this is that the name "retworkx" implies a fast alternative to "networkx", but in reality only a small subset of the functionality is covered as of yet.

Documentation

The statement of need suggests a need for a drop-in replacement for the networkx library in which the core algorithms and data structures are passed to a compiled binary to do the actual work, thus improving the performance with minimal changes to the Python API. I agree that there is a place for such a library in the current ecosystem, and I am happy to accept that the external API of the other graph libraries mentioned do not fulfill the requirements for which retworkx was designed. I cannot comment on whether more general users of graph libraries in Python would find this to be true.

The installation instructions provided in the documentation are excellent and, since this is distributed as a wheel containing the compiled binaries for my architecture and OS, I had no problems installing the library for testing. (This should be the case for most users, as wheels are available for Python 3.6-3.9 on most of the major architectures. However, since Python 3.10 was released recently, the authors may wish to consider adding a wheel for Python 3.10 too.) As for compiling the library from source, I have some concerns that I will detail below.

The API documentation is fine, and the core classes have plenty of example usage, though I have not checked exhaustively. However, the algorithm functions have rather sparse documentation. It would be good to have at least one usage example for each of the main algorithm functions, even if it is trivial, so users can see what a typical call might look like. I would also like to see more top-level examples that walk through how to use the library to construct, manipulate, and use graph objects in a more complete pipeline; I have always found such "tutorial"-style documentation to be a great way to cover much of the usage without users needing to dig into the API documentation.

The library has a suite of tests, which as far as I can see check results against networkx (again, I have not checked every test file). This is fine, assuming that networkx does not contain any bugs. I would suggest, if there are not already, adding tests that explicitly construct correct results to compare computed results against, rather than relying on networkx for correctness.
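A sketch of what such a test might look like, asserting against a hand-computed expected result rather than a networkx call. The `dijkstra` helper here is a hypothetical stdlib implementation standing in for whatever library function is under test:

```python
import heapq
import unittest

def dijkstra(adj, source):
    """Shortest path lengths from source over a weighted adjacency dict.

    Hypothetical stand-in for the library function being tested.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

class TestDijkstraExplicit(unittest.TestCase):
    def test_small_graph(self):
        # Expected distances worked out by hand, not taken from another
        # library: 0->1 costs 1, 0->2 is cheapest via 1 (1+2=3),
        # 0->3 is cheapest via 2 (3+3=6).
        adj = {
            0: {1: 1, 2: 4},
            1: {2: 2, 3: 6},
            2: {3: 3},
        }
        self.assertEqual(dijkstra(adj, 0), {0: 0, 1: 1, 2: 3, 3: 6})
```

Tests of this shape catch the case where both implementations share the same wrong assumption, which comparison-only tests cannot.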

You have documentation about how people might contribute to the library, but nothing to suggest how users should raise issues. I assume, since the project is hosted on GitHub, that they should create an issue on the repository. I would add a small section to the end of the Readme to confirm this, or point users to the place where they can file a bug report. (This can also be set via the "bug tracker" attribute on PyPI.)

Functionality

As I mentioned, I had no problems installing the library from a wheel. However, I noticed that the setup.py file requires the Rust-specific setuptools addon, yet this is not listed as a build-time dependency. Rather, the setup.py file attempts to import this addon and, if that fails, uses a subprocess call to pip to install the necessary package before proceeding. I understand why the authors have structured the code in this way, but I don't think it is a good idea. First, the pip install step might fail, leaving users with a cryptic error message when the setup resumes. Second, some users might be using tools to scan dependencies, and this dependency would be missed. This is something the setup.py file format doesn't address well, and I can't offer a better solution.

As for functionality, there are some small inconsistencies between the interfaces of retworkx and its inspiration, networkx, that users might find a little frustrating (though nothing, as far as I can see, that would cause problems). At present, retworkx does not cover the same range of functionality as its competition, though that could hardly be expected (as mentioned above, those libraries are very mature).

I agree that the performance benchmarks leave something to be desired, though I have not delved into them quite as deeply as the other reviewer. I won't reiterate the comments about the construction benchmarks, though these do need to be addressed. Instead, I will mention that the benchmarks should cover a wider range of use cases. The graphs used in benchmarking are generally pretty large, while a large number of graph library users primarily work with graphs that are relatively small by comparison (a few hundred nodes or fewer). Your benchmarks don't address this end of the spectrum at all. I would consider adding benchmarks for these cases too, although I expect the performance gap to be considerably narrower there.

Software paper

Generally I don't see any issues in the software paper. The summary and statement of need are clear. The code for the benchmarks should be included in the main repository, so it is more strongly linked to the version of retworkx. I suggest making a folder called "benchmarks" and placing the code in there.

whedon commented 2 years ago

:wave: @szhorvat, please update us on how your review is going (this is an automated reminder).

whedon commented 2 years ago

:wave: @inakleinbottle, please update us on how your review is going (this is an automated reminder).

mtreinish commented 2 years ago

However, since Python 3.10 was released recently, the authors may wish to consider adding a wheel for Python 3.10 too.

We've already set up CI on the main branch for 3.10 support (https://github.com/Qiskit/retworkx/pull/447) and we're going to be publishing wheels for 3.10 in the next release (which has unfortunately been delayed a bit; we were originally hoping to have it out a month or two ago).

As I mentioned, I had no problems installing the library from a wheel. However, I noticed that the setup.py file requires the Rust-specific setuptools addon, yet this is not listed as a build-time dependency. Rather, the setup.py file attempts to import this addon and, if that fails, uses a subprocess call to pip to install the necessary package before proceeding. I understand why the authors have structured the code in this way, but I don't think it is a good idea. First, the pip install step might fail, leaving users with a cryptic error message when the setup resumes. Second, some users might be using tools to scan dependencies, and this dependency would be missed. This is something the setup.py file format doesn't address well, and I can't offer a better solution.

This is mostly a holdover for users on older versions of pip. We have setuptools-rust listed as a build system requirement in pyproject.toml: https://github.com/Qiskit/retworkx/blob/main/pyproject.toml#L2 (which hopefully scanning tools look at), and that is the only place we can list it and have it installed early enough (setup_requires in setup.py is too late). But for users on older pip versions that didn't fully support PEP 517, this was insufficient: pip wouldn't install the build system requirements, which resulted in a cryptic import error. The compromise to avoid that was to pip install in setup.py. I agree that pip installing as part of setup.py isn't a great approach, and the need for it has mostly disappeared since RHEL/CentOS 6, which is where most of our users with ancient pip versions came from, is EoL. I think we can probably get away with removing it from setup.py and improving the documentation in the readme to explicitly mention upgrading pip or manually installing setuptools-rust. I've pushed a PR proposing these changes here: https://github.com/Qiskit/retworkx/pull/533
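For reference, PEP 518's way of declaring this is the `[build-system]` table in pyproject.toml. A minimal fragment of the kind being described (the exact entries and any version pins in retworkx's actual file may differ):

```toml
[build-system]
requires = ["setuptools", "wheel", "setuptools-rust"]
build-backend = "setuptools.build_meta"
```

Any pip version with full PEP 517/518 support installs everything in `requires` into an isolated build environment before running the build, which is why the setup.py bootstrap is only needed for pip versions predating that behavior.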

szhorvat commented 2 years ago

As promised, I will respond to some of the specific points that @mtreinish made. Regarding the reminder, my review should be considered complete.


But, I think because your critique here is coming from a completely different viewpoint you're inadvertently missing the use cases for which retworkx was developed and the value it's providing.

I think this illustrates well why the statement of need must be improved both in the paper and in the documentation.

What you're attributing to a lack of good design are just different priorities and design decisions we've made in the development of retworkx.

Let me clarify my concerns about design. A good example is the lack of consistency in the representation of edges, see https://github.com/Qiskit/retworkx/issues/512 for details. There are fundamental problems here that will have long-term consequences if you plan to support multigraphs. I am not talking about priorities, but about whether certain things are possible at all.

For example, can you implement a function that computes a cycle basis in Python, not Rust? It seems to me that this is at best very awkward and at worst impossible, partly due to https://github.com/Qiskit/retworkx/issues/509. How about implementing an undirected Dijkstra shortest path finder which returns a sequence of edges (not nodes) whose attributes can later be retrieved?

I am writing this with the assumption that retworkx aims to be suitable for prototyping algorithms. Correct me if this is not the case. (That would bring us back to the statement of need and the description of the library's purpose.)

Regarding current use cases for the library: It would help if some example uses would be presented in the documentation.

For the paper, part of why we didn't go into too much detail there is partially because of the length limits.

Many of the things that are missing from the paper could be included in the documentation instead.

Consider spending the word limit on what matters. The "Use Cases" section, which takes up half of the paper by page count, does not actually discuss any use cases: it just presents benchmarks and restates in words some of what we already see in the figures. This is of course your choice, but I would not include the benchmarks in the paper at all (you can still refer to them), unless you think they are important for answering the questions I listed at the beginning of this comment.

The other aspect was we were trying to avoid putting any project in a negative light.

I do not think that this is a problem. It should be stated clearly what advances and novel capabilities this new library brings. This is not possible without referring to existing libraries.

Being able to iteratively create a graph quickly is important in some applications, especially those that are using networks as part of a larger computation for analysis where the graph is dynamic.

Yes, you are right. This is the type of analysis I expected to see in the statement of need. It would be even better if you can mention specific use cases where the graph must be built incrementally (i.e. first collecting edges, and then adding them to the graph all at once, is not an option).

drvinceknight commented 2 years ago

Apologies for the delay all.

@inakleinbottle I am not sure why you are unable to tick the boxes. Could you confirm that your review is finished and you're happy for the paper to be accepted? (No problem if not, just clarifying for my benefit)

Regarding https://github.com/openjournals/joss-reviews/issues/3968#issuecomment-995197715, my review should be considered complete.

@szhorvat likewise @szhorvat, are you happy for the paper to be accepted? (No problem if not, just clarifying for my benefit)

szhorvat commented 2 years ago

@szhorvat likewise @szhorvat, are you happy for the paper to be accepted? (No problem if not, just clarifying for my benefit)

Here's the gist of my opinion:

My opinion is that the submission to JOSS was premature. This package certainly shows promise, and I think it would make sense to resubmit it in the future, maybe in a year's time. However, trying to work out all the issues within the timeframe of the review does not seem realistic. It would put too much pressure on the authors, as well as on the reviewers.

The package shows good promise and I certainly hope that it will be published in JOSS eventually. However, the quality needs to be improved before then. I am not sure how much times JOSS is willing to give authors to make changes. The reason why I recommended later resubmission was to avoid putting time pressure on all those involved (especially the authors).

I should note that the authors have made many improvements since these reviews were written and are working on a revised version of the paper. https://github.com/Qiskit/retworkx/pull/545

drvinceknight commented 2 years ago

Thank you @szhorvat.

IvanIsCoding commented 2 years ago

@whedon generate pdf from branch joss-paper

editorialbot commented 2 years ago

My name is now @editorialbot

IvanIsCoding commented 2 years ago

@editorialbot generate pdf from branch joss-paper

editorialbot commented 2 years ago

I'm sorry human, I don't understand that. You can see what commands I support by typing:

@editorialbot commands

IvanIsCoding commented 2 years ago

@editorialbot set joss-paper as branch

editorialbot commented 2 years ago

I'm sorry @IvanIsCoding, I'm afraid I can't do that. That's something only editors are allowed to do.

IvanIsCoding commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:warning: An error happened when generating the pdf.

IvanIsCoding commented 2 years ago

@drvinceknight We have revised the software paper and want to generate a new version, however we are struggling with the bot. Can you help setting the branch to joss-paper?

arfon commented 2 years ago

@editorialbot set joss-paper as branch

editorialbot commented 2 years ago

Done! branch is now joss-paper

arfon commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

drvinceknight commented 2 years ago

@editorialbot set joss-paper as branch

Thank you @arfon.

Apologies for the delay @IvanIsCoding, let me know if there's anything else you need at this stage.

szhorvat commented 2 years ago

What is the situation with this paper now @drvinceknight, @IvanIsCoding, @mtreinish, @inakleinbottle?

Are we ready for a new review round?

IvanIsCoding commented 2 years ago

What is the situation with this paper now @drvinceknight, @IvanIsCoding, @mtreinish, @inakleinbottle?

Are we ready for a new review round?

@szhorvat @inakleinbottle The software paper is ready for a new round of review, there are substantial changes compared to the first version.

Addressing the issues opened on our repository is still a work in progress; we have closed 13 of the 24 issues with the joss tag.

szhorvat commented 2 years ago

The paper is in much better shape now, and many of the concerns I had have been addressed. While this library is still young, with missing features and rough edges, it is making good progress and shows promise. The intended use cases are now better explained in the paper. I recommend publication in JOSS.

However, I would like to ask the authors to respond about whether they have cleared the name with the NetworkX team. "retworkx" very obviously references "NetworkX", and people will assume a relationship.


Some hopefully useful comments to the authors:

Addressing these would improve the paper:

I believe that the lack of a consistent edge representation throughout the API that allows distinguishing parallel edges is a major weakness, and will become a severely limiting factor in the future (unless a decision is made simply not to support multigraphs). https://github.com/Qiskit/retworkx/issues/512

Many other suggestions are in the preceding comments and open issues. I hope that at least some of these were useful.

szhorvat commented 2 years ago

This is a question to the JOSS editors, @drvinceknight

I see several citations of GitHub repos. Does JOSS have guidelines on how to cite repositories, especially how to determine the author list when the project does not provide citation information? For example, what would be an appropriate citation for https://github.com/petgraph/petgraph ?

drvinceknight commented 2 years ago

Thank you for your patience while I got back to you, @szhorvat. In general, the advice would be to seek guidance from the repository itself, which is complicated when there is no specific guidance. I have dropped the project a line to ask myself.

drvinceknight commented 2 years ago

I believe that the lack of a consistent edge representation throughout the API that allows distinguishing parallel edges is a major weakness, and will become a severely limiting factor in the future (unless a decision is made simply not to support multigraphs). Qiskit/retworkx#512

Many other suggestions are in the preceding comments and open issues. I hope that at least some of these were useful.

@IvanIsCoding thanks for your patience, when you have a moment, could you let me know where we are in terms of addressing the remaining comments?

IvanIsCoding commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left: