openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
722 stars 38 forks source link

[REVIEW]: mpi4jax: Zero-copy MPI communication of JAX arrays #3419

Closed whedon closed 3 years ago

whedon commented 3 years ago

Submitting author: @dionhaefner (Dion Häfner) Repository: https://github.com/mpi4jax/mpi4jax Version: v0.3.2 Editor: @kellyrowland Reviewers: @1313e, @Himscipy Archive: 10.5281/zenodo.5337529

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/3e6d19d9cccf7620aeaf142feaa9dfb5"><img src="https://joss.theoj.org/papers/3e6d19d9cccf7620aeaf142feaa9dfb5/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/3e6d19d9cccf7620aeaf142feaa9dfb5/status.svg)](https://joss.theoj.org/papers/3e6d19d9cccf7620aeaf142feaa9dfb5)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@1313e & @Himscipy, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @kellyrowland know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Review checklist for @1313e

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @Himscipy

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @1313e, @csadorf it looks like you're currently assigned to review this paper :tada:.

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository which means for GitHub's default behaviour you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

watching

  1. You may also like to change your default settings for this watching repositories in your GitHub profile here: https://github.com/settings/notifications

notifications

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf
whedon commented 3 years ago
Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.16 s (451.6 files/s, 64757.4 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Python                          44           1620           1510           4475
Cython                           5            385             80           1032
YAML                             9             72             10            258
reStructuredText                 8            217            246            153
Bourne Shell                     2             28             85            147
TOML                             1              5              2             38
DOS Batch                        1              8              1             26
Markdown                         1              5              0             17
diff                             1              3             13             13
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                            73           2347           1954           6168
-------------------------------------------------------------------------------

Statistical information for the repository '31e8fc2678107bcd087fcbc0' was
gathered on 2021/06/25.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Clemens Giuliani                 2           562             49            4.34
Dion Häfner                     57          8326           2750           78.69
Filippo Vicentini               34          1961            427           16.97

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Clemens Giuliani            247           44.0          5.3                4.45
Dion Häfner                6576           79.0          3.5                9.76
Filippo Vicentini           782           39.9          2.0                6.52
whedon commented 3 years ago

PDF failed to compile for issue #3419 with the following error:

 Can't find any papers to compile :-(
kellyrowland commented 3 years ago

@whedon generate pdf from branch joss-paper

whedon commented 3 years ago
Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

whedon commented 3 years ago

:wave: @1313e, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @csadorf, please update us on how your review is going (this is an automated reminder).

1313e commented 3 years ago

Yup, I will be on it next week.

1313e commented 3 years ago

@dionhaefner I have a few questions regarding authorship and the paper itself. I am posting them here such that it is easier to track for others and to allow for feedback.

I wanted to ask why only one of the two of you is stated as being an author of the software in the setup.py file? Shouldn't that be both?

The order of authors on the paper is not the same as the order of the contributors (yes, I can see how made more contributions in terms of lines). Given how important first author usually is, I want to make sure it is done correctly. So, can you confirm that the order is correct?

There is a third person named inailuig, who despite only having 2 contributions, does seem to have made a proper impact on the package. Is there a reason this person is not included as an author on the paper?

The paper itself seems to be a bit too long in my opinion. Please correct me if I am wrong @kellyrowland, but the point of a JOSS paper is to basically give a summary or abstract to your software, just to make software easily citable. The documentation should actually provide all the details and information required for using the package, whereas the package more gives a general description to get others interested in it. In this particular case, there are quite a few code snippets with explanations in the paper that are more well-suited for the online documentation. Furthermore, I also cannot find these code snippets with explanations in the online documentation, so it might be a good idea to move them there.

Finally, the paper mentions in the Outlook that there are quite a few more functionalities that must be implemented for the package. Isn't it maybe a bit too early to already publish it then, even if it is already useful to others (given the number of stars, I can see that it is being used)?

csadorf commented 3 years ago

👋 @csadorf, please update us on how your review is going (this is an automated reminder).

I should be able to complete it by the end of the week.

dionhaefner commented 3 years ago

Thank you very much for your comments @1313e. Answers inline, and I will address your issues on the mpi4jax repo soon.

I wanted to ask why only one of the two of you is stated as being an author of the software in the setup.py file? Shouldn't that be both?

The order of authors on the paper is not the same as the order of the contributors (yes, I can see how made more contributions in terms of lines). Given how important first author usually is, I want to make sure it is done correctly. So, can you confirm that the order is correct?

We flipped a coin for first authorship of the paper, as also indicated in the author list. We feel that both have contributed equally. So the order is correct, yes.

In case of the README and setup.py, we think that full attribution and first-authorship as not nearly as important as on an academic paper. The README just lists contributors in chronological order. But I actually think this list is redundant, as it's already shown in the GitHub UI in the sidebar. So I will remove the contributors section from the README.

In case of setup.py, this is often just the original author or main contact for questions. In my experience it is not meant to give an exhaustive list.

There is a third person named inailuig, who despite only having 2 contributions, does seem to have made a proper impact on the package. Is there a reason this person is not included as an author on the paper?

We value every contribution, but we feel that the bar for academic authorship should be a little bit higher than a one-time code contribution. If I interpret the JOSS guidelines correctly, authorship questions should be settled through "common sense", and we do feel that the author list is an accurate reflection of the work that went into the package.

The paper itself seems to be a bit too long in my opinion. Please correct me if I am wrong @kellyrowland, but the point of a JOSS paper is to basically give a summary or abstract to your software, just to make software easily citable. The documentation should actually provide all the details and information required for using the package, whereas the package more gives a general description to get others interested in it. In this particular case, there are quite a few code snippets with explanations in the paper that are more well-suited for the online documentation. Furthermore, I also cannot find these code snippets with explanations in the online documentation, so it might be a good idea to move them there.

I take it you refer to the second half of the "Implementation" section? We use these code snippets as a device to explain why we need to go out of our way with token management, but if you think they are distracting we could remove them. There are similar examples in the corresponding part of the docs.

If you mean the shallow water application, that one is also part of the docs. Same applies here, if you think the code is distracting we can remove it and refer to the docs, and just include the benchmark timings and the figure.

Finally, the paper mentions in the Outlook that there are quite a few more functionalities that must be implemented for the package. Isn't it maybe a bit too early to already publish it then, even if it is already useful to others (given the number of stars, I can see that it is being used)?

This seems to be a misunderstanding; there is nothing that must be implemented for the package to be useful. There are 2 major use cases for mpi4jax:

  1. To use MPI primitives in JITed code without overhead on CPU and GPU.
  2. To do distributed autograd, where you differentiate through communication calls.

Use case 1 is mature and fully supported, and this is what we describe in the paper (and, as you noted, is already being used in the wild). Use case 2 is not really supported, but might be in the future (but it could also turn out to be technically impossible). I think it is customary to extrapolate a bit into the future in the last section, also to inspire potential contributors. If you prefer that we stick to what is already supported we can remove it.

1313e commented 3 years ago

@dionhaefner

We value every contribution, but we feel that the bar for academic authorship should be a little bit higher than a one-time code contribution. If I interpret the JOSS guidelines correctly, authorship questions should be settled through "common sense", and we do feel that the author list is an accurate reflection of the work that went into the package.

Okay, good. I just wanted to make sure they weren't simply forgotten about.

I take it you refer to the second half of the "Implementation" section? We use these code snippets as a device to explain why we need to go out of our way with token management, but if you think they are distracting we could remove them. There are similar examples in the corresponding part of the docs.

Yes, I am referring to that part. I am not saying it isn't useful, because it most definitely is. It is just a lot of information that I feel is a bit too extensive for the paper itself. But again, if the editor is happy with it, then I am as well.

This seems to be a misunderstanding; there is nothing that must be implemented for the package to be useful. There are 2 major use cases for mpi4jax:

  1. To use MPI primitives in JITed code without overhead on CPU and GPU.
  2. To do distributed autograd, where you differentiate through communication calls.

Use case 1 is mature and fully supported, and this is what we describe in the paper (and, as you noted, is already being used in the wild). Use case 2 is not really supported, but might be in the future (but it could also turn out to be technically impossible). I think it is customary to extrapolate a bit into the future in the last section, also to inspire potential contributors. If you prefer that we stick to what is already supported we can remove it.

Please note that I am not saying that the package isn't useful or feature complete. On the contrary actually. I am more pointing out that more functionality seems to be planned for the future, as you mention as well, and to me it seemed to be quite a bit. I was therefore simply wondering whether it would be a good idea to publish now instead of later, when the paper can also introduce those added functionalities.

Hope my comments didn't feel too direct. I actually really like the package and made me interested in JAX and mpi4jax as well. :)

dionhaefner commented 3 years ago

Thanks for the clarification! Sounds like we are on the same page. I think your comments were perfectly appropriate.

I was therefore simply wondering whether it would be a good idea to publish now instead of later, when the paper can also introduce those added functionalities.

We have considered that, but given that we don't know whether autgrad-through-MPI will ever be part of mpi4jax (either because we don't have the resources or because of fundamental technical obstacles, perhaps both), we decided that it was a good time to publish now.

PhilipVinc commented 3 years ago

We flipped a coin for first authorship of the paper, as also indicated in the author list. We feel that both have contributed equally. So the order is correct, yes.

Bt the way, @dionhaefner , I noticed just now, but in my field it is customary to put the asterisk of "both authors contributed equally" on both authors (not only the second). So maybe we can address that?

Finally, the paper mentions in the Outlook that there are quite a few more functionalities that must be implemented for the package. Isn't it maybe a bit too early to already publish it then, even if it is already useful to others (given the number of stars, I can see that it is being used)?

I am writing an article for netket to be submitted in a few months and i would like to cite mpi4jax because NetKet could not work without it. I believe mpi4jax is now mature and battle tested by netket users enough to warrant a paper. (In fact, mpi4jax was born as a subpackage in netket and was soon moved out once i started discussing with @dionhaefner because it looked like it could gain more traction.)

About not being complete: me and dion see different value in mpi4jax. He sees part 1, the collective communication, as the most important part, while i am mostly interested in part 2, that is autodiff through mpi calls, which is what I need in NetKet. As he said, part 1 is feature-complete (bar a few exotic mpi calls, which i don't think warrant stalling the article) while part 2 is not. Many things do work, even if we don't mention them in the article (mainly collective communication), but some (send/recv) do with very peculiar limitations (like only in reverse mode, or only in forward mode) due to limitations in jax that even I am not 100% sure if they can be overcome or not.

For that reason we agreed to focus the JOSS paper on part 1 (MPI), and mention part 2 (AD) as something that can be fully fleshed out in the future, even if we do already support AD in the more common use-cases.

dionhaefner commented 3 years ago

@whedon generate pdf from branch joss-paper

whedon commented 3 years ago
Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

dionhaefner commented 3 years ago

Bt the way, @dionhaefner , I noticed just now, but in my field it is customary to put the asterisk of "both authors contributed equally" on both authors (not only the second). So maybe we can address that?

We had that in the first draft, but at some point it disappeared, so I assumed you removed it :) Anyhow, it's back.

csadorf commented 3 years ago

@kellyrowland I did not realize that one of the authors is affiliated with institution I am employed by: École Polytechnique Fédérale de Lausanne (EPFL). According to the COI Policy, this constitutes a COI. I have not personally interacted or collaborated with Filippo Vicentini or their host laboratory. I am going to continue with the review process, but will wait on confirmation on whether the COI is waived or not before submitting any comments.

kellyrowland commented 3 years ago

Hi @csadorf - thanks for bringing this up. I will try to find another reviewer in the interest of conservatism with respect to the COI policy. Having said that, if we're unable to find another reviewer, it does sound like there is enough separation in this case that we could waive it.

kellyrowland commented 3 years ago

Hello @adam-m-jcbs @gonsie 👋 would either of you be available and willing to review this paper for JOSS?

gonsie commented 3 years ago

Sorry, I’m not available to provide a review.

kellyrowland commented 3 years ago

No worries, thanks for the response!

kellyrowland commented 3 years ago

Hello @mikaem @Himscipy - would either of you be available and willing to review this for JOSS? It looks like both of you may have ongoing JOSS reviews, so no worries about declining this.

Himscipy commented 3 years ago

👋🏼 @kellyrowland, I am available to review this JOSS submission. Thank you for reaching out.

kellyrowland commented 3 years ago

Wonderful, thank you! I will assign you as a reviewer on the issue shortly.

kellyrowland commented 3 years ago

@whedon remove @csadorf as reviewer

whedon commented 3 years ago

OK, @csadorf is no longer a reviewer

kellyrowland commented 3 years ago

Hi @csadorf - I've removed you as a reviewer here to avoid the COI. Thank you again for bringing that to my attention.

kellyrowland commented 3 years ago

@whedon add @Himscipy as reviewer

whedon commented 3 years ago

OK, @Himscipy is now a reviewer

csadorf commented 3 years ago

Hi @csadorf - I've removed you as a reviewer here to avoid the COI. Thank you again for bringing that to my attention.

Got it. Happy to review other submissions in the future!

1313e commented 3 years ago

@kellyrowland I am done with my review and happy for this paper+software to be accepted. I think it is a great addition to the open-source community. :)

dionhaefner commented 3 years ago

Thanks for the review @1313e!

kellyrowland commented 3 years ago

Hi @Himscipy just checking in - please let me know about the status of your review or if you need to step away from it.

Himscipy commented 3 years ago

Hi @kellyrowland , Thanks for checking. I plan to finish it over the weekend, if that okay.

kellyrowland commented 3 years ago

@Himscipy that sounds great, thank you!

Himscipy commented 3 years ago

Hi @dionhaefner, Following are some comments regarding the submission :

  1. The software tools wraps mpi4py library with jax, is a very interesting idea and tool such as this will be really beneficial for scientific community. With that said, I find that the paper doesn't describe findings the finding from the second example listed in the introduction section of the paper(NetKet).
  2. The paper does describe the scalability for the a 2D shallow water on ~6M grid, but doesn't clearly mention about the workload handled by each worker, and the mesh partition strategy a user can directly leverage from the tool. For the general application perspective, it's important to clearly describe this in the paper.
  3. In addition if you also provide a 3D example case and the working mesh partitioning strategies, the usability of the tool can increase multifold. I am not aware of existing JAX based Numerical scientific solvers, if there are such existing codes, a impactful use case would be to show how your software tool can integrate with it seamlessly. The paper does mention about a Jax based code "Veros" but did not reports the scalability and performance. In addition the "Veros " documentation does list mpi4Jax for distributed computing.
  4. As the work talks about communication and distributed computing, for HPC community it is important to have code profiling and timing results from the profiler to demonstrate the applicability of the tool.
  5. While the reported benchmarks are on small machine, I recommend to test the wrapper performance on large clusters as well and report their results and findings in the paper.
  6. The "zero copy " mentioned in the paper title is never explained or elaborated in the paper. My recommendation would be to provide explanation about it in the paper.
dionhaefner commented 3 years ago

Thanks for the review @Himscipy.

  1. The software tools wraps mpi4py library with jax, is a very interesting idea and tool such as this will be really beneficial for scientific community. With that said, I find that the paper doesn't describe findings the finding from the second example listed in the introduction section of the paper(NetKet).

I'm not sure I follow. What kind of extra information would you expect? I'm sure @PhilipVinc would be happy to provide.

  1. The paper does describe the scalability for the a 2D shallow water on ~6M grid, but doesn't clearly mention about the workload handled by each worker, and the mesh partition strategy a user can directly leverage from the tool. For the general application perspective, it's important to clearly describe this in the paper.

I have no problem with mentioning the mesh partitioning and grid points per worker. I'll add it to the manuscript. I'm not sure though what you mean by "a user can directly leverage from the tool". All interaction with communicators is handled on the user side via mpi4py, so there's nothing we add or take away.

  1. In addition if you also provide a 3D example case and the working mesh partitioning strategies, the usability of the tool can increase multifold. I am not aware of existing JAX based Numerical scientific solvers, if there are such existing codes, a impactful use case would be to show how your software tool can integrate with it seamlessly. The paper does mention about a Jax based code "Veros" but did not reports the scalability and performance. In addition the "Veros " documentation does list mpi4Jax for distributed computing.

I agree that a 3D example would be great, but it's not exactly something I have "lying around", and since JAX is relatively new there's not a huge ecosystem to fall back to. If you have a simple 3D toy example I would be happy to see if we can make it work - that would definitely make a good addition to the docs.

I am the maintainer of Veros, and yes we do use mpi4jax to scale to small clusters and dozens of GPUs. I could add a link to the Veros benchmarks page to the paper?

  1. As the work talks about communication and distributed computing, for HPC community it is important to have code profiling and timing results from the profiler to demonstrate the applicability of the tool.

We could try and quantify the overhead incurred by using mpi4jax by benchmarking it against mpi4py. Would that sound like a useful addition? Otherwise you'll need to be more specific what we should profile, since all actual communication happens inside mpi4py / MPI.

  1. While the reported benchmarks are on small machine, I recommend to test the wrapper performance on large clusters as well and report their results and findings in the paper.

We benchmark Veros for up to 256 cores (see link above). Would it be sufficient to link to that, given that we also measure overhead vs mpi4py (4)? (Besides that, I don't have access to a large cluster.)

  1. The "zero copy " mentioned in the paper title is never explained or elaborated in the paper. My recommendation would be to provide explanation about it in the paper.

Sure, I'll add a sentence.

PhilipVinc commented 3 years ago

If I can add something...

  1. As the work talks about communication and distributed computing, for HPC community it is important to have code profiling and timing results from the profiler to demonstrate the applicability of the tool.

While I do agree that benchmarks in general are a usefull tool, i don't think they are really relevant here. mpi4jax is literally just a wrapper+autodiff rules, as you recognise yourself in your question (5). It just turns out that writing a wrapper for jax is not super-simple and we would like to be able to cite our work for it.

Properly benchmarking mpi4jax will, in the end, tell us what is the cost of a function call in XLA and how performant is the particular MPI implementation of the (super)computer used in the benchmark. Moreover, in any reasonably-sized workload this cost will be negligible compared to the MPI operation itself.

I don't think that this information is particularly useful. (Addendum: It is also a moving target: customcall runtime cost has been halved in recent versions of jaxlib/XLA and depends on the library version the user has installed).

dionhaefner commented 3 years ago

FWIW, I did some quick benchmarks and found that mpi4jax incurs an overhead of about 800ns (so way beyond noise level in any application I can think of).

I think that could be a nice result to add to the paper, to show that we actually achieve zero-copy communications with minimal overhead.

Himscipy commented 3 years ago

I'm not sure I follow. What kind of extra information would you expect? I'm sure @PhilipVinc would be happy to provide. I don't expect extra information, here I mean that in the introduction section you elucidate the NetKet example, but in the result section you only show the results of the shallow water equation. Providing the details of all the examples you list in the introduction section, will provide completeness to the paper. In addition, I didn't find any details on the NetKet in the documentation page also.

I have no problem with mentioning the mesh partitioning and grid points per worker. I'll add it to the manuscript. I'm not sure though what you mean by "a user can directly leverage from the tool". All interaction with communicators is handled on the user side via mpi4py, so there's nothing we add or take away.

I think that could be a nice result to add to the paper, to show that we actually achieve zero-copy communications with minimal overhead. Please add that in the paper, this will further strengthen the tool usability.

dionhaefner commented 3 years ago

I don't expect extra information, here I mean that in the introduction section you elucidate the NetKet example, but in the result section you only show the results of the shallow water equation. Providing the details of all the examples you list in the introduction section, will provide completeness to the paper. In addition, I didn't find any details on the NetKet in the documentation page also.

Sounds like there is a misunderstanding. NetKet and Veros are users of mpi4jax and not directly connected to the project. We just include a brief description of them in the paper to show what is possible with the library.

The shallow water example code has nothing to do with either of them, it's just that - an example.

I will update the paper with the other suggestions ASAP.

Himscipy commented 3 years ago

@kellyrowland I am done with my review and I wish good luck to the authors for this paper+software and hope that they continue developing the tool further. Please update the paper and I will be happy to take another look if needed. Thank you.

dionhaefner commented 3 years ago

@whedon generate pdf from branch joss-paper

whedon commented 3 years ago
Attempting PDF compilation from custom branch joss-paper. Reticulating splines etc...
whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

dionhaefner commented 3 years ago

I'm done with incorporating all the suggestions. Thanks again for the review @Himscipy!

kellyrowland commented 3 years ago

Thanks @Himscipy !

@dionhaefner can you please create an archive for this release version (on Zenodo, figshare, or elsewhere) and then post the archive DOI here?