[joss] Editorial paper proofreading requests :)

Part of: https://github.com/openjournals/joss-reviews/issues/7286

Some very small proofreading changes from reading the paper. Overall it's great, just some suggestions that might improve the strength of the paper. The minor suggestions are up to you, take them or leave them, but I would like some clarification on the medium/general suggestions/questions for the purposes of making the paper as strong as it can be :)

Minor/grammatical suggestions:

L29: "who use ESNs implement them from scratch" -> "who use ESNs to implement them from scratch."
L35: "streamlined, based on the" -> "streamlined based on the"
L49: "optimization, in addition" -> "optimization in addition"
L51: "Bayesian Optimization" -> "Bayesian optimization"
L66-68: list is a bit awkward, maybe something like "map them across available resources, from a laptop to a distributed HPC cluster." - if i'm reading your purpose there correctly re: emphasizing how dask lets you scale from small compute to big compute?
L79: unclear to me if this is just showing another similar package, or if this is the one that you build xesn on top of as mentioned in L62? If it's the former, the wording is a bit awkward - "finally" in a list of two - could be something like "Another implementation is that of Arcomano et. al (2020, 2022, 2023), available at..." If it's the latter, it would be nice to make that more specific :)
L86: i think this should probably be "the methodology section of the documentation" - in case that link changes, someone can know what page you're referring to.
Fig 1 caption and elsewhere: "Walltime" -> "Wall time" in the caption (don't worry about changing the labels in the figure) - am aware this is jargon in profiling, but adding the space comes across as mildly less like jargon :)

Medium questions/suggestions:

Fig 1: I think what you mean with the dotted lines is that those are the theoretical minimums for memory? i.e. what one would expect if one were to just multiply the theoretical size of the dtype x params? Caption is not clear on that if so. I am also not sure what the equation is in the legend or how the a and b constants were empirically derived - would love to see some more description there.
Fig 1: Are there also dotted lines on the solid orange/green lines? I can't really see those. If the dotted lines overlap with the solid/dashed lines, it would be nice to say so in the caption: "dotted lines are hidden because the measured memory usage visually overlaps with the theoretical minimums"
Fig 1: Would it be possible to make the visual distinction between dashed vs. solid lines in the figure legend clearer? It just looks like my PDF is having trouble rendering. Maybe decrease the size of the dashes in the legend, or reposition the open parts of the dash so they fall on either side of the center dot rather than almost underneath it?
Fig 1: Are the memory measurements average or peak memory usages? peak memory use is usually the limiting factor, as it causes the run to crash, so that might be nice to clarify.
Fig 1: Are the memory measurements during GPU runs just the system memory usage, the GPU memory usage, or both? would like to have clarification in the legend.
L102 and section: Is "strong scaling" a specific term of art? I'm not sure what it means here.
L105-106: what do you mean by "more or less fixed?" Help me understand why it might not be fixed. Is it something like theoretically they are the same complexity of problem, but minor variation in some implementation details causes them to be insignificantly different?
L110: What does the default LocalCluster configuration end up meaning for your resources?

General suggestions:

It would be nice to have a link from the two experiments to the code in the repo used to conduct them - I see the scaling directory, and I think a permalink to that would be great for readers who might not have the repo in front of them/if you choose to change repo structure in the future.
Seeing the GPU wall time flat makes me think that most of that time is spent compiling the GPU code/booting the GPUs - you might consider an additional set of samples with very high Nr to show how large the system can scale on GPU. That would be for the purposes of showing off your work, so not required, just thought i'd let you know that's what i assume when i'm reading it and it may make for a stronger conclusion.

Thanks for these @sneakers-the-rat!

Minor suggestions

I agreed with all of the changes you suggested, and made those in the text. These suggestions definitely cleared up a few things, thanks! Just to answer a few questions (even though I think you got the gist of what was going on):

L66-68: list is a bit awkward, maybe something like "map them across available resources, from a laptop to a distributed HPC cluster." - if i'm reading your purpose there correctly re: emphasizing how dask lets you scale from small compute to big compute?

That's correct. I just added "cloud" to your suggestion in order to indicate that it can work in the cloud too.

L79: unclear to me if this is just showing another similar package, or if this is the one that you build xesn on top of as mentioned in L62?

Yeah it was just to list another implementation, so I went with your wording suggestion. Thanks!

Medium questions/suggestions:

Thanks for all of the suggestions regarding the Fig 1 caption, I'll add some answers here and then copy/paste the new text below (which I'll merge once you agree with the changes).
- The dotted lines are a bit misrepresented by the text - "theoretical" is not the way I should've described them. They are just there to provide the user some guidance on how much memory they might need. I actually derived the lines from the memory usage measurements, using constants that took into account the problem size (input dimension and hidden dimension). Hopefully the new caption clears that up, but let me know what you think.
- Yeah, those lines were overlapping visually and it was generally confusing. I changed this figure so that the measured values are just dots (CPU) or diamonds (GPU) and the derived lines are solid lines that lay on top. Hopefully this is a bit more clear.
- I modified the dash frequency, and also changed the markers so they're different between CPU and GPU. Let me know what you think about this.
- The memory usage is peak memory, definitely a good call to clarify that!
- Also yep, GPU memory just indicates peak memory usage on the GPU, since that's the usual bottleneck. Another good clarification
Here's the new figure and caption

Screenshot 2024-10-30 at 11 50 22 AM

Regarding strong scaling, yes this is a type of computational scaling. Formally it refers to scaling that can be determined by Amdahl's law, so I added the reference to that original paper to the text. It's a fairly commonly used term, but let me know if you'd like more description there
My usage of "more or less" is not really appropriate here, so I removed it - sorry about that. The problem size is most definitely fixed. Re-reading this months later, I can only assume that my "more or less" was referring to the fact that the way dask interacts with the problem can be less intuitive, but that's not really relevant in this context.

The defaults have been added, again good to know (even I've since forgotten...). Here's the new text:

On the CPU resource, a GCP `c2-standard-60` instance, the default
`dask.distributed.LocalCluster` has 6 workers, each with 5 threads.
On the GPU resource, a GCP `a2-highgpu-8g` instance, the default
`dask_cuda.LocalCUDACluster` has 8 workers, each with 1 thread.

General suggestions

Definitely agreed regarding the permalink, I added that to the first paragraph in the "Computational Performance" section.
Thanks for the suggestion on the larger experiments. Unfortunately I don't have enough GCP allocation right now to afford additional tests, even though you're right, this would show off the code more.

FYI @sneakers-the-rat the changes I made to the text can be seen in #80. The paper draft workflow is there, so a pdf can be found, e.g. here

They are just there to provide the user some guidance on how much memory they might need.

ah! yes that clears things up :) if they're just to give a sense of what parameters influence memory usage and to give me an estimate, that's very helpful! I might even suggest moving that to the main text of the paper or even to the docs - it's rare to get something like that for a package, for it to let me know this accurately how many resources it requires!

I changed this figure so that the measured values are just dots (CPU) or diamonds (GPU) and the derived lines are solid lines that lay on top.

Getting there! I truly don't mean to be a stickler on this, and you can feel free to disagree, but now the different plotting styles of left and right panel make it harder to parse for me - dot shapes are in general less salient markers of difference than color or line style, but the difference between GPU and CPU is arguably the most important contrast. Asking the question to myself "what is it that the gray/black estimate lines are doing for the figure," and I think they don't add a lot to the figure per se as much as I think they would be helpful as a heuristic for potential users. Their semantic content in the figure I think is to demonstrate that the empirical heuristic formula for memory use matches the measured samples, but since the parameters are empirical (rather than derived from something like {dtype_scalar_size}*{array_shape}*... where showing close match between theory and measurement is informative, I think they might serve you better outside of the figure entirely. If you take them of the figure and move them to the main text, then I think you might simplify the figure, caption, and also give a clearer indication to the reader that this formula might be useful to them rather than them thinking of it as a result that they are to read but not necessarily use :) (it is a result, but hopefully my meaning is clear).

I modified the dash frequency

the legend on the left is definitely clearer now! thanks! I am neutral on the dot shapes, up to you whether you like them or not :).

It's a fairly commonly used term, but let me know if you'd like more description there

If this is a term of art that you would expect your readers to know, that's good enough for me. (I'm also one of these newfangled anti-ivory-tower types who thinks it's not only fine but a favor to your reader to shortcut them to wikipedia articles for basic concepts they might need a refresher on, but that's definitely not for everyone so i only mention it as an aside, the added reference is great and i'm sure will be a meaningful indicator of what you mean to your readership!)

The problem size is most definitely fixed. ... "more or less" was referring to the fact that the way dask interacts with the problem can be less intuitive

I know this tension well. I think taking it out lets me see through to what you were saying more easily without getting distracted.

Unfortunately I don't have enough GCP allocation right now to afford additional tests, even though you're right, this would show off the code more.

that is totally fine. So fine that if "ran out of compute credits" was in a methods section as a limitation i wouldn't bat an eye. As a curiosity question, have you profiled to see if my hunch was right? that once the code starts running it finishes extremely quickly, and most of that time is spent in cuda compilation? or is there some other reason it's flat?

rest of the text looks great!

Hi @sneakers-the-rat thanks for all of the comments and feedback! Here's my reply for the last remaining questions/issues.

On Figure 1, I really appreciate your input, but I respectfully disagree. I think that having the black and gray lines in the plot gives the user an intuitive understanding for how well they can expect the equations to map to their application. I could make the two plots similar by deriving comparable curves for walltime, but I expect this to be less machine independent than memory.

I am having a change of heart, and will flip the colors to indicate CPU vs GPU, and the markers/linestyle to indicate system state size. But I do strongly prefer having the gray/black lines in the plot. Here's the latest version.

Screenshot 2024-10-31 at 11 59 38 AM

If this is a term of art that you would expect your readers to know, that's good enough for me.

I went in and added a more intuitive description, and please note that my previous edit included a citation to Amdahl's law (still there)

We then fix the problem size such that $N_r*N_g = 16,000$, so that
the timing results reflect strong scaling.
That is, the results show how the code performs with increasing resources on a fixed problem
size, which in theory correspond to Amdahl's Law [@amdahl_1967].

As a curiosity question, have you profiled to see if my hunch was right? that once the code starts running it finishes extremely quickly, and most of that time is spent in cuda compilation? or is there some other reason it's flat?

I think that you're right, that would be my guess too, but I haven't profiled the code to this level of detail.

I think that having the black and gray lines in the plot gives the user an intuitive understanding for how well they can expect the equations to map to their application.

Fair enough! It is your figure :). I like the swap of the colors/shapes, that makes more sense to me. Also putting the dots on top of the lines helps a lot.

What about using solid/dashed lines instead of black/gray? They're visually distinct enough that it wouldnt get confused with the solid/dotted on the left. Black/gray is just a bit hard to tell apart.

Also since the gray lines follow the CPU, and the black follows GPU, what do you think about replacing the legend labels for those lines with some readable shorthand for the equation like "calculated CPU memory" (or whatever you think is accurate, just an example) and then putting the equation in the caption? The equation in the legend already needs someone to read the caption to know what the terms are, so might as well give the eqs a name so people get a semantic handle for them?

Edit: one more thing - it's obvious if you think about it, but it may be worth saying in the caption that the dots/triangles overlap on the right plot but there are still two curves there. On the left the dashed line makes it clear there are two GPU samples, but on the right it just looks like one when I am zoomed out to full paper view. Up to you, I think it could be inferred from the legend, im just a fan of captions that anticipate visual confusion.

I could make the two plots similar by deriving comparable curves for walltime, but I expect this to be less machine independent than memory.

Agreed. This would probably be too variable to be useful.

I went in and added a more intuitive description, and please note that my previous edit included a citation to Amdahl's law

This looks great. I saw and went and did a bit of reading yesterday, the power of a good reference ❤️.

Hi @sneakers-the-rat, thank you for all of these comments!

What about using solid/dashed lines instead of black/gray? They're visually distinct enough that it wouldnt get confused with the solid/dotted on the left. Black/gray is just a bit hard to tell apart.

I appreciate the suggestion, but I will keep the lines solid, since the difference between dashed and solid lines indicates the difference between the two system sizes.

Also since the gray lines follow the CPU, and the black follows GPU, what do you think about replacing the legend labels for those lines with some readable shorthand for the equation like "calculated CPU memory" (or whatever you think is accurate, just an example) and then putting the equation in the caption?

Thanks for the suggestion! I would prefer to keep it how it is, with the equation clearly stated in the legend. I did however modify the following phrase in the caption to be more specific about which line goes to CPU and which to GPU:

"The gray and black lines indicate the general trend in memory usage for the CPU and GPU simulations, respectively.

Also, I added subtitles to the legend so that it's clear which line goes with CPUs and which goes with GPUs.

it may be worth saying in the caption that the dots/triangles overlap on the right plot but there are still two curves there

Thanks, I added a white markeredgeborder to the plot to make this more apparent.

Here is the latest figure and caption. I hope we have converged, but let me know if you have additional comments.

Screenshot 2024-11-01 at 9 37 03 AM

great, looks good to me :)

Thanks @sneakers-the-rat!

timothyas / xesn