openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal

[REVIEW]: A C++17 Thread Pool for High-Performance Scientific Computing #3291

Closed whedon closed 2 years ago

whedon commented 3 years ago

Submitting author: @bshoshany (Barak Shoshany) Repository: https://github.com/bshoshany/thread-pool Version: v1.5 Editor: @gkthiruvathukal Reviewer: @zeroset, @cedricchevalier19 Archive: Pending

:warning: JOSS reduced service mode :warning:

Due to the challenges of the COVID-19 pandemic, JOSS is currently operating in a "reduced service mode". You can read more about what that means in our blog post.

Status

(status badge image)

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/dea12b5c525e7107dde41178a87e42cb"><img src="https://joss.theoj.org/papers/dea12b5c525e7107dde41178a87e42cb/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/dea12b5c525e7107dde41178a87e42cb/status.svg)](https://joss.theoj.org/papers/dea12b5c525e7107dde41178a87e42cb)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@zeroset & @cedricchevalier19, please carry out your review in this issue by updating the checklist below. If you cannot edit the checklist please:

  1. Make sure you're logged in to your GitHub account
  2. Be sure to accept the invite at this URL: https://github.com/openjournals/joss-reviews/invitations

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @gkthiruvathukal know.

Please start on your review when you are able, and be sure to complete your review within the next six weeks, at the very latest.

Review checklist for @zeroset

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

Review checklist for @cedricchevalier19

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

whedon commented 3 years ago

Hello human, I'm @whedon, a robot that can help you with some common editorial tasks. @zeroset, @cedricchevalier19 it looks like you're currently assigned to review this paper :tada:.

:star: Important :star:

If you haven't already, you should seriously consider unsubscribing from GitHub notifications for this (https://github.com/openjournals/joss-reviews) repository. As a reviewer, you're probably currently watching this repository, which means that, with GitHub's default behaviour, you will receive notifications (emails) for all reviews 😿

To fix this do the following two things:

  1. Set yourself as 'Not watching' https://github.com/openjournals/joss-reviews:

(screenshot: repository watch settings)

  2. You may also like to change your default settings for watching repositories in your GitHub profile here: https://github.com/settings/notifications

(screenshot: notification settings)

For a list of things I can do to help you, just type:

@whedon commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@whedon generate pdf

whedon commented 3 years ago

Software report (experimental):

github.com/AlDanial/cloc v 1.88  T=0.04 s (48.9 files/s, 30918.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Markdown                         1            161              0            605
C/C++ Header                     1             51            210            238
-------------------------------------------------------------------------------
SUM:                             2            212            210            843
-------------------------------------------------------------------------------

Statistical information for the repository '73970e7a8f9b43c771040fa3' was
gathered on 2021/05/17.
The following historical commit information, by author, was found:

Author                     Commits    Insertions      Deletions    % of changes
Barak Shoshany                   6           583             84          100.00

Below are the number of rows from each author that have survived and are still
intact in the current revision:

Author                     Rows      Stability          Age       % in comments
Barak Shoshany              499           85.6          0.0               42.08

whedon commented 3 years ago

PDF failed to compile for issue #3291 with the following error:

 Can't find any papers to compile :-(

whedon commented 3 years ago

:wave: @zeroset, please update us on how your review is going (this is an automated reminder).

whedon commented 3 years ago

:wave: @cedricchevalier19, please update us on how your review is going (this is an automated reminder).

gkthiruvathukal commented 3 years ago

@zeroset and @cedricchevalier19, can you let me know how things are going with the review for this submission?

zeroset commented 3 years ago

@zeroset and @cedricchevalier19, can you let me know how things are going with the review for this submission?

Sorry for being late; I will try to review soon.

zeroset commented 3 years ago

@gkthiruvathukal, do you know how to produce the PDF from the markdown readme?

Unfortunately I am not able to mark the checkboxes.

gkthiruvathukal commented 3 years ago

@zeroset You can use the whedon generate command. You should be able to check the boxes. I can check with the EICs if you continue to experience problems with the checklist.

zeroset commented 3 years ago

I still can't mark the checkboxes. It seems my joss-reviews invitation has expired.

zeroset commented 3 years ago

@whedon generate pdf from branch paper

whedon commented 3 years ago

Attempting PDF compilation from custom branch paper. Reticulating splines etc...

whedon commented 3 years ago

:point_right::page_facing_up: Download article proof :page_facing_up: View article proof on GitHub :page_facing_up: :point_left:

zeroset commented 3 years ago

@whedon check references

gkthiruvathukal commented 3 years ago

@openjournals/joss-eics Can someone help to fix the problem where @zeroset cannot mark the checkboxes. Do I need to reinvite him as a reviewer?

danielskatz commented 3 years ago

@whedon re-invite @zeroset as reviewer

@gkthiruvathukal - yes ☝️

whedon commented 3 years ago

OK, the reviewer has been re-invited.

@zeroset please accept the invite by clicking this link: https://github.com/openjournals/joss-reviews/invitations

zeroset commented 3 years ago

Thank you @danielskatz

zeroset commented 3 years ago

Dear @gkthiruvathukal and @bshoshany, unfortunately I have doubts that this submission meets the requirement of substantial scholarly effort. The submitted software contains 238 lines of actual C++ code; the remainder is comments and markdown. Referring to the JOSS guidelines, the submission should contain at least 300 lines of code, or preferably 1,000 LOC.

bshoshany commented 3 years ago

Dear @zeroset, the relation between code size and effort with respect to this submission was previously discussed in the pre-submission enquiry. There, I explained that making the code as short as possible was an explicit design choice, in order for the thread pool to have the smallest memory footprint possible. That's not a bug, it's a feature that I actually worked quite hard to achieve.

@gkthiruvathukal and @mjsottile both commented in the pre-submission enquiry that code size is not always a measure of effort or utility, and that shorter and more concise implementations reflect a higher degree of effort in order to achieve brevity. @gkthiruvathukal further commented that the effort put into making the very detailed and well put-together documentation should also count in the submission's favor. They therefore suggested that the software should nonetheless be considered to be within the scope of the journal, a suggestion that @danielskatz concurred with. Please see the full discussion for more details.

zeroset commented 3 years ago

OK, sorry, I didn't know about the discussion that took place upfront. I will continue my review ;)

bshoshany commented 3 years ago

My bad, I should have mentioned it earlier!

zeroset commented 3 years ago

@bshoshany, could you provide the code for the performance tests you mention in the readme? If you don't mind, I could provide you with performance results on an ARM-based machine.

gkthiruvathukal commented 3 years ago

Thanks, @zeroset. Yes, just to clarify, I was one of the editors who advocated for this submission to be considered. While we have guidelines, submissions are always considered based on intellectual merit and their potential to advance research software. Having worked on lightweight libraries/frameworks for concurrency and threading myself, I saw significant merit in reviewing this submission.

So thanks for proceeding with review!

To our esteemed reviewers, for future reference, you can see the pre-review discussion atop this review issue's thread. This will show the full discussion leading to our decision to review:

https://github.com/openjournals/joss-reviews/issues/3267

bshoshany commented 3 years ago

@bshoshany, could you provide the code for the performance tests you mention in the readme? If you don't mind, I could provide you with performance results on an ARM-based machine.

Sure, I'd love to be able to add results for additional architectures! The code for the performance tests is in the repository for my multithreaded matrix class template. I just updated it with the latest version of that template and the test code. For the tests of the thread pool I took the optional third argument of measure_operations() to be false, since I did not need to test the performance of the expression templates (which is only relevant for the matrix class itself).

Also, I noticed that you commented in this issue regarding unit tests for the thread pool class. Were you implying that the unit tests would be needed to pass the review? I was planning to add them in a future release, because I'm a bit busy with other projects right now, but I can try to expedite them if you think they are necessary for the review process.

gkthiruvathukal commented 3 years ago

@bshoshany From my point of view, I would at least like to see the beginning of unit tests in place, even if there is not 100% coverage. I think unit testing is a necessity for all code, but especially for C/C++ libraries and frameworks. I've been using GoogleTest in my advanced systems programming course, and it's really good (and surprisingly easy to use once you have things set up right). If you can at least get your initial tests in place, I think this would satisfy me and the reviewers.

I have a minimum viable application for Google Test. You might find it useful as a starting point. (It's really basic!) I actually use Google Test (a C++ framework) to test C code, too.

https://github.com/gkthiruvathukal/googletest-mva

zeroset commented 3 years ago

I think at least some tests are necessary for a software publication. @gkthiruvathukal posted a good start on that one. If you don't want to use googletest, you could also have a look at Catch2. For the performance measurement, I would like to see the benchmark script in the repository as well. Without it, the performance results are neither reproducible nor verifiable. Unfortunately, I am also quite busy at the moment, which is why I can only do the review step by step.

bshoshany commented 3 years ago

Dear @gkthiruvathukal and @zeroset:

I would at least like to see the beginning of unit tests in place, even if there is not 100% coverage.

That sounds fair. But maybe I can just use existing tests I already wrote, and no third party tools, at least for now?

Like many authors publishing in this journal, I'm a scientist by training, not a professional software developer. This would be the first time I'm developing unit tests, and unfortunately I am not yet familiar with tools such as GoogleTest, Catch2, or GitHub Actions (which @zeroset mentioned in the issue).

As easy as these tools are to use, and as much as I want to advance my software development skills by learning them, it would still take me some time. Developing the tests themselves would also take a considerable amount of time. As I said above, I have other projects to work on this summer (research, course development, grant applications...), some of which have upcoming deadlines, so I won't have time to undertake this task until at least September.

However, there is already a performance test, which I used to generate the results in the documentation, and there is a large number of working code examples in the documentation, which can be used for testing every single function of the thread pool library.

Therefore, would it be okay if, instead of an automated test using third party tools, I will simply combine the existing tests into one big .cpp file, which will (automatically) run tests for each feature, and include that file in the repository? I'm afraid that's really the best I can do at the moment, unless you want to postpone the review by a few months 😕

For the performance measurement, I would like to see the benchmark script in the repository as well. Without it, the performance results are neither reproducible nor verifiable.

It does exist, just in a different repository.

There are two projects, the thread pool class and the matrix class, and they are being developed completely independently, so I don't want to merge the repositories. It just so happens that matrix operations are a great way to benchmark multithreading.

All you need to do is have thread_pool.hpp, matrix.hpp, and matrix_test.cpp in the same folder, and you can run the test and get the results, although your results will, of course, be different than mine (I'm looking forward to seeing the results you get on the ARM architecture!)

Since I would prefer to keep the two projects in two separate repositories, would it be sufficient, for the purpose of this review, if I just added the performance test files (matrix.hpp and matrix_test.cpp) to the "paper" branch of the thread pool repository, but not to the master branch?

Thanks!

bshoshany commented 3 years ago

Hi @gkthiruvathukal and @zeroset,

I updated the paper branch to v1.8 with a few small changes and bug fixes, including an additional line in paper.md to indicate that submit() now forwards exceptions from tasks.

I have not yet received a reply from either of you to my comment from about a month ago. Are you able to proceed with the review without the unit tests, which I will prepare as soon as I have time, but probably not until a couple of months from now?

Thanks, Barak

gkthiruvathukal commented 3 years ago

@bshoshany I would be able to live without unit tests as long as there are some clear examples that can demonstrate the entire API is working properly and the expected outputs being produced.

bshoshany commented 3 years ago

@gkthiruvathukal sounds fair! I will collect all the examples from README.md into one .cpp file that you can execute once to test every single aspect of the library. I'll let you know when I've uploaded it. Thanks again!

gkthiruvathukal commented 3 years ago

@bshoshany Consider the possibility of having separate test programs that show how to use various aspects of the library. I always like concise examples that serve the dual purpose of performing tests and doing proper documentation. But you can also have an all-in-one driver (perhaps a script) that runs the various examples and ensures they perform the expected mission!

bshoshany commented 3 years ago

@gkthiruvathukal If it's okay with you, it would be easier for me, for now, to have one source file that runs all the examples instead of separate source files executed one by one via a script - the end result will be the same.

The only caveat is that they're only examples and not automated tests, so in order to see if they work or not, you will have to look at the output and compare it with what you expect to get. But it would still be a good way to ensure every aspect of the library works properly by just running one command, and will also be great help for me in future updates, to make sure I'm not breaking anything whenever I make a change. The expected outputs are also in the README.md, but I will add a separate file with the expected output of the entire test program, for convenience - including the outputs of any new examples I'll add in the process. This will also serve as extra demonstrations for the users. I will automate the tests as soon as I have more time (which will be in a couple of months). Will that be satisfactory for the purpose of the review?

gkthiruvathukal commented 3 years ago

@bshoshany I know. Concurrency testing is hard. I think this is not a problem. Proceed!

bshoshany commented 3 years ago

@gkthiruvathukal Will do! :)

bshoshany commented 3 years ago

Dear @gkthiruvathukal and @zeroset: I just uploaded a new release, v2.0.0. I had a bit of free time, so I started working on some tests and eventually decided to try to automate them. It was so much fun that I ended up spending 2 full days on it, and I'm very happy with the result!

The file thread_pool_test.cpp will perform automated tests of all aspects of the package, and benchmark some multithreaded matrix operations. In addition, the code is thoroughly documented, and is meant to serve as an extensive example of how to properly use the package.

I hope this will be sufficient for you to be able to continue the review! :) Please let me know what you think, and if there's anything else you would like me to do to help the review process.

gkthiruvathukal commented 3 years ago

@bshoshany I'm on vacation through next week but this seems like a great effort to address concerns raised by @zeroset during review. Please proceed!

bshoshany commented 3 years ago

@gkthiruvathukal Thanks, have a nice vacation! :)

zeroset commented 3 years ago

Sorry for my late reply, the summer was quite busy for me too.

I still have problems with the "substantial scholarly effort" for this submission. Even without taking LOC into consideration, I do not see a substantial new contribution. Actually, none of the bullet points in the guidelines is fulfilled. Let me describe this from different perspectives:

In conclusion, I do not see the "substantial scholarly effort" in this submission, which is why I would reject it.

Additional remarks:

In your performance measurements, you state: "compiler optimizations already parallelize simple loops fairly well". The compiler does not introduce any concurrency; it only uses vectorization (SIMD) to "parallelize" your code. This argument does not hold, because your multithreaded version could utilize vectorization as well. In particular, you should at least be able to use SIMD, since otherwise you leave a lot of performance on the table (a factor of 4 with AVX2 and a factor of 8 with AVX-512).

Additionally, I ran the benchmark on a machine with an AMD EPYC 7702P 64-core CPU and 256 GB of RAM (g++-10 on Debian 11). At least for adding and transposing, I cannot reproduce the characteristics of your benchmark. You can see the results here:

===================================
Performing matrix performance test:
===================================
Using 126 out of 128 threads.
Determining the optimal sleep duration........................
Result: The optimal sleep duration is 900 microseconds.

Adding two 25200x25200 matrices 20 times:
With   1  task, mean execution time was  787.6 ms with standard deviation 15.8 ms.
With  31 tasks, mean execution time was  182.2 ms with standard deviation 12.4 ms.
With  63 tasks, mean execution time was  188.4 ms with standard deviation  7.2 ms.
With 126 tasks, mean execution time was  220.3 ms with standard deviation  1.6 ms.
With 252 tasks, mean execution time was  222.1 ms with standard deviation  1.3 ms.
With 504 tasks, mean execution time was  223.4 ms with standard deviation  0.9 ms.
Maximum speedup obtained: 4.3x.

Transposing one 25200x25200 matrix 20 times:
With   1  task, mean execution time was 1513.4 ms with standard deviation 92.8 ms.
With  31 tasks, mean execution time was  182.9 ms with standard deviation 17.5 ms.
With  63 tasks, mean execution time was  213.1 ms with standard deviation 17.5 ms.
With 126 tasks, mean execution time was  334.0 ms with standard deviation 10.7 ms.
With 252 tasks, mean execution time was  264.8 ms with standard deviation  2.5 ms.
With 504 tasks, mean execution time was  275.1 ms with standard deviation  2.7 ms.
Maximum speedup obtained: 8.3x.

Multiplying two 3150x3150 matrices 20 times:
With   1  task, mean execution time was 81739.0 ms with standard deviation 1449.4 ms.
With  31 tasks, mean execution time was 3334.8 ms with standard deviation 90.8 ms.
With  63 tasks, mean execution time was 1491.8 ms with standard deviation 31.0 ms.
With 126 tasks, mean execution time was 1065.0 ms with standard deviation 51.1 ms.
With 252 tasks, mean execution time was 4980.1 ms with standard deviation 44.6 ms.
With 504 tasks, mean execution time was 4843.4 ms with standard deviation 47.2 ms.
Maximum speedup obtained: 76.8x.

Generating random 25200x25200 matrix 20 times:
With   1  task, mean execution time was 8975.4 ms with standard deviation 38.3 ms.
With  31 tasks, mean execution time was  368.9 ms with standard deviation 11.9 ms.
With  63 tasks, mean execution time was  218.4 ms with standard deviation  3.4 ms.
With 126 tasks, mean execution time was  167.8 ms with standard deviation  2.3 ms.
With 252 tasks, mean execution time was  165.0 ms with standard deviation  2.1 ms.
With 504 tasks, mean execution time was  149.1 ms with standard deviation  2.7 ms.
Maximum speedup obtained: 60.2x.

Overall, multithreading provided speedups of up to 76.8x.

+++++++++++++++++++++++++++++++++++++++
Thread pool performance test completed!
+++++++++++++++++++++++++++++++++++++++

bshoshany commented 3 years ago

Dear @zeroset,

I'm sorry to hear that you think the submission should be rejected.

Certainly, there are other thread pool libraries out there... but my library was designed to be lightweight, header-only, simple, and easy to use. Other libraries may provide better performance, but they have none of these traits. In fact, that's exactly why I decided to write my own thread pool - I wanted something small and simple for my own use, and couldn't find anything good online.

Kokkos, for example, has 944 (!) files in the repository, needs to be installed and built before using, and the API is extremely complicated. Meanwhile, using my thread pool is so much simpler - just download the single header file and include it, and that's it. All the information about the API is in a single README file, and is written in a simple and user-friendly way with many examples.

Will a well-established library, like the ones you mentioned, produce better performance? I'm sure it will, but it will also take much longer for the programmer to understand how to use it. And I think that's a feature that is actually very important for many people, especially people in academia (this paper's target audience), who may not be the most experienced programmers!

Can more functionality be added to the library, such as all the features you suggested? Undoubtedly so. And perhaps I will introduce some of them in future versions - I'm always working on improving the library and adding new features. But should all those features be added? I'm not sure. If I introduce too many features, then the library will become complicated and bloated, and that is exactly what I want to avoid.

My philosophy is that if someone wants all those advanced features, then there are already established libraries that have all of them, and they should definitely use those libraries. But if what they want is a simple and lightweight library that they can start using immediately after spending 20 minutes reading the short documentation, then they might choose to use my library instead!

Indeed, the library already has almost 200 stars on GitHub, 27 forks, and had 241 clones in the last two weeks alone. People are actively using it in their code, which means they chose it over other established libraries. If you read through the issues and pull requests, you will see that some users praise the library exactly for its simplicity.

That's what's special about my library, and what sets it apart from others. And most importantly, as I said before, its simplicity makes it potentially more useful than other libraries to scientists who are not experienced coders - which is exactly why I want to publish it on JOSS in the first place!

Finally, regarding performance, my claim that "compiler optimizations already parallelize simple loops fairly well on their own" was based on this text from the GCC documentation:

-ftree-parallelize-loops=n Parallelize loops, i.e., split their iteration space to run in n threads. This is only possible for loops whose iterations are independent and can be arbitrarily reordered. The optimization is only profitable on multiprocessor machines, for loops that are CPU-intensive, rather than constrained e.g. by memory bandwidth. This option implies -pthread, and thus is only supported on targets that have support for -pthread.

But you are correct that this is not actually what's happening - I wrongly assumed this was enabled automatically with -O3 because that would explain why multithreading only produces a substantial speedup when optimizations are disabled. But after reading the description again now, I realize that this is not an option that is enabled automatically (in particular, one has to specify the number of threads n explicitly, so it doesn't even make sense for this to be enabled by default). That's a mistake on my part, and I'll fix it in the documentation in the next version.

Nevertheless, the fact that there is no substantial speedup for addition and transposition in this test does not indicate that there is anything wrong with the thread pool itself, since it clearly works very well for other operations (as well as for other calculations I'm doing as part of my ongoing research projects, such as relativistic ray tracing, where I am also seeing a very substantial speedup thanks to the thread pool). I imagine that the performance can be improved by adding some of the features you mentioned and utilizing them in the addition/transposition algorithm, and perhaps even just by improving the algorithm itself. I promise to look into that as soon as I can.

Still, for multiplication and random matrix generation the speedup is very substantial in all the systems I tested the library on, and even in your own test you get a 76.8x improvement for multiplication - which is 120% of the number of cores in your CPU and thus very close to the theoretical maximum hyperthreading speedup of 130%!

I hope you will consider rescinding your rejection of my submission based on these arguments. But even if not, I thank you anyway for taking the time to review my submission :)

gkthiruvathukal commented 3 years ago

Hi, I am just chiming in briefly here.

@zeroset, your points are valid; however, we already decided to put this through review and we rarely reject. If an author cannot overcome the issues arising from feedback, it is their job to pause or withdraw the submission. The "scholarly effort" may appear borderline, but I'm going to stand by my original assertion (from pre-review) that lightweight frameworks are worthy of consideration. JOSS does not have a novelty requirement when it comes to the software being submitted. We do, however, expect authors to explain how their work will help when it comes to other research software efforts. That is, has this library been used successfully or is there reason to believe it can be used successfully in other projects?

@bshoshany I think it is important to note that @zeroset is recommending reject, but you have a chance to rebut the points and try to make your case. There is also a second reviewer who may or may not share this opinion. I would like to see more discussion happen here and to hear both reviewers follow up to your response. Perhaps you can also give some thought to make it clear who the audience is and at least make mention of the related work and why your approach plays an important role in this discussion. (I, for one, mostly avoid complex frameworks in my own work, especially in lower-power computing environments.)

One area of concern, to me, is going to be the reproducibility issues raised by @zeroset. If this is going to be an emphasis of your work, the results do need to be reported rigorously to support your claims. You may also want to be more clear about what scaling can be expected, realistically, and in what environments. For example, would your approach make sense in non-HPC environments, e.g. edge devices with smaller core counts, etc.?

Let's please keep the discussion going and figure out what we can do to improve the submission. But I will ask that we not reject the submission based on matters discussed in pre-review and whether the work is novel. This is what sets JOSS apart from many other venues. Competing approaches can be accepted with proper rationale and justification.

bshoshany commented 3 years ago

@gkthiruvathukal Thank you for the clarifications! I'm happy to know that the submission will not be rejected! :)

I believe I have already rebutted all of the points made by @zeroset and made my case in my previous comment.

There are no issues of reproducibility. The fact that matrix addition does not enjoy much of a speedup due to multithreading compared to matrix multiplication is well-known, and is reflected in my own tests as given in README.md. On my PC I get 2.3x for addition vs 16.2x for multiplication, on an HPC node I get 13.0x vs. 52.7x, and on @zeroset's system he gets 4.3x vs 76.8x. So the fact that @zeroset also got a large discrepancy between the two operations means that he did reproduce the expected results.

As I said above, I'm not entirely sure why matrix addition doesn't enjoy the huge speedup that matrix multiplication enjoys. I've been thinking about it, and my theory is that matrix multiplication, which works by taking the dot product of the rows of the first matrix with the columns of the second matrix, can benefit from first copying the individual row and column to the individual core's L1 cache, and then operating on them independently of the other threads. Copying to the cache takes some time, but a negligible amount compared to the dot product itself.

On the other hand, matrix addition simply adds the elements one after the other, and if the particular range of elements being added is copied to the cache, that doesn't really speed things up since the addition itself is so fast anyway. So the threads are all trying to access different locations in the same memory region at the same time, which significantly impedes performance.

In any case, the lack of speedup in matrix addition is due to limitations of the addition algorithm itself, and (as far as I can tell) does not indicate any issues with the thread pool implementation. As you can see, matrix multiplication does achieve almost the maximum speedup possible in both my and @zeroset's tests, which means the thread pool itself is working as expected and there are no issues with its performance.

Regarding the expected scaling, it clearly strongly depends on the algorithm, as the matrix addition vs. multiplication results show. Having more cores makes matrix multiplication proportionally faster, but the same does not apply to matrix addition. With the right algorithm, the user can expect a speedup with any number of cores greater than 1, including low-core-count / non-HPC environments (verified on my laptop with 2 cores and 4 threads).

Here are some improvements I can make to the submission based on your and @zeroset's comments:

  1. I can revise paper.md to emphasize that the whole point of this thread pool library is that it is simple and lightweight, and compare it with some of the bigger and more complex libraries mentioned by @zeroset.
  2. I can add a clarification to paper.md that the expected scaling strongly depends on the algorithm, as discussed above, and that the library can work with both low and high core count.
  3. I can find another matrix operation (perhaps determinant?) that sees a substantial speedup from multithreading, and add it to the performance test. This will provide additional evidence that the thread pool produces a significant and scalable speedup with the right algorithm.

@gkthiruvathukal and @zeroset, please let me know if you think these improvements will resolve the issues, and if there is anything else I should do to improve the submission. Thanks!

bshoshany commented 3 years ago

Dear @gkthiruvathukal, @zeroset, and @cedricchevalier19,

To add to my previous post - there is a simpler explanation for why matrix addition isn't seeing a substantial performance boost. Although the thread pool eliminates the substantial overhead of creating and destroying threads, there is still some parallelization overhead: splitting the operation into tasks, submitting them to the pool, and having the threads pick them up and execute them. Since matrix addition is a very simple linear operation that doesn't take much time to execute on its own, this overhead negates much of the benefit of multithreading.

If single-threaded matrix addition takes N ms, then splitting it among T threads means that (ideally) each thread runs for N/T ms. However, if the overhead of parallelization is X ms, then the total runtime will be (N/T + X) ms. If X > N(1 - 1/T), there is no benefit at all to multithreading. But even if X is smaller, there are diminishing returns: as T grows, the runtime approaches X, so the speedup can never exceed N/X. This is why we do see some speedup (around 4x in @zeroset's test), but presumably in that case X ≈ N/4, so for larger T the runtime is dominated by the overhead and there is little additional benefit.

As a test of this hypothesis, we can artificially make matrix addition take longer, for example by repeating the addition operation multiple times for each element:

pool.parallelize_loop(
    0, a.rows * a.cols, [&a, &b, &c, &num_add](const ui64 &start, const ui64 &end)
    {
        for (ui64 i = start; i < end; i++)
            // Repeat the addition num_add times to artificially increase
            // the work per element; the result in c[i] is unchanged.
            for (ui32 n = 0; n < num_add; n++)
                c[i] = a[i] + b[i];
    },
    num_tasks);

When I run this test, I find that by increasing num_add I can essentially get as much speedup as the CPU can provide - in my testing on a 12-core/24-thread CPU I was able to get all the way up to a 19x speedup! The reason is that when the operation takes longer, the overhead becomes negligible relative to the operation itself.

It's been a while since my last post, and I'm still waiting for your reply. Furthermore, it's been 4 months since the start of this review, and I still haven't received any comments from the other reviewer, @cedricchevalier19. I would love to be able to advance this review further, and as I said, I'm more than happy to make improvements to both the paper and the code based on your comments, as detailed in my previous post.

Looking forward to hearing from all of you! Barak

gkthiruvathukal commented 3 years ago

@zeroset I was wondering whether you have any follow-up to @bshoshany's latest comments. It seems like Barak is working to address your feedback. Would the improvements he is suggesting (two comments ago) be sufficient to move the needle forward?

Also, we definitely need to hear from @cedricchevalier19. Can you let us know your input on this submission?

@bshoshany Can you speak more specifically to research projects (your own or others) that are making use of your framework? I'm assuming the ultimate target for this is not HPC workflows but would like to have a better idea of where this framework makes the most sense.

bshoshany commented 3 years ago

@gkthiruvathukal thank you so much for the quick reply!

Regarding research projects: The whole reason I started writing this thread pool in the first place was to use it in my own research project, which was intended to implement general-relativistic ray-tracing in order to visualize interesting spacetime geometries such as black holes, wormholes, and warp drives. Unfortunately, that project is still in the planning stage, since I never found the time to start working on the ray-tracing code itself :(

The thread pool library has a Zenodo DOI and an arXiv ID, but as far as I know it has not received any citations so far. Therefore, I am not aware of any specific research projects making use of my library. However, keep in mind that the library has only been out for less than 5 months (the first release was April 24, 2021), and only became popular in the last few months. That's not enough time for users to publish papers based on it. Nevertheless, the thread pool is certainly being used by many people - it currently has 226 stars - but I don't know whether they are using it for scientific software or for other purposes.

Regarding target: I believe this framework is suitable for use with any kind of multithreaded CPU - from a low-core laptop, through a mid-core desktop, up to a high-core HPC node. Indeed, I tested it and found that it worked well on all three types of systems: my 2-core/4-thread laptop, my 12-core/24-thread desktop, and a 40-core/80-thread Compute Canada node. (Of course, the thread pool is only intended to be used for multithreading on a single computer, not on distributed systems.)

In terms of the target user base, I am mainly aiming at users who may not be too experienced with C++ and/or multithreading, and are looking for a library that is lightweight, simple, easy to use, and quick to learn. This is why it is important to me to have a simple and compact interface, thorough documentation with comprehensive examples, and detailed comments in the code itself. It is also why the library is distributed as a single header file, so that people can start using it immediately, without any complicated setup.

gkthiruvathukal commented 2 years ago

@bshoshany Thanks for your follow-up. I apologize for the delay on my end, as September is the craziest month in my academic year (on average). I'm satisfied with your responses. The number of stars on your project is impressive; it does seem like there are some who are using your library.

We do not require citations of your existing Zenodo archive. One of the goals of JOSS is to improve the likelihood of citations, assuming successful peer review.

@zeroset Can you take a look at @bshoshany's response above and let me know if this moves you either way. I actually see the points you are making. I think matrix multiplication is not, in general, the most compelling use case for a thread pool, which is probably why the results seem a little underwhelming.

@cedricchevalier19 We still have no input from you for this submission. If I don't hear from you soon, I need @bshoshany to help identify a new reviewer to take your place. We'd really benefit from your input at this stage.

cedricchevalier19 commented 2 years ago

Sorry for the huge delay; I was unable to test the software until recently. I caught up today and will provide some comments shortly. Sorry again.

cedricchevalier19 commented 2 years ago

I tend to agree with @zeroset's concern about the "Substantial scholarly effort" criterion.

I like the fact that the library is small and self-contained; however, I think the minimal "thread pool" goal is somewhat missed. By providing advanced features such as parallelize_loop, it does more than just thread pooling, and will thus be compared against Kokkos, TBB, SYCL, or even OpenMP. Once we have a parallel for, users will expect further parallel constructs...

Furthermore, C++17 provides parallel semantics through the parallel STL, which is now even implemented for GPUs (for example, see http://www.mycpu.org/stdpar-c++/).

To clarify my thoughts: I was expecting a library that just performs smart ordering and caching of threads.

However, it looks like a task-based framework that does not allow expressing dependencies. On the thread-pooling side, a lot more could be done, such as improving hardware locality. StarPU provides such features, but without the nice C++ interface of the reviewed project.

Some more comments about the explanations in the README:

To put it in a nutshell, I think this library has a nice API for basic thread management. However, it does not look particularly well suited for high-performance scientific computing. The C++17 parallel STL, Kokkos, and SYCL provide more functionality and probably more performance portability (dealing with NUMA effects, task dependencies, offloading, ...). A clear statement on how this library compares against state-of-the-art approaches is mandatory for HPC considerations.

However, for small projects, or on almost-embedded architectures, this library can make sense.

@gkthiruvathukal, sorry again for the delay. I have somewhat mixed feelings about this review: the library is interesting and, thanks to @bshoshany's efforts, easy to use. However, from a scientific point of view, there is nothing really new, and I am not sure that the author knows how his work compares to classic alternatives. Reading https://joss.readthedocs.io/en/latest/submitting.html#substantial-scholarly-effort one more time, I think this library is a "utility" package, which I tend to consider as "minor".

gkthiruvathukal commented 2 years ago

@cedricchevalier19 Thank you for your detailed review. It is becoming increasingly clear that the primary issues in this submission are:

I must admit that I, too, struggled with the second concern (more than the first). I also concur with your and @zeroset's assessment that the matrix multiplication example does not amount to a compelling use case for a thread pool. Even in my own projects, thread pools are best suited to irregular workloads.

The question before us is whether this submission can be accepted. I think it can (with the understanding that some major modifications are truly needed). I must stress that novelty is not an absolute requirement for acceptance (or rejection) at JOSS. I realize that it is a borderline situation in terms of scholarly effort, but we did discuss it in pre-review and determined we could move forward based on the code size and well-put-together documentation.

As I see it, there are a few strengths to this submission, despite the weaknesses identified above.

One thought I have is that we could give @bshoshany the option of refocusing the submission in a couple of ways, but these might not be agreeable. Nevertheless here is what I think:

My feeling is that all of these would require significant effort by @bshoshany. I'd also like to know whether @zeroset and @cedricchevalier19 agree with my synthesis. At JOSS, once we put something through peer review, we really try to give authors a chance.

cedricchevalier19 commented 2 years ago

@gkthiruvathukal I totally agree with your synthesis.

As for an example of an irregular workload, a sparse matrix-matrix multiplication or a sparse triangular solve would be better suited to exploiting a thread pool.

@bshoshany's work is clearly of interest, and it would be great to be able to cite it with a JOSS reference.