DenSto opened this issue 3 years ago
A good place to get started on this is to look at how the MPI shared memory framework is used in response_matrix.fpp. Additionally, one should get familiar with the MPI communicators created in utils/mp.fpp, particularly comm_shared and comm_node.
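For anyone picking this up, here is a minimal standalone sketch of the MPI-3 shared memory pattern involved (this is not stella's actual code; the communicator and variable names are only meant to be analogous to comm_shared and comm_node): an intra-node communicator is built with MPI_Comm_split_type, a one-rank-per-node communicator with MPI_Comm_split, and a window visible to every rank on the node is allocated with MPI_Win_allocate_shared.

```fortran
program shared_window_sketch
  ! Minimal sketch of MPI-3 shared memory: split MPI_COMM_WORLD into an
  ! intra-node and an inter-node communicator, then allocate a window
  ! that every rank on the node can read and write directly.
  use mpi
  use iso_c_binding, only: c_ptr, c_f_pointer
  implicit none

  integer :: ierr, world_rank, shared_comm, shared_rank, node_comm
  integer :: win, disp_unit
  integer(kind=MPI_ADDRESS_KIND) :: win_size
  type(c_ptr) :: baseptr
  real, pointer :: phi_shared(:)
  integer, parameter :: nelem = 1000

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, world_rank, ierr)

  ! intra-node communicator (analogous to comm_shared in mp.fpp)
  call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
       MPI_INFO_NULL, shared_comm, ierr)
  call MPI_Comm_rank(shared_comm, shared_rank, ierr)

  ! communicator connecting ranks with the same on-node rank across nodes;
  ! for shared_rank == 0 this is one rank per node (analogous to comm_node)
  call MPI_Comm_split(MPI_COMM_WORLD, shared_rank, world_rank, node_comm, ierr)

  ! only one rank per node allocates the memory; the others attach to it
  disp_unit = storage_size(0.0) / 8
  if (shared_rank == 0) then
     win_size = int(nelem, MPI_ADDRESS_KIND) * disp_unit
  else
     win_size = 0
  end if
  call MPI_Win_allocate_shared(win_size, disp_unit, MPI_INFO_NULL, &
       shared_comm, baseptr, win, ierr)
  ! query rank 0's base address so all ranks point at the same memory
  call MPI_Win_shared_query(win, 0, win_size, disp_unit, baseptr, ierr)
  call c_f_pointer(baseptr, phi_shared, [nelem])

  ! all ranks on the node can now access phi_shared directly; reads and
  ! writes must be separated by synchronization (e.g. barriers or fences)

  call MPI_Win_free(win, ierr)
  call MPI_Finalize(ierr)
end program shared_window_sketch
```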
One more consideration: Due to the stellarator nature of stella, keep in mind that ky is actually the fast index for the distribution function/field arrays, so parallelizing over ky might not be the best idea. The beauty of the shared memory approach, however, is that one is free to parallelize over any of the spatial indices, wherever convenient, and if one decides to transpose the arrays around the spatial indices (i.e. put ky in the last local index), that can be done without any MPI communications, so no all-to-all communications are needed.
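To make the transpose point concrete, here is a hedged sketch (the routine name, array shapes, and index ordering are illustrative, not stella's) of reordering a node-shared array so that ky becomes the slowest index, using only local copies and intra-node barriers:

```fortran
subroutine transpose_ky_last(g_in, g_out, naky, nakx, nzed, shared_comm)
  ! Reorder a node-shared array so that ky moves from the fast to the slow
  ! index. Both arrays are assumed to live in the node's MPI shared window,
  ! and the loop over z is split among the on-node ranks, so no MPI messages
  ! are exchanged; only barriers on the intra-node communicator are needed.
  use mpi
  implicit none
  integer, intent(in) :: naky, nakx, nzed, shared_comm
  complex, intent(in)  :: g_in(naky, nakx, nzed)   ! ky is the fast index
  complex, intent(out) :: g_out(nakx, nzed, naky)  ! ky moved to the slow index
  integer :: iky, ikx, iz, ierr, nproc_shared, rank_shared

  call MPI_Comm_size(shared_comm, nproc_shared, ierr)
  call MPI_Comm_rank(shared_comm, rank_shared, ierr)

  ! make sure all ranks have finished writing g_in before reading it
  call MPI_Barrier(shared_comm, ierr)

  ! each rank copies its share of the z planes (round-robin distribution)
  do iz = 1 + rank_shared, nzed, nproc_shared
     do ikx = 1, nakx
        do iky = 1, naky
           g_out(ikx, iz, iky) = g_in(iky, ikx, iz)
        end do
     end do
  end do

  call MPI_Barrier(shared_comm, ierr)
end subroutine transpose_ky_last
```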
In spatial localisation we are including x, y, and zed?
Yes, that's right.
Some more thoughts:
stella yet, so we might be OK...
Currently stella's scalability is limited by its parallelisation of velocity space alone, relying on a redistribution between a space-local grid and a velocity-local grid. I think the fastest and most straightforward way to achieve more scalability with what's already in stella is to use a hybrid shared memory approach. While this is typically done with MPI + OpenMP, stella already takes advantage of MPI's shared memory framework using windows and mixed communicators, so I think an MPI + MPI approach would be much faster to get to production.
The idea is this: instead of distributing the velocity grid over the number of available cores, distribute it over the number of available nodes. Then, on a node, most operations can be parallelized over naky (and, when the time comes, nakx for the nonlinearity). This would require a number of small modifications in a few subroutines, rather than rewriting array sizes, creating new layouts and redistributors, etc...
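As a rough illustration of what "parallelized over naky on a node" could look like (none of these names are existing stella routines; the slab bounds imu_lo/imu_hi are placeholders for whatever velocity range a node ends up owning): each node holds a contiguous slab of the velocity grid, and within the node the ky loop is shared among the ranks of the intra-node communicator.

```fortran
subroutine advance_node_slab(g_shared, naky, nzed, imu_lo, imu_hi, shared_comm)
  ! Hypothetical node-level work sharing: the velocity grid is distributed
  ! across nodes (via the comm_node-like communicator), while within a node
  ! the ky loop is dealt out over the intra-node ranks.
  use mpi
  implicit none
  integer, intent(in) :: naky, nzed, imu_lo, imu_hi, shared_comm
  complex, intent(inout) :: g_shared(naky, nzed, imu_lo:imu_hi)  ! node-shared slab
  integer :: iky, imu, ierr, nproc_shared, rank_shared

  call MPI_Comm_size(shared_comm, nproc_shared, ierr)
  call MPI_Comm_rank(shared_comm, rank_shared, ierr)

  call MPI_Barrier(shared_comm, ierr)
  do imu = imu_lo, imu_hi
     ! each on-node rank takes a strided subset of the ky modes
     do iky = 1 + rank_shared, naky, nproc_shared
        ! placeholder for the real per-ky work (e.g. the implicit solve in z)
        g_shared(iky, :, imu) = 2.0 * g_shared(iky, :, imu)
     end do
  end do
  call MPI_Barrier(shared_comm, ierr)
end subroutine advance_node_slab
```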
The question that remains is whether this can be exploited for the velocity-local operations as well, such as the mirror term and collisions. In principle the former should be doable, since mu acts like ky there. For collisions this is less clear, unless the vpa and mu operators are always decoupled (as for the Dougherty operator).
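To illustrate why the mirror term should fit the same pattern (all names here are illustrative, and a simple centred difference stands in for whatever vpa discretisation stella actually uses): the vpa derivative couples points only along vpa, so mu, like ky, is a passive index, and the (ky, mu) pairs can be dealt out over the ranks on a node.

```fortran
subroutine mirror_term_sketch(g, dgdvpa, naky, nmu, nvpa, dvpa, shared_comm)
  ! Hedged sketch: split the passive (ky, mu) pairs over the on-node ranks
  ! while each rank keeps the full vpa direction for its pairs.
  use mpi
  implicit none
  integer, intent(in) :: naky, nmu, nvpa, shared_comm
  real, intent(in) :: dvpa
  complex, intent(in)  :: g(naky, nmu, nvpa)       ! node-shared, vpa local
  complex, intent(out) :: dgdvpa(naky, nmu, nvpa)
  integer :: iky, imu, ivpa, iwork, ierr, nproc_shared, rank_shared

  call MPI_Comm_size(shared_comm, nproc_shared, ierr)
  call MPI_Comm_rank(shared_comm, rank_shared, ierr)
  call MPI_Barrier(shared_comm, ierr)

  ! flatten the (ky, mu) pairs and deal them out round-robin over the node
  do iwork = 1 + rank_shared, naky * nmu, nproc_shared
     iky = 1 + mod(iwork - 1, naky)
     imu = 1 + (iwork - 1) / naky
     ! centred difference in vpa; one-sided at the ends
     do ivpa = 2, nvpa - 1
        dgdvpa(iky, imu, ivpa) = (g(iky, imu, ivpa + 1) - g(iky, imu, ivpa - 1)) / (2.0 * dvpa)
     end do
     dgdvpa(iky, imu, 1)    = (g(iky, imu, 2) - g(iky, imu, 1)) / dvpa
     dgdvpa(iky, imu, nvpa) = (g(iky, imu, nvpa) - g(iky, imu, nvpa - 1)) / dvpa
  end do

  call MPI_Barrier(shared_comm, ierr)
end subroutine mirror_term_sketch
```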