Closed bjoo closed 1 year ago
Do you know if initCommsGridQuda is ever called here? How is QUDA initialized?
Hi All, I have several reports of folks coming across this error:
Initializing QUDA device (using CUDA device no. 0)
ERROR: Current communicator can't be found. (rank 2, host frontier00104, communicator_stack.cpp:37 in Communicator &quda::get_current_communicator()())
The reports come primarily from HIP builds, but I am also wondering whether it is more of a QDP-JIT interaction, since I seem to recall QDP-JIT had to start QUDA in two phases. I have reproduced it on Frontier and am looking further, but a quick discussion of where this change came from would be very helpful.
Hi, we also had the same issue. As a temporary workaround we added the declaration int comm_rank_from_coords(const int *coords); to "include/comm_quda.h". This version does not require a Topology pointer, which would otherwise have to come from quda::get_current_communicator().
The initialization code looks like so (Chroma + QDP-JIT). The code is from chroma/lib/init/chroma_init.cc with some obfuscating #ifdef’s removed.
std::cout << "Setting CUDA device" << std::endl;
int cuda_device = QDP_setGPU();
std::cout << "Initializing QMP part" << std::endl;
QDP_initialize_QMP(argc, argv);
setVerbosityQuda(QUDA_SUMMARIZE, "", stdout);
QDPIO::cout << "Initializing QUDA device (using CUDA device no. " << cuda_device << ")" << std::endl;
// Init QUDA device
initQudaDevice(cuda_device);
QDPIO::cout << "Initializing QDP-JIT GPUs" << std::endl;
QDP_startGPU();
QDPIO::cout << "Initializing QUDA memory" << std::endl;
initQudaMemory();
There is no explicit call to initQuda(). There are some historical reasons for doing it this way (I think one was wanting to start the comms for QDP++ here rather than in QUDA), but I don't remember them fully. Frank, do you recall the reason?
In any case, I am wondering if this came in with feature/tune_rank. Right now I have something working after rolling back to
commit 3f560f0e6cd2d33c7d1efedee076099e32f1dde7 (HEAD)
Merge: 493466fa7 830c9dab1
Author: maddyscientist
Date: Fri Apr 7 17:52:53 2023 -0700
Merge pull request #1369 from lattice/hotfix/stackframe
Hotfix/stackframe
I will continue bisecting until I find the offending commit.
Balint Joo, Oak Ridge Leadership Computing Facility, Oak Ridge National Laboratory P.O. Box 2008, 1 Bethel Valley Road, Oak Ridge, TN 37831, USA email: joob AT ornl.gov. Tel: +1-757-912-0566 (cell, remote)
In reply to Jiqun Tu's comment of Friday, April 28, 2023, 12:13 PM on lattice/quda Issue #1375, "Error finding a Communicator in quda::get_current_communicator() when running Chroma + QDP_JIT": https://github.com/lattice/quda/issues/1375#issuecomment-1527788972
I believe it is due to this commit: https://github.com/lattice/quda/commit/25a83a6f7a610199e9e8d44e2d4e537e35b113ad. comm_rank() is tied to a specific communicator, but at the point where it is called no communicator exists yet. comm_rank_global() is not tied to communicators. @maddyscientist do you know why comm_rank_global() is not sufficient here?
I think I can fix this with an explicit call to initCommsGridQuda from Chroma before calling initQudaDevice().
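The ordering problem described above can be illustrated with a small self-contained mock. This is only a sketch under stated assumptions: MockCommunicator, MockCommsStack and mock_init_device are hypothetical stand-ins, not QUDA classes or functions, and they model only the idea that a communicator-bound rank query fails until a process grid has been declared (the role that initCommsGridQuda plays in the proposed fix).

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a communicator; NOT QUDA's real class.
struct MockCommunicator {
  std::vector<int> dims; // logical process grid
  int rank;              // rank of this process within the grid
};

// Hypothetical stand-in for an internal stack of communicators.
class MockCommsStack {
  std::vector<MockCommunicator> stack_;

public:
  // Analogue of declaring the process grid before any rank query is made.
  void init_grid(const std::vector<int> &dims, int rank)
  {
    stack_.push_back(MockCommunicator{dims, rank});
  }

  // Analogue of a communicator-bound rank query: it fails if no
  // communicator has been created yet.
  int comm_rank() const
  {
    if (stack_.empty()) throw std::runtime_error("Current communicator can't be found.");
    return stack_.back().rank;
  }
};

// Analogue of a device-init step that asks for the communicator-bound rank.
int mock_init_device(const MockCommsStack &comms, int device)
{
  (void)device; // the device number is unused in this mock
  return comms.comm_rank();
}
```

In this mock, calling mock_init_device on a fresh MockCommsStack throws, while calling init_grid first lets the same call return the rank. That mirrors the suggested order of calls in Chroma: declare the grid, then initialize the device.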
Checking this now. Thanks for the handy hints,
Best, B
I might have seen the same thing while running chroma/quda-sycl on Intel GPUs without qdp-jit. If you pass -geom x y z t to chroma, QMP gets initialized properly from qdp_initialize_qmp, and that solves the missing communicator issue in QUDA.
Indeed, typically the layout is not set up until the user calls 'Layout::create()' (which is where the logical geometry gets declared), and in user code that generally happens after QDP_initialize(). Technically the logical topology need not be set until then. However, QDP-JIT sets it if you give the -geom parameters. With the tune-rank changes we now need to tell QUDA about this explicitly. The fix will be in Chroma rather than QUDA, since it is the QDP-JIT build of Chroma that initializes QUDA in such a different way.
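As a rough illustration of what "telling QUDA about the logical topology" involves, here is a sketch of a coordinates-to-rank map. To my understanding, initCommsGridQuda takes the number of dimensions, the grid dimensions, a callback of the form int(const int *coords, void *fdata), and a user data pointer; please check quda.h before relying on that. GridInfo and lex_rank_from_coords are hypothetical names, and the lexicographic ordering (x running fastest) is an assumption. A real fix would take the mapping from QMP rather than compute it this way.

```cpp
// Hypothetical helper data: the logical process grid, e.g. from "-geom x y z t".
struct GridInfo {
  int ndim;    // number of grid dimensions in use (at most 4 here)
  int dims[4]; // number of processes along each dimension
};

// Hypothetical callback mapping grid coordinates to a rank.
// Assumes lexicographic ordering with dimension 0 (x) running fastest.
int lex_rank_from_coords(const int *coords, void *fdata)
{
  const GridInfo *grid = static_cast<const GridInfo *>(fdata);
  int rank = 0;
  // Horner-style accumulation from the slowest dimension to the fastest.
  for (int d = grid->ndim - 1; d >= 0; --d) rank = rank * grid->dims[d] + coords[d];
  return rank;
}
```

For a 1 x 1 x 2 x 2 grid, coordinates (0,0,1,0) map to rank 1 and (0,0,0,1) map to rank 2 under this assumed ordering.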
Thanks everyone, I am testing now, and hope to close the issue soon.
Fixed in chroma devel branch (commit: devel 9f6f17448)