Open lcosmai opened 8 months ago
Hi Leonardo,
I expect you should be using the 'develop' branch of milc_qcd rather than 'master'.
--jim
-----Original Message----- From: Leonardo Cosmai Reply-To: milc-qcd/milc_qcd Date: Friday, November 3, 2023 at 12:07 PM To: milc-qcd/milc_qcd Subject: [milc-qcd/milc_qcd] MILC code (master branch) with QUDA version 1.1.0 ---> HISQ fattening unitarization error (Issue #62)
Hi Jim,
Thanks for your suggestion.
Unfortunately, when I tried to compile milc_qcd-develop, I received the following error messages:
../generic_ks/gauss_smear_ks_QUDA.c: In function 'gauss_smear_delete_2link_QUDA':
../generic_ks/gauss_smear_ks_QUDA.c:51:3: warning: implicit declaration of function 'qudaFreeTwoLink' [-Wimplicit-function-declaration]
51 | qudaFreeTwoLink();
| ^~~~~~~
../generic_ks/gauss_smear_ks_QUDA.c: In function 'gauss_smear_v_field_QUDA':
../generic_ks/gauss_smear_ks_QUDA.c:106:3: error: unknown type name 'QudaTwoLinkQuarkSmearArgs_t'
106 | QudaTwoLinkQuarkSmearArgs_t qsmear_args;
| ^~~~~~~
../generic_ks/gauss_smear_ks_QUDA.c:107:14: error: request for member 'n_steps' in something not a structure or union
107 | qsmear_args.n_steps = iters;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:108:14: error: request for member 'width' in something not a structure or union
108 | qsmear_args.width = width;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:109:14: error: request for member 'compute_2link' in something not a structure or union
109 | qsmear_args.compute_2link = compute_2link_temp;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:110:14: error: request for member 'delete_2link' in something not a structure or union
110 | qsmear_args.delete_2link = 0;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:111:14: error: request for member 't0' in something not a structure or union
111 | qsmear_args.t0 = t0;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:112:14: error: request for member 'laplaceDim' in something not a structure or union
112 | qsmear_args.laplaceDim = laplaceDim;
| ^
../generic_ks/gauss_smear_ks_QUDA.c:115:3: warning: implicit declaration of function 'qudaTwoLinkGaussianSmear' [-Wimplicit-function-declaration]
115 | qudaTwoLinkGaussianSmear( MILC_PRECISION, MILC_PRECISION, (void *) t_links, (void *) src, qsmear_args );
| ^~~~~~~~
make[1]: *** [../generic_ks/Make_template:384: gauss_smear_ks_QUDA.o] Error 1
make[1]: Leaving directory '/leonardo/pub/userexternal/lcosmai0/AREA_COMPILAZIONE_MILC/milc_qcd-develop/ks_imp_rhmc'
make: *** [Make_template:223: su3_rhmc_hisq] Error 2
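The first hard error in the log is the unknown type name `QudaTwoLinkQuarkSmearArgs_t`. The `request for member` errors that follow all stem from it, and the two implicit-declaration warnings point the same way: the QUDA headers seen by the MILC build do not seem to declare the two-link smearing interface. A quick way to confirm this is to search the installed header for the three symbols named in the log. The sketch below assumes the MILC interface is declared in `include/quda_milc_interface.h` under the QUDA install prefix; that path, the `QUDA_HEADER` variable, and the function name `check_quda_symbols` are illustrative, not part of either code base.

```shell
#!/bin/sh
# Report which of the symbols named in the compiler output are declared
# in a given header file. Returns non-zero if any symbol is missing.
check_quda_symbols() {
    header=$1
    status=0
    for sym in QudaTwoLinkQuarkSmearArgs_t qudaTwoLinkGaussianSmear qudaFreeTwoLink; do
        if grep -q "$sym" "$header" 2>/dev/null; then
            echo "found:   $sym"
        else
            echo "missing: $sym"
            status=1
        fi
    done
    return $status
}

# Hypothetical usage: set QUDA_HEADER to the header of the QUDA install
# that the MILC Makefile points at.
if [ -n "${QUDA_HEADER:-}" ]; then
    check_quda_symbols "$QUDA_HEADER"
fi
```

If all three symbols are reported missing, the QUDA install most likely predates the two-link smearing support, and no MILC-side setting would fix the compile.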
Hi,
Double check that you have -DQUDA_SMEAR_GAUSS_TWOLINK=ON in the QUDA cmake step.
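One way to double check this without re-running the configure step is to read the option back from the `CMakeCache.txt` in the QUDA build directory, where CMake stores entries as `NAME:TYPE=VALUE`. This is only a sketch: the helper name `cmake_cache_value` and the `QUDA_BUILD_DIR` variable are made up for illustration.

```shell
#!/bin/sh
# Print the cached value of a CMake option from a build directory.
# CMakeCache.txt stores entries as NAME:TYPE=VALUE, e.g. QUDA_MPI:BOOL=ON.
cmake_cache_value() {
    build_dir=$1
    option=$2
    sed -n "s/^${option}:[A-Z]*=//p" "${build_dir}/CMakeCache.txt"
}

# Hypothetical usage: point QUDA_BUILD_DIR at the directory where cmake was run.
if [ -n "${QUDA_BUILD_DIR:-}" ]; then
    cmake_cache_value "$QUDA_BUILD_DIR" QUDA_SMEAR_GAUSS_TWOLINK
fi
```

The call should print `ON` if the option was enabled at configure time. No output means the option is not in the cache at all.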
@lcosmai you also need to use the develop version of QUDA from GitHub. We haven’t made a release tag since the two link smearing support was merged in. Thx
Following your suggestions, I successfully compiled the develop branch of QUDA with Open MPI (-DQUDA_MPI=ON) and two-link Gaussian smearing enabled (-DQUDA_SMEAR_GAUSS_TWOLINK=ON), and then the develop branch of the MILC code against it.
I also tested the MILC code on a GPU cluster equipped with 4 NVIDIA Ampere GPUs, 64GB HBM2, and 32 Intel Ice Lake cores per node.
I appreciate your kind support.
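For anyone landing here with the same errors, the combination that worked in this thread can be sketched roughly as follows. Only `-DQUDA_MPI=ON` and `-DQUDA_SMEAR_GAUSS_TWOLINK=ON` come from the discussion above. `-DQUDA_GPU_ARCH=sm_80` is an assumption for A100 GPUs, the directory names are placeholders, and any further QUDA options a MILC build needs are not shown. The script only prints the commands unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Print each command by default; execute it only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

fetch_and_configure_quda() {
    # Both projects are taken from their 'develop' branches, as advised above.
    run git clone --branch develop https://github.com/lattice/quda.git
    run git clone --branch develop https://github.com/milc-qcd/milc_qcd.git
    # QUDA_MPI and QUDA_SMEAR_GAUSS_TWOLINK are the flags named in this thread;
    # QUDA_GPU_ARCH=sm_80 (A100) is an assumption, adjust it for other GPUs.
    run cmake -S quda -B quda-build \
        -DQUDA_GPU_ARCH=sm_80 \
        -DQUDA_MPI=ON \
        -DQUDA_SMEAR_GAUSS_TWOLINK=ON
}

fetch_and_configure_quda
```

Building QUDA and then the `su3_rhmc_hisq` target in `ks_imp_rhmc` would follow, using whatever MILC Makefile setup the site already has.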
I successfully compiled the MILC code (master branch) with QUDA version 1.1.0 using CUDA v11.8.
The QUDA compilation passed all the tests.
I compiled the su3_rhmc_hisq target for ks_imp_rhmc.
I then launched a test job on 4 nodes, each with 4 Nvidia A100 GPUs.
The job aborted with the following error:
"ERROR: Error in unitarization component of the hisq fattening: 1048576 failures (/leonardo/pub/userexternal/lcosmai0/AREA_COMPILAZIONE_QUDA/quda-1.1.0/lib/interface_quda.cpp:4154 in computeKSLinkQuda())"
Could you please provide any suggestions on how to resolve this issue?
Best regards, Leonardo