milc-qcd / milc_qcd

MILC collaboration code for lattice QCD calculations

MILC code (master branch) with QUDA version 1.1.0 ---> HISQ fattening unitarization error #62

Open lcosmai opened 8 months ago

lcosmai commented 8 months ago

I successfully compiled the MILC code (master branch) with QUDA version 1.1.0 using CUDA v11.8.

The QUDA compilation passed all the tests.

I compiled the su3_rhmc_hisq target for ks_imp_rhmc.

I then launched a test job on 4 nodes, each with 4 Nvidia A100 GPUs.

The job aborted with the following error:

    ERROR: Error in unitarization component of the hisq fattening: 1048576 failures (/leonardo/pub/userexternal/lcosmai0/AREA_COMPILAZIONE_QUDA/quda-1.1.0/lib/interface_quda.cpp:4154 in computeKSLinkQuda())

Could you please provide any suggestions on how to resolve this issue?

Best regards, Leonardo

james-simone commented 8 months ago

Hi Leonardo,

I expect you should be using the 'develop' branch of milc_qcd rather than 'master'.

--jim
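One way to confirm which branch a local checkout is on before rebuilding is sketched below. It assumes the source tree is a git clone (a downloaded archive carries no branch information), and the function names `current_branch` and `on_branch` are made up for this illustration.

```shell
# Print the branch that HEAD of the git checkout in directory $1 points to.
# Fails if HEAD is detached or $1 is not a git working tree.
current_branch() {
    git -C "$1" symbolic-ref --short HEAD
}

# Succeed if the checkout in directory $1 is on branch $2.
on_branch() {
    [ "$(current_branch "$1" 2>/dev/null)" = "$2" ]
}
```

For example, `on_branch ./milc_qcd develop || echo "not on develop"`, where `./milc_qcd` stands in for wherever the checkout lives.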

-----Original Message-----
From: Leonardo Cosmai
Reply-To: milc-qcd/milc_qcd
Date: Friday, November 3, 2023 at 12:07 PM
To: milc-qcd/milc_qcd
Cc: Subscribed
Subject: [milc-qcd/milc_qcd] MILC code (master branch) with QUDA version 1.1.0 ---> HISQ fattening unitarization error (Issue #62)


lcosmai commented 8 months ago

Hi Jim,

Thanks for your suggestion.

Unfortunately, when I tried to compile milc_qcd-develop, I received the following error messages:

    ../generic_ks/gauss_smear_ks_QUDA.c: In function 'gauss_smear_delete_2link_QUDA':
    ../generic_ks/gauss_smear_ks_QUDA.c:51:3: warning: implicit declaration of function 'qudaFreeTwoLink' [-Wimplicit-function-declaration]
       51 | qudaFreeTwoLink();
          | ^~~~~~~
    ../generic_ks/gauss_smear_ks_QUDA.c: In function 'gauss_smear_v_field_QUDA':
    ../generic_ks/gauss_smear_ks_QUDA.c:106:3: error: unknown type name 'QudaTwoLinkQuarkSmearArgs_t'
      106 | QudaTwoLinkQuarkSmearArgs_t qsmear_args;
          | ^~~~~~~
    ../generic_ks/gauss_smear_ks_QUDA.c:107:14: error: request for member 'n_steps' in something not a structure or union
      107 | qsmear_args.n_steps = iters;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:108:14: error: request for member 'width' in something not a structure or union
      108 | qsmear_args.width = width;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:109:14: error: request for member 'compute_2link' in something not a structure or union
      109 | qsmear_args.compute_2link = compute_2link_temp;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:110:14: error: request for member 'delete_2link' in something not a structure or union
      110 | qsmear_args.delete_2link = 0;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:111:14: error: request for member 't0' in something not a structure or union
      111 | qsmear_args.t0 = t0;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:112:14: error: request for member 'laplaceDim' in something not a structure or union
      112 | qsmear_args.laplaceDim = laplaceDim;
          | ^
    ../generic_ks/gauss_smear_ks_QUDA.c:115:3: warning: implicit declaration of function 'qudaTwoLinkGaussianSmear' [-Wimplicit-function-declaration]
      115 | qudaTwoLinkGaussianSmear( MILC_PRECISION, MILC_PRECISION, (void *) t_links, (void *) src, qsmear_args );
          | ^~~~~~~~
    make[1]: *** [../generic_ks/Make_template:384: gauss_smear_ks_QUDA.o] Error 1
    make[1]: Leaving directory '/leonardo/pub/userexternal/lcosmai0/AREA_COMPILAZIONE_MILC/milc_qcd-develop/ks_imp_rhmc'
    make: *** [Make_template:223: su3_rhmc_hisq] Error 2

james-simone commented 8 months ago

Hi,

Double-check that you have -DQUDA_SMEAR_GAUSS_TWOLINK=ON in the QUDA cmake step.
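A quick way to see what an existing QUDA build directory was actually configured with is to look the option up in its `CMakeCache.txt`, where CMake records entries as `NAME:TYPE=VALUE`. The helper below is only a sketch: the name `cmake_option_is_on` is invented here, and it matches the literal value `ON` only, not other spellings CMake accepts as true.

```shell
# cmake_option_is_on BUILD_DIR OPTION
# Succeeds if BUILD_DIR/CMakeCache.txt records OPTION with the value ON.
# Returns 2 if the build directory has no cache file at all.
cmake_option_is_on() {
    cache="$1/CMakeCache.txt"
    [ -f "$cache" ] || { echo "no CMakeCache.txt in $1" >&2; return 2; }
    # Cache lines look like QUDA_SMEAR_GAUSS_TWOLINK:BOOL=ON
    grep -Eq "^$2:[A-Z]+=ON\$" "$cache"
}
```

Usage, with a placeholder build path: `cmake_option_is_on /path/to/quda/build QUDA_SMEAR_GAUSS_TWOLINK && echo "two-link smearing enabled"`.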

-----Original Message-----
From: Leonardo Cosmai
Reply-To: milc-qcd/milc_qcd
Date: Friday, November 3, 2023 at 1:53 PM
To: milc-qcd/milc_qcd
Cc: James N Simone, Comment
Subject: Re: [milc-qcd/milc_qcd] MILC code (master branch) with QUDA version 1.1.0 ---> HISQ fattening unitarization error (Issue #62)


maddyscientist commented 8 months ago

@lcosmai you also need to use the develop version of QUDA from GitHub. We haven’t made a release tag since the two-link smearing support was merged in. Thx
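The compile errors earlier in the thread name three identifiers that MILC's develop branch expects the QUDA headers to declare. A rough check of whether a given QUDA install provides them could look like the sketch below; the function name `quda_has_twolink` and the include path in the usage line are assumptions for illustration, and the search is a plain text match rather than a real compile test.

```shell
# quda_has_twolink INCLUDE_DIR
# Search the QUDA headers under INCLUDE_DIR for the identifiers that the
# MILC compile errors reported as undeclared. Stops at the first one missing.
quda_has_twolink() {
    for sym in QudaTwoLinkQuarkSmearArgs_t qudaTwoLinkGaussianSmear qudaFreeTwoLink; do
        if ! grep -rqw "$sym" "$1" 2>/dev/null; then
            echo "missing: $sym"
            return 1
        fi
    done
    echo "all two-link symbols found"
}
```

Usage, with a placeholder path: `quda_has_twolink /path/to/quda/include`. A "missing" result would be consistent with headers that predate the two-link smearing merge.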

lcosmai commented 8 months ago

Following your suggestions, I successfully compiled the develop branch of QUDA with OPENMPI (-DQUDA_MPI=ON) and the develop branch of the MILC code (-DQUDA_SMEAR_GAUSS_TWOLINK=ON).

I also tested the MILC code on a GPU cluster equipped with 4 NVIDIA Ampere GPUs, 64GB HBM2, and 32 Intel Ice Lake cores per node.

I appreciate your kind support.
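For readers landing on this thread, the sequence that resolved it can be summarized as a short dry-run script. This is a sketch under assumptions: the function name and the directory names `quda` and `quda-build` are placeholders, only the two cmake flags mentioned in the thread are shown, and a real QUDA configuration for MILC will need further options that the thread does not list.

```shell
# Print (do not execute) the commands for the build sequence discussed above.
# $1: QUDA source directory, $2: QUDA build directory (both placeholders).
print_build_plan() {
    quda_src="${1:-quda}"
    quda_build="${2:-quda-build}"
    cat <<EOF
git clone --branch develop https://github.com/lattice/quda.git $quda_src
git clone --branch develop https://github.com/milc-qcd/milc_qcd.git
cmake -S $quda_src -B $quda_build -DQUDA_MPI=ON -DQUDA_SMEAR_GAUSS_TWOLINK=ON
cmake --build $quda_build
EOF
}

print_build_plan
```

Printing the plan instead of running it keeps the script harmless on a login node; the lines can then be reviewed, extended with site-specific options, and run by hand before building the MILC target.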