lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

MILC and QUDA Q charge discrepancy #959

Open cpviolator opened 4 years ago

cpviolator commented 4 years ago

As observed in #942, QUDA and MILC computed values of the topological charge Q are inconsistent.

MILC's F_munu code is here:
Make Field Strength Tensor: https://github.com/milc-qcd/milc_qcd/blob/ac033d6faf9edb86ecb89acabb5ae3108fd618d3/generic/field_strength.c
Compute Q: https://github.com/milc-qcd/milc_qcd/blob/ac033d6faf9edb86ecb89acabb5ae3108fd618d3/wilson_flow/fmunu.c#L41

QUDA's F_munu code is here:
Make Field Strength Tensor: https://github.com/lattice/quda/blob/develop/include/kernels/field_strength_tensor.cuh
Compute Q: https://github.com/lattice/quda/blob/develop/include/kernels/gauge_qcharge.cuh#L36

From what I can see, they use identical methods: compute the clover F_{mu,nu} terms, form the six independent combinations 1/8 * ( F_{mu,nu} - F_{nu,mu} ) for nu < mu < 4, and then take the real trace of products weighted by the Levi-Civita symbol.
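For reference, both codes are targeting the clover-discretized topological charge; modulo the sign and normalization conventions (which can differ between codes, so take this form as my assumption), the quantity is

```latex
Q = \frac{1}{32\pi^2} \sum_x \epsilon_{\mu\nu\rho\sigma}\,
    \mathrm{Re}\,\mathrm{Tr}\!\left[ F_{\mu\nu}(x)\, F_{\rho\sigma}(x) \right]
```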

If someone is willing to go through the MILC code with a fine-tooth comb for errors, I'll do the same for QUDA.

weinbe2 commented 4 years ago

This line looks a little weird: https://github.com/milc-qcd/milc_qcd/blob/ac033d6faf9edb86ecb89acabb5ae3108fd618d3/wilson_flow/fmunu.c#L56

But otherwise they look consistent.

cpviolator commented 4 years ago

@weinbe2 @walkloud and I had a chat about this. It does look slightly off! However, given that the fs argument is antisymmetric, it works out OK on paper. I checked QUDA by replacing the four lines at https://github.com/lattice/quda/blob/develop/include/kernels/gauge_qcharge.cuh#L48

with

double Q1 = getTrace(F[0] * F[5]).real();
double Q2 = getTrace(conj(F[1]) * F[4]).real();
double Q3 = getTrace(F[3] * F[2]).real();
double Q_idx = (Q1 + Q3 + Q2);

and QUDA's results were unchanged, so it must be something deeper.

kostrzewa commented 4 years ago

If it's any help, here's my implementation in tmLQCD; not sure how useful this might be, as one would need to check the definitions of the macros used:

https://github.com/etmc/tmLQCD/blob/master/meas/measure_clover_field_strength_observables.c

cpviolator commented 4 years ago

@kostrzewa Third data points are always welcome! :)

I will look over this code for sure. In the meantime (and I understand if you don't have the time), a third calculation on one of the lattices on NVSOCAL2 /scratch/ETMC_gauge_configurations using tmLQCD would be invaluable. Is that something you can do?

weinbe2 commented 4 years ago

I wonder if it's a difference in the order of reduction... symmE is per-site positive by construction and thus less susceptible to issues, while the topological charge fluctuates in sign from site to site.

I'd be surprised if that issue caused a deviation all the way on the third digit in the topological charge (iirc from your old data), though.
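The reduction-order sensitivity weinbe2 describes can be illustrated with a deliberately artificial toy example (not lattice data): when terms of both signs and very different magnitudes cancel, the order of summation alone changes the result.

```python
import math

# The per-site topological charge density fluctuates in sign, so a
# parallel (GPU tree) reduction and a serial (CPU) reduction can land
# on visibly different totals. Toy example with heavy cancellation:
print(sum([1e16, 1.0, -1e16]))        # left-to-right: the 1.0 is absorbed
print(sum([1e16, -1e16, 1.0]))        # same terms, different order
print(math.fsum([1e16, 1.0, -1e16])) # exactly rounded reference
```

The first sum loses the 1.0 entirely while the second keeps it, purely due to ordering; `math.fsum` gives the exactly rounded result regardless of order.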

cpviolator commented 4 years ago

@weinbe2 Let me see how wildly it varies from double -> single precision...

cpviolator commented 4 years ago

There is also the fact that MILC performs a reunitarisation before any computation. This might have an effect, but if it did, I would expect it to show up in the energy data too.

kostrzewa commented 4 years ago

I will look over this code for sure. In the meantime (and I understand if you don't have the time), a third calculation on one of the lattices on NVSOCAL2 /scratch/ETMC_gauge_configurations using tmLQCD would be invaluable. Is that something you can do?

If it's a small one (24c48), I can easily run that on our local cluster interactively without hassle. I don't quite remember which configs I had provided a few years ago though.

Note that we use Kahan accumulators for just about any global sum, so one might well see differences. (I don't remember, however, what I saw in my crosschecks against a Fortran implementation that did not use Kahan sums)
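Purely as an illustration of what a compensated accumulator buys you (tmLQCD's actual C implementation may differ; this sketch uses the Neumaier variant of Kahan's algorithm, which is my choice, not necessarily theirs):

```python
def compensated_sum(values):
    """Neumaier's variant of Kahan summation: a running compensation
    term recovers low-order bits lost in each individual addition."""
    s = 0.0
    c = 0.0  # accumulated correction for lost low-order bits
    for x in values:
        t = s + x
        if abs(s) >= abs(x):
            c += (s - t) + x  # low-order bits of x were lost
        else:
            c += (x - t) + s  # low-order bits of s were lost
        s = t
    return s + c

# A case where a naive left-to-right sum loses the 1.0 entirely:
data = [1e16, 1.0, -1e16]
print(sum(data), compensated_sum(data))  # 0.0 vs 1.0
```

Two codes summing the identical per-site charge densities can thus disagree in the low digits simply because one compensates and the other does not.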

cpviolator commented 4 years ago

@kostrzewa That would be awesome, thank you! There are some 32c64 located here /scratch/ETMC_gauge_configurations/tmclover_nf2/cA2.60.32 Is that too large?

cpviolator commented 4 years ago

@weinbe2

double precision: Computed topological charge gauge precise is -4.3976775762481104e+00
single precision: Computed topological charge gauge precise is -4.3976789550765041e+00

A noticeable difference, but not on the scale of the MILC-QUDA difference.

weinbe2 commented 4 years ago

@cpviolator gotcha, so it's likely not that...

kostrzewa commented 4 years ago

@kostrzewa That would be awesome, thank you! There are some 32c64 located here /scratch/ETMC_gauge_configurations/tmclover_nf2/cA2.60.32 Is that too large?

Seems to work fine, though it will take a while: using four cores I get about one RK4 step per 30 s on a 32c64 lattice. If you let me know which conf.xxxx of cA2.60.32 this was, I'll grab it from our archive.

kostrzewa commented 4 years ago

One really stupid caveat however: for some odd reason I decided at the time that I would not produce output at step 0, so I can't cross-check the unflowed value of Q without changing code :man_facepalming:

cpviolator commented 4 years ago

@kostrzewa Can you please use /scratch/ETMC_gauge_configurations/tmclover_nf2/cA2.60.32/conf.2960? Thank you.

The data from the Wilson flow issue comes from a monster L=72 lattice, but I saw equally poor agreement on an L=24 test lattice. The data you provide will be fresh, and I will cross-check it against QUDA and MILC. The first few lines (even ~with~ without the step 0) should be enough.

Can you use a step size of 0.03, and either Wilson or Symanzik flow, whichever is easier for you?

kostrzewa commented 4 years ago

@cpviolator will do. Just to make sure that we use the same conf.2960, could you post the checksums? (They are in the last few lines of the config when you open it with less, for example.) We have a "convenience copy" of that particular ensemble which combines two replicas into a single MC chain, and I'm not sure if I provided a config from either of the two replicas or the convenience copy.

Wilson flow, dt = 0.03 it is then!

cpviolator commented 4 years ago

This good?

cpviolator@nvsocal2:/scratch/ETMC_gauge_configurations/tmclover_nf2/cA2.60.32$ head -19 conf.2960
Eg????xlf-infoplaquette = 0.603483301982
 trajectory nr = 3798
 beta = 2.100000, kappa = 0.137300, mu = 0.006000, c2_rec = -0.331000
 time = 1420235063
 hmcversion = 5.2.0
 mubar = 0.000000
 epsilonbar = 0.000000
 date = Fri Jan  2 22:44:23 2015
Eg???qildg-format<?xml version="1.0" encoding="UTF-8"?>
<ildgFormat xmlns="http://www.lqcd.org/ildg"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.lqcd.org/ildg/filefmt.xsd">
  <version>1.0</version>
  <field>su3gauge</field>
  <precision>64</precision>
  <lx>32</lx>
  <ly>32</ly>
  <lz>32</lz>
  <lt>64</lt>
kostrzewa commented 4 years ago

Yes, that should allow me to make sure that we're using the same config.

kostrzewa commented 4 years ago
traj t P Eplaq Esym tsqEplaq tsqEsym Wsym Qsym
002960 0.030000 0.686180441120 11.297504119695 2.012682869608 0.010167753708 0.001811414583 0.003485714687 11.313708126044 
002960 0.090000 0.806359850384 6.971045386193 1.662269708128 0.056465467628 0.013464384636 0.022532015458 9.617323139653 
002960 0.150000 0.878991598130 4.356302467306 1.313487134147 0.098016805514 0.029553460518 0.041039840383 7.996757259966 
002960 0.210000 0.921744280876 2.817205888463 1.026435065052 0.124238779681 0.045265786369 0.051687749455 6.881385421821 
002960 0.270000 0.947189484001 1.901178575961 0.808360459685 0.138595918188 0.058929477511 0.056259807518 6.263404426949 
002960 0.330000 0.962760173244 1.340633763207 0.647102367457 0.145995016813 0.070469447816 0.057995356994 5.978689538200 
002960 0.390000 0.972622504749 0.985589829037 0.527972731282 0.149908212997 0.080304652428 0.058991949457 5.882477193694 
[...]
002960 7.770000 0.999618829708 0.013722130520 0.013204774887 0.828444813700 0.797210553799 0.809670253068 6.849489744762 
002960 7.830000 0.999621789102 0.013615592314 0.013105223728 0.834756987595 0.803466851031 0.816980488919 6.815969477900 
002960 7.890000 0.999624699014 0.013510835485 0.013007301223 0.841077881724 0.809731816447 0.824454866684 6.787916678890 
002960 7.950000 0.999627558938 0.013407878238 0.012910985464 0.847411424306 0.816006558789 0.832086680778 6.765356483408 
002960 8.010000 0.999630368774 0.013306724134 0.012816252253 0.853760751117 0.822292026165 0.839847954761 6.747755044396 
002960 8.070000 0.999633129003 0.013207355892 0.012723073332 0.860127731726 0.828588878452 0.847697893605 6.734258904150 

The last column is the topological charge.

cpviolator commented 4 years ago

Excellent, thank you @kostrzewa. These are QUDA's results:


t, Et, Es, (Et+Es), Q
0.000000e+00 +1.0744528619619671e+00 +1.0743300497637591e+00 +2.1487829117257262e+00 -1.1986946054939605e+01
3.000000e-02 +1.0064045485004334e+00 +1.0062787721926125e+00 +2.0126833206930459e+00 -1.1382100367471125e+01
6.000000e-02 +9.2211203079526127e-01 +9.2197263868241142e-01 +1.8440846694776727e+00 -1.0555243689364456e+01
9.000000e-02 +8.3121319447686459e-01 +8.3105707182172561e-01 +1.6622702662985902e+00 -9.6496416117783319e+00
1.200000e-01 +7.4119045249480209e-01 +7.4102131485710798e-01 +1.4822117673519100e+00 -8.7771558866186208e+00
1.500000e-01 +6.5683116829430466e-01 +6.5665626921387932e-01 +1.3134874375081840e+00 -8.0084819567610008e+00
kostrzewa commented 4 years ago

So we definitely have a sign difference in Q :) and our energies (Esym) match only up to the sixth digit or so.

cpviolator commented 4 years ago

Yeah, the minus sign I can live with, but the Q data is again quite different. 6th DP on Esym is acceptable. I'll run this lattice through MILC and see what we find there. Thank you so much for this cross check!

cpviolator commented 4 years ago

Data after addressing https://github.com/lattice/quda/pull/942/files#r369848547

Computed topological charge gauge precise is -1.1986946054939605e+01 Done in 0.012059 secs
Computed topological charge gauge precise from density function is -1.1986946054939605e+01
GPU value -1.198695e+01 and host density sum -1.198695e+01. Q charge deviation: 2.255973e-13
Plaquette after 0 WFlow steps: 1.8104499059467403e+00 1.8105264779275636e+00 1.8103733339659174e+00
0.000000e+00 +1.0744528619619671e+00 +1.0743300497637591e+00 +2.1487829117257262e+00 -1.1986946054939605e+01
3.000000e-02 +1.0064045485004334e+00 +1.0062787721926125e+00 +2.0126833206930459e+00 -1.1382100367471125e+01
6.000000e-02 +9.2211203079526127e-01 +9.2197263868241142e-01 +1.8440846694776727e+00 -1.0555243689364456e+01
9.000000e-02 +8.3121319447686459e-01 +8.3105707182172561e-01 +1.6622702662985902e+00 -9.6496416117783319e+00
1.200000e-01 +7.4119045249480209e-01 +7.4102131485710798e-01 +1.4822117673519100e+00 -8.7771558866186208e+00
1.500000e-01 +6.5683116829430466e-01 +6.5665626921387932e-01 +1.3134874375081840e+00 -8.0084819567610008e+00

No change :(

cpviolator commented 4 years ago

Changing the boundary conditions from gauge_param.t_boundary = QUDA_PERIODIC_T; to gauge_param.t_boundary = QUDA_ANTI_PERIODIC_T; changes nothing; the data is identical.

walkloud commented 4 years ago

@cpviolator here is output from the MILC code on the tmLQCD 2960 config, including some info from the input file showing the setup. Deviations from QUDA appear in the 4th or 3rd significant digit.

reload_parallel conf.2960
wilson
set staple to wilson
exp_order 8
stepsize 0.03
stoptime 0.3

read_gauge_hdr: Reading as a SciDAC formatted file
Restoring binary SciDAC file conf.2960
Checksums 8e393aee 3a80299a

CHECK PLAQ: 1.8105264779275636e+00 1.8103733339659176e+00
CHECK NERSC LINKTR: 6.8127337546869897e-05 CKSUM: 256b3e81
Reunitarized for double precision. Max deviation 4.28e-14 changed to 6.66e-16
Time to check unitarity = 2.346587e-02
#LABEL time        Et                Es             charge
WFLOW 0 1.074452861961966 1.074330049763758 -11.90848619680366
WFLOW 0.03 1.006404323536305 1.006278546071407 -11.31370812604431
WFLOW 0.06 0.922111732726292 0.9219723390713984 -10.50561298758942
WFLOW 0.09 0.8312129160320689 0.8310567920960001 -9.617323139652836
WFLOW 0.12 0.7411902341742541 0.7410210957709286 -8.757348166079716
WFLOW 0.15 0.6568310167570317 0.6566561173895067 -7.996757259965846
WFLOW 0.18 0.580596001557653 0.5804227834380824 -7.369587514125599
WFLOW 0.21 0.5133004980431644 0.5131345670088233 -6.881385421821498
WFLOW 0.24 0.454756808396089 0.4546013139350166 -6.51968159474718
WFLOW 0.27 0.40425228268451 0.4041081770009374 -6.263404426948907

Comparing with the tmLQCD code, it looks to be the same. Grabbing from @kostrzewa's values above:

traj t P Eplaq Esym Qsym
002960 0.030000 0.686180441120 11.297504119695 2.012682869608 11.313708126044 
002960 0.090000 0.806359850384 6.971045386193 1.662269708128  9.617323139653 
002960 0.150000 0.878991598130 4.356302467306 1.313487134147  7.996757259966 
002960 0.210000 0.921744280876 2.817205888463 1.026435065052 6.881385421821 
002960 0.270000 0.947189484001 1.901178575961 0.808360459685  6.263404426949 
002960 0.330000 0.962760173244 1.340633763207 0.647102367457  5.978689538200 
cpviolator commented 4 years ago

Ok, that's pretty conclusive. I'll bug hunt...

cpviolator commented 4 years ago

@kostrzewa @walkloud Thank you for spinning up and testing this lattice, it narrows the search down immensely.

cpviolator commented 4 years ago

As you've seen, we've cross-checked this against tmLQCD and MILC. In both cases, the energy computation agrees to an acceptable 6 d.p., but the Q charge is significantly different (MILC and tmLQCD differ by only a minus sign).

As an experiment, I hacked the energy kernels to return a double2 composed of the spatial energy and the qcharge, and did not call the qcharge routine at all. This means we have a single instance of the F_munu tensor field construction, and the energy and charge are computed from the same data in the same kernel. The energy data was good, the charge was still out.

t, Q, Es
0.000000e+00 -1.1986946054939603e+01 +1.0743300497637591e+00
3.000000e-02 -1.1382100367471125e+01 +1.0062787721926125e+00
6.000000e-02 -1.0555243689364460e+01 +9.2197263868241131e-01
9.000000e-02 -9.6496416117783319e+00 +8.3105707182172561e-01 

Now, because this data comes from the same kernel, using the same raw F_munu data, we can narrow the search space down. Because the energies match, the construction of the F_munu fields must be good, and the reduction kernel is doing its job properly. The only difference between the energy calculation and the charge calculation is the numbers themselves.

This points to Evan's comment https://github.com/lattice/quda/issues/959#issuecomment-577354954 about the nature of the data. Could it be that the summation of positive and negative reals is causing this error? I would remind everybody that as flow time increases, the QUDA Q charge value converges to the same numbers produced by tmLQCD and MILC.

weinbe2 commented 4 years ago

@cpviolator one idea is running the Q compute on the CPU instead of the GPU in QUDA. I think you can hack that in by manually copying the gauge field to a cpuGaugeField at the right spot. That'd give an order of summation that's likely closer to what's being done in MILC and tmLQCD.

Another idea is putting printf calls into the kernels and printing the checkerboard index and per-site topological charge, comparing it with a per-site printf from MILC (bearing in mind the different flattened indices). This would diagnose if there's a slight deviation on a site-by-site basis, even though I know you've checked and as far as you can tell that isn't the case.
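The flattened-index bookkeeping needed for such a site-by-site comparison can be sketched as follows. This assumes the standard even-odd (checkerboard) layout with x running fastest; QUDA's and MILC's actual conventions may differ in detail, so treat the coordinate ordering here as an assumption.

```python
def lex_index(x, y, z, t, dims):
    """Lexicographic site index with x fastest."""
    Lx, Ly, Lz, Lt = dims
    return ((t * Lz + z) * Ly + y) * Lx + x

def cb_index(x, y, z, t, dims):
    """Map a site to (parity, intra-parity index) in an even-odd layout."""
    parity = (x + y + z + t) % 2
    return parity, lex_index(x, y, z, t, dims) // 2

# Sanity check on a tiny 2^3 x 4 lattice: the mapping is a bijection,
# so per-site values from the two orderings can be matched one-to-one.
dims = (2, 2, 2, 4)
seen = {cb_index(x, y, z, t, dims)
        for t in range(dims[3]) for z in range(dims[2])
        for y in range(dims[1]) for x in range(dims[0])}
volume = dims[0] * dims[1] * dims[2] * dims[3]
assert len(seen) == volume
```

With a table like this, per-site printf output from a checkerboarded kernel can be reordered to line up with lexicographically ordered output before diffing.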

cpviolator commented 4 years ago

I like the CPU check, I'll try that later. Thanks @weinbe2 !

kostrzewa commented 2 years ago

Just to comment on the mention in PR #1234: @Marcogarofalo and @simone-romiti found that the topological charge computed by QUDA still differs from the one computed by tmLQCD (which in turn agrees with MILC).

@Marcogarofalo could you perhaps post the comparison that we looked at during our last call?

Marcogarofalo commented 2 years ago

@Marcogarofalo could you perhaps post the comparison that we looked at during our last call?

sure QUDA

t        Eplaq            Qsym
0.010000 24.604398431378  0.020506984129
0.050000 21.444100662335  0.009464026698
0.090000 18.742298644958  -0.008580991298
0.130000 16.502281193395  -0.030403163614
0.170000 14.662190735698  -0.053541522707

tmLQCD

t        Eplaq            Qsym
0.010000 24.604398357148  -0.018369191909
0.050000 21.444100524446  -0.009167331350
0.090000 18.742298603505  0.007926215192
0.130000 16.502281231195  0.029462854819
0.170000 14.662190804582  0.052614735741
kostrzewa commented 2 years ago

Note that the sign difference is known. I think that the absolute value should be much more consistent than it is, however.

kostrzewa commented 2 years ago

Just digging around in the dark to see where we get.

@Marcogarofalo when you set param.su_project = QUDA_BOOLEAN_TRUE in the QudaGaugeObservableParam used for the measurement in QUDA, does this affect the difference between QUDA and tmLQCD? (The default is QUDA_BOOLEAN_FALSE; see lib/check_params.h around line 970 or so.)

We don't perform SU(3) projection in tmLQCD in the gradient flow, but we might use a different type of exponentiation. I have to admit that I haven't really taken any time to compare exactly what is done in https://github.com/lattice/quda/pull/942/files (exponentiate_iQ) to our implementation of the Cayley-Hamilton form.

cpviolator commented 2 years ago

@kostrzewa There are some debugging tools you can enable by commenting out the preprocessor define https://github.com/lattice/quda/blob/develop/include/kernels/gauge_stout.cuh#L90 to check various SU(N) matrix properties.

Also, you could swap out the exponentiate_iQ with a Taylor series expansion, implemented here for Nc builds: https://github.com/lattice/quda/blob/74057e061f66c91f8e7bf11ada419e0403590407/include/quda_matrix.h#L1197

IIRC, I've tried this, but there is always that % uncertainty. If you try this we can be doubly sure.

Marcogarofalo commented 2 years ago

@Marcogarofalo when you set param.su_project = QUDA_BOOLEAN_TRUE in the QudaGaugeObservableParam used for the measurement in QUDA, does this affect the difference between QUDA and tmLQCD? (The default is QUDA_BOOLEAN_FALSE; see lib/check_params.h around line 970 or so.)

Qsym changes very little, and the result is still very close to the QUDA numbers in https://github.com/lattice/quda/issues/959#issuecomment-1010223108 Below are the numbers for QUDA with param.su_project = QUDA_BOOLEAN_TRUE

t  Esym  Qsym
0.010000  2.472265670828  -0.020506974856
0.050000  2.564649449487  -0.009463987808
0.090000  2.616351723203  0.008581047928
0.130000  2.633541129555  0.030403229983
0.170000  2.623627193432  0.053541597995
kostrzewa commented 2 years ago

Thanks a lot! That likely removes this as a culprit. The other thing we can exclude is a constant factor and as @cpviolator noted before, the two implementations seem to agree more and more with increasing flow time (with some fluctuations, a real check would probably need a larger lattice and an analysis across the full flow time).

0.020506984129/0.018369191909 = 1.116379
0.009464026698/0.009167331350 = 1.032364
0.008580991298/0.007926215192 = 1.082609
0.030403163614/0.029462854819 = 1.031915
0.053541522707/0.052614735741 = 1.017615
[...]
kostrzewa commented 2 years ago

@kostrzewa There are some debugging tools you can enable by commenting out the preprocessor define https://github.com/lattice/quda/blob/develop/include/kernels/gauge_stout.cuh#L90 to check various SU(N) matrix properties.

Also, you could swap out the exponentiate_iQ with a Taylor series expansion, implemented here for Nc builds:

Thanks for the suggestions @cpviolator. I don't know how much time we'll be able to invest into this given that the difference is quite small. We might just accept it with the caveat that there's a few-%-level uncertainty on the individual measurements of Q. I would argue that this kind of difference is not going to mask conclusions of the type "topology is stuck" vs. "we are sampling topological configurations acceptably", which is what we need at present.

cpviolator commented 2 years ago

@kostrzewa Indeed, my colleagues and I came to a similar conclusion WRT topological behaviour during HMC. We found that the most variance occurs at early Wilson flow times, and that after long Wilson flow times the QUDA Q charge value converges to the MILC value very well.

@weinbe2 made some good suggestions with atomic reductions and a CPU implementation to try to get to the root of the issue.

maddyscientist commented 2 years ago

@kostrzewa @Marcogarofalo can you give me a link to the tmLQCD implementation of the topological charge? I assume there's no obvious difference between that and QUDA's? Have you looked at QUDA's implementation to see if anything sticks out?

Marcogarofalo commented 2 years ago

I believe that the topological charge in tmLQCD is computed at https://github.com/etmc/tmLQCD/blob/ce9753848039a9a2f63eeff6d244fd1b4c62718d/meas/measure_clover_field_strength_observables.c#L56 I haven't looked at QUDA implementation in much detail so far

cpviolator commented 2 years ago

@Marcogarofalo @kostrzewa Can you please use the current feature/Wflow_interface branch to recheck the Wilson flow data? I was trying to do a cross check with MILC but my MILC compiling skills have evaporated over the past two years. Just be sure to set su_project = QUDA_BOOLEAN_TRUE as is done here

https://github.com/lattice/quda/blob/feature/Wflow_interface/tests/su3_test.cpp#L263

when you set the gauge observables param in your tmLQCD code. I would be really appreciative, because this bug, though somewhat benign, has been on the issues list for way too long.

Marcogarofalo commented 2 years ago

with QUDA 39fb75e2f1b571cf64fe094f5d79470c239cbc90 setting su_project = QUDA_BOOLEAN_TRUE I get

 t   Eplaq  Qsym
0.010000  24.604398350547  0.020506974723 
0.050000  21.444100501967  0.009463987616 
0.090000  18.742298576666  -0.008581048171 
0.130000  16.502281205973  -0.030403230253 
0.170000  14.662190782782  -0.053541598262 

which are very similar to https://github.com/lattice/quda/issues/959#issuecomment-1010223108 and still differ from tmLQCD

cpviolator commented 2 years ago

Thank you very much @Marcogarofalo! I'll perform a direct MILC comparison and keep hunting.

cpviolator commented 2 years ago

Digging into MILC, I see that their strategy is to exponentiate the traceless anti-Hermitian matrices using a Taylor series, whereas QUDA performs a Cayley-Hamilton decomposition. I think this is enough to cause the deviations we see. Once I have MILC up and running I will hack QUDA to use a Taylor series and try to get the digits to match; then I think we can close this with a caveat.
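For intuition on how large a Taylor-vs-exact discrepancy can be, here is a toy comparison (my own sketch with a made-up matrix, not code from either library): a truncated Taylor series at order 8, matching MILC's "exp_order 8" input above, against an exact exponential via eigendecomposition for exp(A) with A = iH traceless anti-Hermitian.

```python
import numpy as np

def expm_taylor(A, order=8):
    """Truncated Taylor series for the matrix exponential."""
    result = np.eye(A.shape[0], dtype=complex)
    term = np.eye(A.shape[0], dtype=complex)
    for n in range(1, order + 1):
        term = term @ A / n
        result = result + term
    return result

def expm_antihermitian(A):
    """Exact exp(A) for anti-Hermitian A = iH via eigendecomposition of H."""
    H = A / 1j
    evals, U = np.linalg.eigh(H)
    return U @ np.diag(np.exp(1j * evals)) @ U.conj().T

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))
H = 0.05 * (M + M.conj().T)            # Hermitian, modest norm
H -= np.trace(H) / 3 * np.eye(3)       # make it traceless
A = 1j * H

diff = np.abs(expm_taylor(A, 8) - expm_antihermitian(A)).max()
print(diff)  # small but nonzero truncation error
```

For matrices of small norm (as after one small flow step) the truncation error is tiny per step, but it accumulates over many RK4 steps, which is consistent with a small but visible difference in Q.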

maddyscientist commented 2 years ago

Since MILC matches tmLQCD, what strategy does it employ?

cpviolator commented 2 years ago

Cayley-Hamilton, it seems... https://github.com/etmc/tmLQCD/blob/master/meas/gradient_flow.c#L106