lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Fused exterior twisted-clover dslash #170

Closed maddyscientist closed 9 years ago

maddyscientist commented 9 years ago

Twisted-clover fermions do not currently support the fused exterior dslash policy (i.e., where a single kernel is used to update all boundaries instead of separate kernels for each of the x, y, z and t dimensions; this improves strong scaling since it reduces latency). This is because there is no code generator for the twisted-clover fermions. Once the code generator is in place (#169), support for the fused exterior kernels should be added for the twisted-clover dslash.
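For illustration, a minimal CUDA sketch of the difference between the two policies; the kernel names, `Arg` struct and launch configuration are hypothetical stand-ins, not QUDA's actual code:

```cuda
#include <cuda_runtime.h>

// Illustrative stand-ins for QUDA's real kernels and argument struct.
struct Arg { /* spinor/gauge pointers, face geometry, ... */ };

__global__ void exteriorDslashKernel(Arg arg, int dim) { /* update faces of one dimension */ }
__global__ void fusedExteriorDslashKernel(Arg arg, int mask) { /* update all partitioned faces */ }

void applyExterior(const Arg &arg, const bool partitioned[4], bool fused)
{
  if (!fused) {
    // Per-dimension policy: up to four launches, each paying kernel-launch
    // latency -- costly when strong scaling shrinks the local volume.
    for (int dim = 0; dim < 4; dim++)
      if (partitioned[dim]) exteriorDslashKernel<<<64, 256>>>(arg, dim);
  } else {
    // Fused policy: a single launch covers every partitioned boundary,
    // paying the launch latency once.
    int mask = 0;
    for (int dim = 0; dim < 4; dim++)
      if (partitioned[dim]) mask |= (1 << dim);
    fusedExteriorDslashKernel<<<64, 256>>>(arg, mask);
  }
}
```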

maddyscientist commented 9 years ago

Alex, that was fast! Does the code you've checked in work? I can't actually test it :)

AlexVaq commented 9 years ago

For single GPU it works. I'm now testing multi-GPU (which is the tricky part, because it uses everything: packing, exterior…). I'll let you know.

AlexVaq commented 9 years ago

I get this:

ERROR: (CUDA) invalid argument (rank 0, host cwg01, dslash_twisted_clover.cu:249 in twistedCloverDslashCuda())

Which is something I was getting before implementing the fused kernels and the python script. I'm suspecting that there is something wrong with the textures (you know, the problem with the clover and the cloverInv that we thought was solved), but I need to look into the matter.

maddyscientist commented 9 years ago

What machine are you running on? I didn't get any issues on Kepler, but I guess I didn't test Fermi (texture references).

AlexVaq commented 9 years ago

Actually I'm running on a Kepler K20 (?!). Maybe it's something related to my system? But I find it strange, because someone else in my group was complaining about the same error, and he was running on a completely different machine.


AlexVaq commented 9 years ago

The problem is in the spinor packing (the call to inSpinor->pack). If I comment it out, everything works (giving the wrong result, of course).

I'll try to narrow down the error.

AlexVaq commented 9 years ago

This is quite puzzling. If I go to dslash_pack.cu and insert a checkCudaError() after the call to PackFaceWilson and another one after the call to apply_twisted_clover, it works correctly.

Can you reproduce my error? I was just running invert_test for multi-GPU. I'm going to keep testing…
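For context, a sketch of what such a check boils down to; the macro body here is an assumption in the spirit of QUDA's checkCudaError(), not its exact definition. Since cudaGetLastError() reports errors from earlier asynchronous launches, adding or removing a check can move where an error surfaces, which fits the puzzling behaviour described above:

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Poll the most recent CUDA error; placed directly after a kernel launch it
// localizes which launch produced "invalid argument". Because errors are
// reported asynchronously, the check itself can change where they appear.
#define CHECK_CUDA_ERROR()                                              \
  do {                                                                  \
    cudaError_t err_ = cudaGetLastError();                              \
    if (err_ != cudaSuccess) {                                          \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__, __LINE__,  \
              cudaGetErrorString(err_));                                \
      exit(EXIT_FAILURE);                                               \
    }                                                                   \
  } while (0)
```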

AlexVaq commented 9 years ago

This is really puzzling. I removed the checkCudaError() and it works without issues. But again, 12 and 8 reconstruction make the solver fail. The problem is there right from the start, in the prepared source: the norm differs depending on whether I use 18 or another reconstruction.

This reminds me of the clover-fermion issue I pointed out before, but I checked, and Mike did push a fix for the reconstruction during the clover creation process. I don't know what's wrong.

Can you try this check? Run two twisted-clover inversions with the same source, with reconstruction 18 in one case and 12 or 8 in the other, and check whether the prepared source (before the inversion starts) is the same.
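A minimal sketch of this check against the standard quda.h interface (the setup of the param structs is omitted; QUDA_RECONSTRUCT_NO is the 18-real, no-reconstruction storage):

```cpp
#include <quda.h>

// Two inversions with the same source, differing only in reconstruction;
// the prepared-source norm printed before each solve should be identical.
void compare_reconstructions(QudaGaugeParam &gauge_param, QudaInvertParam &inv_param,
                             void *gauge, void *source, void *solution)
{
  gauge_param.reconstruct = QUDA_RECONSTRUCT_NO;   // 18 reals, no reconstruction
  loadGaugeQuda(gauge, &gauge_param);
  invertQuda(solution, source, &inv_param);        // note the prepared-source norm

  gauge_param.reconstruct = QUDA_RECONSTRUCT_12;   // or QUDA_RECONSTRUCT_8
  loadGaugeQuda(gauge, &gauge_param);
  invertQuda(solution, source, &inv_param);        // norm should match the run above
}
```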

maddyscientist commented 9 years ago

Does the difference between 12/8 and 18 persist with periodic instead of anti-periodic boundary conditions?
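The toggle behind this question, at the interface level (a sketch; only the t_boundary field changes between the two runs):

```cpp
#include <quda.h>

// With periodic temporal BCs no phase is applied at the T boundary, so if
// the 12/8-vs-18 discrepancy disappears, the bug is in where the
// reconstruction routines apply the anti-periodic phase.
void set_time_boundary(QudaGaugeParam &gauge_param, bool periodic)
{
  gauge_param.t_boundary = periodic ? QUDA_PERIODIC_T : QUDA_ANTI_PERIODIC_T;
}
```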


AlexVaq commented 9 years ago

You got it. It's the boundary conditions. I'm trying to figure out where the bug is exactly.

AlexVaq commented 9 years ago

Let me guess: the boundary checks assume the gauge-field volume recorded in the X[] array, but the boundaries of the extended field are different, because we add 2 to the size of the gauge field in each direction. Could it be this?

AlexVaq commented 9 years ago

Anyway, I think we should check the time-boundary check in include/gauge_field_order.h. There is a function for that, and the mistake might be there.

maddyscientist commented 9 years ago

Yes, I'm pretty sure this is the source of the problems. I'll think about a solution. One easy thing to do would be to remove the boundary condition when the extended gauge field is created.
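A sketch of the suspected failure mode with illustrative names (not QUDA's actual code): the reconstruction routines apply the T-boundary phase on the last time slice, but a halo-extended field stores 2R extra slices per direction, so a test against the stored extent fires on a halo slice instead of the physical boundary:

```cpp
// X[] holds the stored extents; for an extended field with halo radius R,
// X[3] = T_physical + 2*R, and the physical anti-periodic boundary sits R
// slices inside the stored volume.
inline bool onPhysicalTimeBoundary(int t, const int X[4], int R)
{
  // Buggy for extended fields (R > 0): return t == X[3] - 1;
  return t == X[3] - 1 - R;
}
```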


AlexVaq commented 9 years ago

With my last commit (290ee4571854f1819728947a03e334545944285a) I think this has been solved.

mathiaswagner commented 9 years ago

Does this still need testing, or can we close it and go on with pushing 0.7 through the door?

AlexVaq commented 9 years ago

I tested it on 2 GPUs for several splittings (x, y, z and t) and matpc types (symmetric, asymmetric, ee and oo), and it worked properly. However, it was only on Kepler, without shared dslash and without GPU_COMMS or pthreads, so to tell you the truth there might be a broken case (I'm not expecting one, but I didn't test all the options).

I told Mike I'm confident it's ok, because other options were ok in the past and I don't think my modifications broke anything, but I'm not truly 100% sure.

I'll try to test some more cases later today and let you all know.

Ciao,

Alex

PS: We need to write a clover operator for CPUs in order to have automatic testing of clover and twisted-clover. This is still pending; although one can test these regularizations with a trivial clover term, that is not truly a test. As for me, I'm using tmLQCD for checks.


maddyscientist commented 9 years ago

The pthreads support for 0.7 isn't necessary, as it's only a proof of concept at the moment. Verifying the GPU_COMMS support is working would be nice though. Since more of the code is shared (especially now that it uses the regular Wilson packing kernels), I'm fairly confident things will be ok across architecture types.

I think now is the time to do final testing and then release. The main test I have to do is to verify the Fortran interface is ok, and make any changes required to get BQCD working with 0.7.

AlexVaq commented 9 years ago

I noticed that invert_test might diverge with FUSED kernels if:

  1. Mixed precision is used (I'm trying double/single, but single and double alone were OK).
  2. Tuning is enabled.

I don't understand why; nonetheless, I noticed that plain Wilson fermions diverge as well, so there might be a problem in the tuning framework.

mathiaswagner commented 9 years ago

Did you start tuning fresh or did you reuse some existing cache?

Could you (for convenience) send the command you used to start the test?

AlexVaq commented 9 years ago

For Wilson:

mpiexec -np 2 ./invert_test --dslash_type wilson --xdim 24 --ydim 24 --zdim 24 --Lsdim 1 --tdim 24 --tgridsize 2 --prec double --prec_sloppy single --tune true --recon 12 --recon_sloppy 12

For twisted-clover, just change dslash_type.

I tried both: fresh tune start and re-using tuning done previously for double and single. The funny thing is that dslash_test passes in all cases (for twisted-clover, set clover_coeff to 1e-20 or something like that, so you can compare with twisted-mass until a clover implementation on CPUs is developed).

I'm trying other directions for Wilson (just because Wilson is supposed to be correct), but the cluster seems to be hanging: it works very slowly, and I don't understand why.

Anyway, there is something there that needs some research, but I think it is not twisted-clover related.

AlexVaq commented 9 years ago

By the way, I'm using a K20m.

maddyscientist commented 9 years ago

Alex, I suspect there is something wrong with the tune labelling, leading to a degeneracy of tuned parameters. Can you show me what the generated tunecache file looks like? There should be separate entries for the interior dslash and the fused exterior one, both in single and double precision.

AlexVaq commented 9 years ago

OK, I'll send you an email as soon as I'm available.

Alex


AlexVaq commented 9 years ago

I didn't see anything wrong with the tunecache file. Actually, I saw exactly what you were expecting, and miraculously the twisted-clover inverter now works properly, even in mixed precision. As far as I can tell, I'm not running a different command; I'll have to look at this issue closely.

Nonetheless, Wilson is giving me problems. Can you reproduce this?

Wilson, dslash_test 0 with dagger, 12 recon, no tuning, temporal splitting, FUSED kernels, no GPU_COMMS (I couldn't make it work), on 2 K20m, CUDA 5.5:

CMD:

[avaquero@cwg01 tests]$ mpiexec -np 2 ./dslash_test --test 0 --dslash_type wilson --xdim 24 --ydim 24 --zdim 24 --Lsdim 1 --tdim 24 --tgridsize 2 --prec double --tune false --recon 12 --dagger
running the following test:
prec    recon   test_type   matpc_type   dagger   S_dim        T_dimension   Ls_dimension   dslash_type   niter
double  12      0           even_even    1        24/ 24/ 24   24            1              wilson        10
Grid partition info:  X  Y  Z  T
                      0  0  0  1
Randomizing fields... done.
Found device 0: Tesla K20m
Found device 1: Tesla K20m
Using device 0: Tesla K20m
Loaded 41 sets of cached parameters from /home/avaquero/git/quda//tunecache.tsv
Sending gauge field to GPU
Creating cudaSpinor
Creating cudaSpinorOut
Sending spinor field to GPU
Source: CPU = 2.653480e+06, CUDA = 2.653480e+06
Creating a DiracWilsonPC operator

Spinor mem: 0.030 GiB
Gauge mem: 0.000 GiB
Calculating reference implementation...done.
Executing 10 kernel loops... done.

2495.225519us per kernel call
GFLOPS = 87.756461
GB/s = 153.174913

[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
Results: CPU = 76926911.013883, CUDA = 76483254.476183, CPU-CUDA = 76483254.476172
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dslash
[ RUN      ] dslash.verify
0 fails = 13806
1 fails = 13793
2 fails = 13803
3 fails = 13812
4 fails = 13800
5 fails = 13802
6 fails = 13795
7 fails = 13807
8 fails = 13797
9 fails = 13814
10 fails = 13799
11 fails = 13807
12 fails = 13806
13 fails = 13793
14 fails = 13803
15 fails = 13812
16 fails = 13800
17 fails = 13802
18 fails = 13795
19 fails = 13807
20 fails = 13797
21 fails = 13814
22 fails = 13799
23 fails = 13807
1.000000e-01 Failures: 283970 / 3981312 = 7.132573e-02
1.000000e-02 Failures: 327034 / 3981312 = 8.214227e-02
1.000000e-03 Failures: 331270 / 3981312 = 8.320624e-02
1.000000e-04 Failures: 331722 / 3981312 = 8.331977e-02
1.000000e-05 Failures: 331768 / 3981312 = 8.333132e-02
1.000000e-06 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-07 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-08 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-09 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-10 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-11 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-12 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-13 Failures: 331776 / 3981312 = 8.333333e-02
1.000000e-14 Failures: 331887 / 3981312 = 8.336121e-02
1.000000e-15 Failures: 1184594 / 3981312 = 2.975386e-01
1.000000e-16 Failures: 3427854 / 3981312 = 8.609860e-01
dslash_test.cpp:878: Failure
Expected: (deviation) <= (tol), actual: 1 vs 1e-12
CPU and CUDA implementations do not agree
[  FAILED  ] dslash.verify (5023 ms)
[----------] 1 test from dslash (5023 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5023 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] dslash.verify

1 FAILED TEST
WARNING: Tests failed
dslash_test.cpp:878: Failure
Expected: (deviation) <= (tol), actual: 1 vs 1e-12
CPU and CUDA implementations do not agree
[  FAILED  ] dslash.verify (5041 ms)
[----------] 1 test from dslash (5041 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (5041 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] dslash.verify

1 FAILED TEST

           initQuda Total time = 0.98246 secs

      loadGaugeQuda Total time = 0.135262 secs
          download        = 0.119878 secs (  88.6%), with 1 calls at 1.198780e+05 us per call
          init            = 0.014906 secs (    11%), with 1 calls at 1.490600e+04 us per call
          compute         = 0.000000 secs (     0%), with 1 calls at 0.000000e+00 us per call
          free            = 0.000474 secs (  0.35%), with 1 calls at 4.740000e+02 us per call
          total accounted = 0.135258 secs (   100%)
          total missing   = 0.000004 secs (0.00296%)

            endQuda Total time = 0.240563 secs

Device memory used = 304.0 MB
Page-locked host memory used = 163.5 MB
Total host memory used >= 315.4 MB

[avaquero@cwg01 tests]$


maddyscientist commented 9 years ago

I've reproduced this problem now; it seems to be triggered by the dagger option (all reconstructions exhibit it). Investigating.

maddyscientist commented 9 years ago

OK, fixed this bug in commit ca64d11d1ff99e8eb5e40c8500fb96244cec174e: the fused exterior dagger Wilson kernel wasn't actually being built.
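As a sketch of this class of bug, with illustrative names (not the actual QUDA code): a compile-time dispatch that covers only the non-dagger case never instantiates the dagger template, so the dagger kernel is simply never built:

```cpp
// A template is only compiled when instantiated, so a dispatch that omits
// the dagger branch silently ships without that kernel -- matching the
// symptom that only dslash_test --dagger failed.
template <bool dagger> void launchFusedExterior() { /* launch the kernel */ }

void applyFusedExterior(bool dagger)
{
  if (!dagger)
    launchFusedExterior<false>();
  else
    launchFusedExterior<true>();  // the branch whose absence caused the bug
}
```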

maddyscientist commented 9 years ago

Closing this bug, as I believe we now have the fused twisted-clover dslash working.