lattice / quda

QUDA is a library for performing calculations in lattice QCD on GPUs.
https://lattice.github.io/quda

Multi-GPU wilson MPI broken on certain volumes with full fields #109

Closed maddyscientist closed 11 years ago

maddyscientist commented 11 years ago

The multi-GPU Wilson solver is presently broken for certain communications partitions. I have tracked this to commit b62515dfb67e55dcaf7f42f655e398450f6c439d, which is when the non-degenerate twisted-mass branch was merged.

The following test (when compiled with MPI multi-GPU support) fails with an invalid argument, but works before this commit: `tests/invert_test --xdim 8 --ydim 10 --zdim 12 --tdim 14 --prec double --prec_sloppy single --partition 1`

It is still broken at the HEAD of master. Alexei, can you take a look?

alexstrel commented 11 years ago

Yes, inspecting the problem now. Thank you for noting this. Alexei

alexstrel commented 11 years ago

Actually, the problem existed in previous commits too, if one sets inv_param.solution_type = QUDA_MAT_SOLUTION; i.e., the pre-non-degenerate twisted-mass version gave me a similar error: ERROR: (CUDA) invalid argument (rank 0, host dsg0210, copy_quda.cu:225 in copyCuda()). This does not happen if the full precision is not double...

maddyscientist commented 11 years ago

Confirmed. Thanks for checking this Alexei. I'm currently bisecting to try and work out when this broke.

maddyscientist commented 11 years ago

Ok, on further inspection, I think there are actually two issues here.

  1. An invalid-argument error in dslashXpay when running the solver, as described above.
  2. A pre-existing problem that seems to have been present for a long time: another invalid-argument error, this one in copy_quda.cu when solving the full system.

This second issue seems to occur when cudaMemcpy is called on the odd parity subset of a full field. Furthermore, I know Justin is having a related problem with invalid arguments which may be related to one or both of these.

I am very confused at the moment.

bjoo commented 11 years ago
>   2. A pre-existing problem that seems to have been present for a long time. This issue is also an invalid argument error that occurs with copy_quda.cu when solving the full system.
>
> This second issue seems to occur when cudaMemcpy is called on the odd parity subset of a full field. Furthermore, I know Justin is having a related problem with invalid arguments which may be related to one or both of these.

As an extra data point: I've not fallen afoul of this yet, as far as I know. I suspect the reason is that I typically don't solve the full system. (I always do the preconditioned solve with QUDA and then reconstruct the full solution myself.)

Best,

B

Dr Balint Joo, High Performance Computational Scientist
Jefferson Lab, 12000 Jefferson Ave, Suite 3, MS 12B2, Room F217, Newport News, VA 23606, USA
Tel: +1-757-269-5339, Fax: +1-757-269-5427

email: bjoo@jlab.org

maddyscientist commented 11 years ago

I guess the reason this issue never arose before is that the only app that uses the full-system solver is qcdlib (qop), and that app has never been updated to support multi-GPU capability.

I believe the issue is related to how the parity fields within a full field are referenced. There is some potentially dangerous pointer behaviour going on in there. I haven't tracked down the issue yet. One solution may be to do away entirely with the concept of a contiguous full field, and instead have a full field be simply some metadata plus pointers to even and odd sub-fields.

Regardless of what the solution is, I shall work on making this part of the code more stable.

maddyscientist commented 11 years ago

Commit c7ed9a1d043e7ff49fd60e316837946dafd14e4b closes this issue.

The problem was caused by the ALIGNMENT_ADJUST macro that Guochun inserted years ago, which rounds up the memory allocation size to ensure correct alignment for textures. The problem is that, when using full fields, the parity fields (even and odd) that live inside the full field, and are references into it, would compute a different allocation size from their parent. As a result, the odd field would point to the wrong location in memory (hence getting the wrong answer). Furthermore, since the child fields had the wrong allocation size (too big), cudaMemcpy and other functions could break because they would access unallocated memory.

In summary, a pig of a bug to work out. I have fixed all of this, but a better design will be needed in the long term, as having two parity sub-fields reference a full field is very brittle (at least as it is currently done).

(I have updated the issue title to better reflect the bug for record keeping).