Feature/milc quda zero copy

This pull request is a significant optimization of the MILC-QUDA interface.

When compiling for QUDA, allocate the lattice site array in page-locked memory. This allows the GPU to read from and write to the array directly.
When compiling for QUDA, align the link and momentum matrices on 32-byte boundaries for improved access patterns.
Do not extract the gauge field and momentum field from the site array; instead, pass the site array pointer to QUDA and let QUDA take care of extracting and inserting the required data.
The QUDA-offloaded gauge force, reunitarization, and momentum exponentiation functions use this simplified interface, and as a result are up to 3x faster, since no CPU extraction/insertion functions are involved.
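
As a rough illustration of the layout idea above, here is a minimal C sketch of a site structure with 32-byte-aligned link and momentum members, read in place through nothing but a base pointer, a per-site stride, and a field offset. The type and function names (`site`, `su3_matrix`, `site_link`, `check_site_array`) are simplified, hypothetical stand-ins and not the actual MILC or QUDA definitions, and `aligned_alloc` stands in for a page-locked allocator so the sketch runs without a GPU.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the real MILC types. */
typedef struct { double real, imag; } complex_d;
typedef struct { complex_d e[3][3]; } su3_matrix;   /* 144 bytes */

typedef struct {
    int x, y, z, t;                  /* other per-site data */
    alignas(32) su3_matrix link[4];  /* gauge links, 32-byte aligned */
    alignas(32) su3_matrix mom[4];   /* momentum, 32-byte aligned (simplified type) */
} site;

/* The field offsets and the site stride must all preserve the alignment. */
static_assert(offsetof(site, link) % 32 == 0, "link must be 32-byte aligned");
static_assert(offsetof(site, mom) % 32 == 0, "mom must be 32-byte aligned");
static_assert(sizeof(site) % 32 == 0, "site stride must keep alignment");

/* Locate link `dir` of site `i` using only the base pointer, the per-site
   stride, and the field offset; no separate gauge-field copy is made. */
static const su3_matrix *site_link(const void *base, size_t stride,
                                   size_t offset, size_t i, int dir)
{
    const char *p = (const char *)base + i * stride + offset;
    return (const su3_matrix *)p + dir;
}

/* Returns 0 if every link read through the strided accessor matches what was
   written through the struct and each site's first link is 32-byte aligned;
   returns a nonzero value otherwise. */
int check_site_array(size_t nsites)
{
    /* Stand-in allocation. In a CUDA build one would presumably use a pinned
       allocator such as cudaHostAlloc() here instead (assumption). */
    site *lattice = aligned_alloc(32, nsites * sizeof(site));
    if (lattice == NULL)
        return -1;

    /* Tag each link with a unique value through the ordinary struct view. */
    for (size_t i = 0; i < nsites; i++)
        for (int dir = 0; dir < 4; dir++)
            lattice[i].link[dir].e[0][0].real = (double)(4 * i + dir);

    int bad = 0;
    for (size_t i = 0; i < nsites; i++) {
        for (int dir = 0; dir < 4; dir++) {
            const su3_matrix *u = site_link(lattice, sizeof(site),
                                            offsetof(site, link), i, dir);
            if (u->e[0][0].real != (double)(4 * i + dir))
                bad++;
            if (dir == 0 && (uintptr_t)u % 32 != 0)
                bad++;
        }
    }

    free(lattice);
    return bad;
}
```

Calling `check_site_array(16)` should return 0. The point of the sketch is that the consumer needs only the pointer, `sizeof(site)`, and `offsetof(site, link)` to find every link, which is what makes a separate CPU-side extraction pass unnecessary.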

Note that this pull request and https://github.com/lattice/quda/pull/650 depend on each other, so both need to be merged.

This pull request also contains #16, so either merge #16 first, or simply merge this one and close #16.
