stfc / PSyclone

Domain-specific compiler and code transformation system for Finite Difference/Volume/Element Earth-system models in Fortran

[PSyAD/LFRic] Test harness: Ensure that the TL/AD routines have enough influence on inner products to be detected by the test. #2718

Open mo-joshuacolclough opened 1 month ago

mo-joshuacolclough commented 1 month ago

Description

During testing of the adjoint routines in the LFRic adjoint model, we found that some kernel tests passed as false positives because the TL routine did not have enough influence on the dot product result. In one example, the TL call changed the dot product by only 31 * machine tolerance, which is far below the detection range of the test (overall_tolerance = 1500 * machine tolerance).

The small influence on the dot product comes from the nature of the kernel: it takes a linearisation state (ls) field, which is automatically initialised with random values in the range [0.0, 1.0]. Scaling this field increased the influence of the TL/AD routines on the dot product result, and the TL/AD mismatch was then detected.
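
To illustrate the effect outside LFRic, here is a minimal standalone sketch. It is not the real kernel or harness: `toy_tl`, `coeff`, the array size and the scaling factor `1.0e4` are all hypothetical, chosen only so that the toy TL depends weakly on the ls values. It measures how far the inner product moves, in units of machine spacing, with the default-style ls values and again with scaled ls values.

```fortran
program ls_scaling_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none

  integer, parameter :: n = 1000
  ! Hypothetical coefficient: makes the toy TL only weakly sensitive to ls
  real(real64), parameter :: coeff = 1.0e-14_real64
  ! Detection threshold quoted in this issue, in units of machine spacing
  real(real64), parameter :: overall_tolerance = 1500.0_real64
  real(real64) :: x(n), y0(n), ls(n)

  call random_number(x)
  call random_number(y0)
  call random_number(ls)   ! ls initialised with random values in [0.0, 1.0)

  print *, "threshold (spacings):        ", overall_tolerance
  print *, "influence with unscaled ls:  ", influence(ls)
  print *, "influence with ls * 1.0e4:   ", influence(1.0e4_real64 * ls)

contains

  ! Toy stand-in for a TL kernel (assumption, not an LFRic kernel)
  subroutine toy_tl(x, y, ls)
    real(real64), intent(in)    :: x(:), ls(:)
    real(real64), intent(inout) :: y(:)
    y = y + coeff * ls * x
  end subroutine toy_tl

  ! Change in the inner product caused by the TL call, in machine spacings
  function influence(ls_in) result(diff_in_spacings)
    real(real64), intent(in) :: ls_in(:)
    real(real64) :: diff_in_spacings
    real(real64) :: y(n), inner0, inner1

    y = y0
    inner0 = dot_product(x, x) + dot_product(y, y)
    call toy_tl(x, y, ls_in)
    inner1 = dot_product(x, x) + dot_product(y, y)
    diff_in_spacings = abs(inner0 - inner1) / &
                       spacing(max(abs(inner0), abs(inner1)))
  end function influence

end program ls_scaling_sketch
```

With these hypothetical values, the unscaled case should come out far below the threshold and the scaled case far above it. The exact numbers depend on the compiler and the random seed.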

Proposal

A fix for the underlying problem relies on knowledge of the specific kernel (how the kernel uses the ls field), so it is hard to suggest a general solution, i.e. a way to initialise the linearisation state fields that works "nicely" with any given kernel.

Instead, this problem could be detected automatically and reported as a test failure.

A patch can then be made for that specific kernel test to scale the inputs appropriately.

(Tagging @DrTVockerodtMO for visibility).

mo-joshuacolclough commented 1 month ago

Example of the proposed solution on a patched test:

    ! Calculate inner0 = x_innerproduct_x( activeX ) + x_innerproduct_x( activeY )
    ! Perform TL forwards
    ! Calculate inner1 = x_innerproduct_x( activeX ) + x_innerproduct_x( activeY )
    ! ...

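    ! Machine spacing at the magnitude of the larger inner product
    MachineTol = SPACING(MAX(ABS(inner0), ABS(inner1)))
    ! Change caused by the TL call, expressed in units of that spacing
    relative_diff = ABS(inner0 - inner1) / MachineTol
    ! Flag the test if the TL influence is within the AD test's tolerance
    if (relative_diff <= overall_tolerance) then
      WRITE(log_scratch_space, *) "FAILED finicky_kernel_type: TL does not have &
        &enough influence to ensure failure is detected. ", inner0, inner1, relative_diff
      call log_event(log_scratch_space, log_level_error)
    end if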

    ! Go on to perform AD, and AD/TL comparison <AMx, x> == <Mx, Mx>

Result for the problematic kernel:

ERROR:  FAILED finicky_kernel_type: TL does not have enough influence to ensure failure is detected.    10085.7506053876        10085.7506053876        31.0000000000000
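
For a sense of scale, the snippet below evaluates the spacing at the reported inner product value. It assumes the inner products are 64-bit reals, which is not stated explicitly in this thread. Under that assumption the spacing near 10085.75 is 2**(-39), about 1.8e-12, so the observed change of 31 spacings is about 5.6e-11, while the 1500-spacing threshold corresponds to about 2.7e-9. A change that small is consistent with inner0 and inner1 printing identically in the log line above.

```fortran
program spacing_scale
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none

  ! Inner product value copied from the log line above
  real(real64), parameter :: inner = 10085.7506053876_real64
  real(real64) :: tol

  ! Distance between adjacent representable numbers near 'inner'
  tol = spacing(inner)

  print *, "spacing(inner)            =", tol
  print *, "observed change   (31 x)  =", 31.0_real64 * tol
  print *, "detection limit (1500 x)  =", 1500.0_real64 * tol
end program spacing_scale
```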