tempoCollaboration / OQuPy

A Python package to efficiently simulate non-Markovian open quantum systems with process tensors.
https://oqupy.readthedocs.io

Towards systematic performance testing #120

Closed gefux closed 3 months ago

gefux commented 4 months ago

We should make a start towards testing/assessing performance of our OQuPy functionality more systematically. I'd suggest something of this sort:

For each major functionality of OQuPy there should be a file in the tests/performance/ directory that defines a list of parameters and a function. I'll illustrate this with a sketch of what this could look like for the PT-TEBD functionality:

Say we'd like to see how PT-TEBD performs for chain lengths of 4, 8, 12, and 16 sites, for both an XX and an XYZ model; then the parameters and the function could look like this:

# file: tests/performance/pt_tebd.py
# ...
parameters_A1 = [
    ["boson_alpha0.16_zeta3.0_T0.0_gauss_dt0.04"], # process_tensor_name
    [4, 8, 12, 16],                                # number_of_sites
    ["XX", "XYZ"],                                 # model
    [1.0e-6],                                      # tebd_epsrel
]

def pt_tebd_performance_A(process_tensor_name,
                          number_of_sites,
                          model,
                          tebd_epsrel):
    pt = import_process_tensor(
        "./tests/data/process_tensors/" + process_tensor_name + ".hdf5")
    # ... compute dynamics and an error estimate of the result
    return error_estimate
# ...

This is of course not itself a "test" in the strict sense, but it could be used as one in the future by automatically executing all parameter combinations (8 combinations in the above example) and comparing the error_estimates and/or runtimes with target values set by previous runs. Besides testing, this could also be used to analyse scaling behaviour and the like. (For example: how do computation time and memory requirements scale with number_of_sites for the different models?)

I imagine that there could be several parameter sets like the above for each performance test, and also several types of performance test, such that we'd summarize the tests in a variable called "ALL_TESTS":

ALL_TESTS = [
    (pt_tebd_performance_A, [parameters_A1, parameters_A2]),
    (pt_tebd_performance_B, [parameters_B1]),
]
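
To make the idea of "automatically executing all parameter combinations" concrete, here is a minimal, purely hypothetical runner sketch (the name run_all_tests and the result dictionary are not part of the proposal): it expands each parameter set with itertools.product and calls the corresponding test function once per combination.

import itertools

def run_all_tests(all_tests):
    """Run every parameter combination of every registered performance test."""
    results = {}
    for test_function, parameter_sets in all_tests:
        for set_index, parameter_set in enumerate(parameter_sets):
            # e.g. parameters_A1 above yields 1 * 4 * 2 * 1 = 8 combinations
            for combination in itertools.product(*parameter_set):
                key = (test_function.__name__, set_index, combination)
                results[key] = test_function(*combination)
    return results

A full run over all registered tests would then simply be run_all_tests(ALL_TESTS).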

Note: To do this efficiently, it would be good to have a library of several precomputed process tensors (as hinted at in the above code). There is a separate issue on this: #119.

@piperfw: Any thoughts on this?

piperfw commented 4 months ago

Hi, yes I like the idea of this and the outline you suggest.

One thought is how to organise the results of the tests in this format (e.g. one set of results for each parameter set?), if the intention is to keep that data.

Were you thinking of using the Python profiler or writing the tests ad hoc? The information from cProfile is very detailed and can be nicely visualised in the browser with e.g. snakeviz (https://jiffyclub.github.io/snakeviz/ if you are not familiar). That would cover the computational performance; error and accuracy metrics would have to be calculated separately.
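
For illustration only, a minimal sketch of how cProfile could wrap one of the proposed performance functions (reusing the pt_tebd_performance_A sketch above; the output file name is just an example), after which the profile can be opened with snakeviz:

import cProfile

profiler = cProfile.Profile()
error_estimate = profiler.runcall(
    pt_tebd_performance_A,
    "boson_alpha0.16_zeta3.0_T0.0_gauss_dt0.04",  # process_tensor_name
    16,                                           # number_of_sites
    "XX",                                         # model
    1.0e-6,                                       # tebd_epsrel
)
profiler.dump_stats("pt_tebd_A.prof")  # visualise with: snakeviz pt_tebd_A.prof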

gefux commented 4 months ago

Hi @piperfw. I'd leave the content and format of the result ('error_estimate' in the above example) up to the specific performance test. In the simplest case this could be 'None', in which case the calling function would only assess computational resources such as CPU time and memory usage. It could also be an error estimate coming from a comparison with a known exact result, or, if no such result exists, a resulting reduced density matrix, so that a calling test function could assess, for example, how rapidly the results converge with some parameter.

I agree that the use of a profiler like cProfile would fit very well here. I'd think of this, however, as a separate, second step in the process. The first step is to collect functions that are suited to and meaningful for profiling; this is my suggestion above. The actual profiling etc. would follow afterwards and would be written almost independently. The most straightforward implementation would be, for example, a script that searches all files in the tests/performance/ directory, runs all the functions, and compares the runtimes with a record of previous computation times. The test fails if a run took significantly longer; if it was faster, the record is updated.
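
As a rough, purely illustrative sketch of such a script (the record file name, the tolerance, and the helper name are assumptions, not design decisions):

import json
import time
from pathlib import Path

RECORD_FILE = Path("tests/performance/runtime_record.json")
SLOWDOWN_TOLERANCE = 1.2  # fail if more than 20% slower than the record

def check_runtime(test_name, test_function, *arguments):
    """Time one performance function, compare against the record, update if faster."""
    record = json.loads(RECORD_FILE.read_text()) if RECORD_FILE.exists() else {}
    start = time.perf_counter()
    test_function(*arguments)
    runtime = time.perf_counter() - start
    previous = record.get(test_name)
    if previous is not None and runtime > SLOWDOWN_TOLERANCE * previous:
        raise AssertionError(
            f"{test_name} took {runtime:.2f}s (record: {previous:.2f}s)")
    if previous is None or runtime < previous:
        record[test_name] = runtime
        RECORD_FILE.write_text(json.dumps(record, indent=2))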

piperfw commented 4 months ago

Hi @gefux, that makes sense on both fronts, thanks! That is how I would envision the profiling working too, although I expect there are quite a few details that need to be worked out (I do not know much about performance testing), e.g. whether some form of hardware profiling is needed for consistency, and whether we consider the accuracy when updating a record.

...or if that doesn't exist it could be a resulting reduced density matrix, such that a calling test function could assess how rapidly the results converge with some parameter, for example.

That sounds an awful lot like automated convergence testing, which I think we may agree is not an easy problem :smiley:. So perhaps restrict this to known/exact results, or at least physical checks, to start with.

piperfw commented 4 months ago

Ah, I realise checking whether a result is converged is quite different from suggesting different TEMPO parameters to achieve convergence, so actually that sounds workable :).

gefux commented 4 months ago

Ah, I realise checking whether a result is converged is quite different from suggesting different TEMPO parameters to achieve convergence, so actually that sounds workable :).

Yes. In the above example one could have a parameter set with a series of values for the convergence parameter tebd_epsrel:

parameters_A2 = [
    ["boson_alpha0.16_zeta3.0_T0.0_gauss_dt0.04"], # process_tensor_name
    [16],                                          # number_of_sites
    ["XX" ],                                       # model
    [1.0e-4, 1.0e-5, 1.0e-6, 1.0e-7],              # tebd_epsrel
]

... and then observe how the result changes.
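
As a sketch of how such a series could be assessed (assuming, purely for illustration, that pt_tebd_performance_A returns a scalar error estimate; comparing against the tightest tolerance is just one possible choice):

epsrel_values = [1.0e-4, 1.0e-5, 1.0e-6, 1.0e-7]
results = [
    pt_tebd_performance_A(
        "boson_alpha0.16_zeta3.0_T0.0_gauss_dt0.04", 16, "XX", epsrel)
    for epsrel in epsrel_values
]
reference = results[-1]  # treat the tightest tolerance as the reference
for epsrel, result in zip(epsrel_values, results):
    print(f"tebd_epsrel={epsrel:.0e}: deviation from reference = "
          f"{abs(result - reference):.2e}")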

piperfw commented 3 months ago

I've discussed implementing a mean-field performance test with @JoelANB, which should be a good test of the proposed structure from a developer's viewpoint. If anything relating to the testing format arises we can discuss it here.