mdolab / adflow

ADflow is a finite volume RANS solver tailored for gradient-based aerodynamic design optimization.

Fix derivs of mphys wrapper #311

Closed · joanibal closed 11 months ago

joanibal commented 1 year ago

Purpose

This PR fixes a subtle memory error caused by #308. After refactoring how the wrapper sets the volume and state data, it is now possible to update the geometry data without updating the state data. This is how things ought to be. However, the geometry update routines free some data while running to reduce the total memory usage and then reallocate it once done (see computeWallDistance).

The radi, radj, and radk data of each block were being erased and not recomputed before running the reverse-mode jac-vec products. As a result, the radi, radj, and radk values used were sometimes incorrect or NaN. This did not happen every time, though; some optimizations did work for me.

I fixed this by changing a flag so that computeWallDistance no longer deallocates and reallocates radi, radj, and radk. I'm not sure what other implications this has...
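For context, here is a minimal sketch of the call sequence that triggers the problem, written at the Python wrapper level. This is illustrative only: `solver` stands for an instantiated ADflow object, `ap` and `new_surf_coords` are placeholders, and the actual freeing happens inside the Fortran computeWallDistance routine.

```python
# Illustrative sketch only; the deallocation itself happens inside the
# Fortran computeWallDistance routine that updateGeometryInfo calls.

# 1) Geometry-only update: the wrapper moves the surface mesh without
#    touching the states.
solver.setSurfaceCoordinates(new_surf_coords)
solver.updateGeometryInfo()  # radi/radj/radk are freed and reallocated here

# 2) If no residual evaluation happens before the next reverse-mode
#    jacobian-vector product, the reallocated radi/radj/radk hold stale or
#    uninitialized (possibly NaN) values when that product reads them:
solver.computeJacobianVectorProductBwd(funcsBar={"drag": 1.0}, wDeriv=True)

# 3) Calling solver.getResidual(ap) after the geometry update would
#    repopulate them; this PR instead stops computeWallDistance from
#    erasing them in the first place.
```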

Expected time until merged

A few days, since the main branch is broken.

Type of change

Testing

This is tricky, since sometimes the error doesn't occur. I suggest running a long optimization with many constraints.

Checklist

codecov[bot] commented 1 year ago

Codecov Report

Merging #311 (6476180) into main (ca18a7e) will not change coverage. The diff coverage is n/a.

```
@@           Coverage Diff           @@
##             main     #311   +/-   ##
=======================================
  Coverage   41.94%   41.94%           
=======================================
  Files          13       13           
  Lines        4005     4005           
=======================================
  Hits         1680     1680           
  Misses       2325     2325
```


joanibal commented 1 year ago

The tests were passing before I merged in main, and the error looks like a path or dependency issue.

```
/bin/sh: 1: /home/vsts/work/1/s/tapenade3.10/bin/tapenade: not found
make: *** [Makefile_tapenade:305: ad_forward] Error 127
Tapenade runscript failed
```

eirikurj commented 1 year ago

@joanibal the test failures are due to the network outage on the UM campus. Once it is restored, we should be able to rerun the tests.

A-CGray commented 1 year ago

@joanibal thanks for this, did you do any comparison of performance (in terms of memory usage or runtime) with and without this change?

sseraj commented 1 year ago

#308 has not been merged yet, so maybe we should address this all in the same PR?

joanibal commented 1 year ago

> @joanibal thanks for this, did you do any comparison of performance (in terms of memory usage or runtime) with and without this change?

I haven't done any comparison. radi, radj, and radk have dimensions ie*je*ke, which is roughly the number of cells, so the memory usage is about 3 * num_cells * 8 bytes. For a million-cell mesh this would be about 24 megabytes... This is lower than I would have expected, given that someone went through the effort of temporarily deallocating these variables. However, I didn't see much of a change in memory usage for my toy problem.
In regard to wall time, if anything I would expect this to be very slightly faster, since memory no longer has to be reallocated during each sensitivity analysis, but probably not by enough to notice.
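As a quick sanity check on that estimate (a sketch; the factor of 3 covers radi, radj, and radk, and 8 bytes assumes double precision):

```python
# Back-of-the-envelope memory estimate for radi, radj, and radk.
n_cells = 1_000_000        # roughly ie*je*ke per block, summed over blocks
bytes_per_value = 8        # double precision
n_arrays = 3               # radi, radj, radk

total_mb = n_arrays * n_cells * bytes_per_value / 1e6
print(f"{total_mb:.0f} MB")  # -> 24 MB
```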

joanibal commented 1 year ago

> #308 has not been merged yet, so maybe we should address this all in the same PR?

It would be nice if we could get this merged quickly, since the code in the current Docker images doesn't work. So I'm for whatever makes that happen.

A-CGray commented 1 year ago

Thanks for the explanation. These changes seem fine to me on the surface, but my knowledge of the Fortran parts of ADflow is minimal. Do you have any results you could post here that show that this fix works?

joanibal commented 1 year ago

Here is a comparison of some of the derivatives with FD

```
-----------------
Total Derivatives
-----------------

  Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSecCurve_15:Circle_Diameter'
    Analytic Magnitude: 2.283213e+02
          Fd Magnitude: 2.282604e+02 (fd:None)
    Absolute Error (Jan - Jfd) : 6.093151e-02 *

    Relative Error (Jan - Jfd) / Jfd : 2.669386e-04 *

    Raw Analytic Derivative (Jfor)
[[-228.32132549]]

    Raw FD Derivative (Jfd)
[[-228.26039398]]
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSecCurve_8:Circle_Diameter'
    Analytic Magnitude: 8.221510e+00
          Fd Magnitude: 8.733695e+00 (fd:None)
    Absolute Error (Jan - Jfd) : 5.121854e-01 *

    Relative Error (Jan - Jfd) / Jfd : 5.864476e-02 *

    Raw Analytic Derivative (Jfor)
[[-8.22150967]]

    Raw FD Derivative (Jfd)
[[-8.73369509]]
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
  Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSec_4:TopLAngle'
    Analytic Magnitude: 1.471658e-01
          Fd Magnitude: 1.460313e-01 (fd:None)
    Absolute Error (Jan - Jfd) : 1.134546e-03 *

    Relative Error (Jan - Jfd) / Jfd : 7.769199e-03 *

    Raw Analytic Derivative (Jfor)
[[0.14716581]]

    Raw FD Derivative (Jfd)
[[0.14603126]]
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
```
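For reference, output in this format is what OpenMDAO's totals check produces. A minimal sketch of how such a comparison is typically generated (an assumption, not the exact script used here: `prob` stands for the already-configured OpenMDAO Problem for this case, and the of/wrt names are taken from the output above):

```python
# Sketch only: `prob` is assumed to be the already-built OpenMDAO Problem
# containing the MPhys/ADflow model used for the results above.
prob.run_model()

# Compare analytic totals against finite differences for the same
# of/wrt pairs shown above.
prob.check_totals(
    of=["cruise.aero_post.drag"],
    wrt=[
        "geom_dvs.nacelle:XSecCurve_15:Circle_Diameter",
        "geom_dvs.nacelle:XSecCurve_8:Circle_Diameter",
        "geom_dvs.nacelle:XSec_4:TopLAngle",
    ],
    method="fd",
)
```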
anilyil commented 1 year ago

I want to review this before we merge. Please bear with me for the next 2-3 days, because we have the hackathon going on as well.

joanibal commented 11 months ago

Bumping for visibility.

joanibal commented 11 months ago

I agree that the intermediate variables need to be kept up to date too. However, the way I see it, there are two issues here. The issue isn't just that the radi-j-k variables aren't consistent with the current state; it is that they have been erased. One has to be very careful when using the updateGeometryInfo method, and it doesn't even seem that the memory saved is significant (the max memory usage will occur during the primal solve anyway, right?). So I think there are two separate issues:

1) Intermediate variables need to stay consistent with the state and geometric variables.
2) The radi-j-k variables are being quietly erased.

Problem 1 is already addressed (by @A-CGray) by calling the residuals (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/adflow/mphys/mphys_adflow.py#L1010-L1013) whenever either the states or the mesh coordinates change; a simplified sketch of that pattern is shown below. Problem 2 is fixed by this PR, in my view.
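A simplified sketch of that guard pattern, with placeholder names (the real logic lives in the mphys_adflow.py lines linked above):

```python
import numpy as np

# Placeholder helper: not the actual mphys_adflow code, just the shape of
# the update guard being described.
def set_states_and_coords(solver, ap, states, coords, prev_states, prev_coords):
    states_are_equal = np.array_equal(states, prev_states)
    coords_are_equal = np.array_equal(coords, prev_coords)

    if not coords_are_equal:
        solver.setSurfaceCoordinates(coords)
        solver.updateGeometryInfo()

    if not (states_are_equal and coords_are_equal):
        solver.setStates(states)
        # Re-evaluating the residual brings intermediate data (including the
        # spectral radii) back in sync before any jac-vec products are taken.
        solver.getResidual(ap)
```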

> A geometry-only change in the warper, for example, should only require jac-vec products w.r.t. geometry, so the radi-j-k values that are getting re-allocated with wrong values should not affect it.

I'm not sure I understand this. My thinking was that geometric changes will, in general, affect all outputs, some of which depend on radi-j-k. For example, all of the functional jac-vec products are nonzero for this test (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/tests/reg_tests/refs/funcs_euler_upwind_tut_wing.json#L123-L152).

anilyil commented 11 months ago

If we were calling getResidual after setting the state anyway, I don't understand how radi-j-k was not being updated. In the wall distance computation, whatever is deallocated is also reallocated, so I don't understand why the rad values are still out of date after reallocating them and calling a residual.

I agree that this tiny bit of memory saving is nothing in terms of big-picture memory usage and how we use the code today, but I would at least like to understand why you were facing that bug in the first place.

joanibal commented 11 months ago

Found it again. The issue is that in the warping component the residuals are not reevaluated after updating the wall distances (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/adflow/mphys/mphys_adflow.py#L340-L342). This is not fixed by the following calls, because it looks like no update is needed since the states and coordinates already match.

I've added prints for every call to setAeroProblem, getResidual, and updateGeometryInfo, and for every time radi-j-k are erased. As you can see, just before we hit a NaN in the adjoint solve, radi-j-k are erased in the last derivative pass. This issue wouldn't show up if you only had one function of interest.

```
  *calling setAeroProblem*
-> Alpha... 0.000000 
 *calling updateGeometryInfo*
coordsAreEqual True
statesAreEqual True
setting aero problem
  *calling setAeroProblem*
-> Alpha... 0.000000 
 *calling updateGeometryInfo*
coordsAreEqual True
forces jacvec updatesMade: False
Solving linear in mphys_adflow
Solving ADjoint Transpose with PETSc...
   0 KSP Residual norm 9.9701713597E+01
  10 KSP Residual norm 9.3972313250E+01
  20 KSP Residual norm 1.4586560139E+01
  30 KSP Residual norm 4.1446054106E-01
  40 KSP Residual norm 2.0087418658E-02
  50 KSP Residual norm 7.1085667748E-04
  60 KSP Residual norm 2.7533446067E-05
  70 KSP Residual norm 1.5497907585E-06
  80 KSP Residual norm 9.8505053824E-08
  90 KSP Residual norm 4.1922545475E-09
 100 KSP Residual norm 2.0938772532E-10
Solving ADjoint Transpose with PETSc time (s) =    12.82
 Norm of error = 8.7413E-10    Iterations =  100
 ------------------------------------------------
 PETSc solver converged after   100 iterations.
 ------------------------------------------------
  *calling setAeroProblem*
-> Alpha... 0.000000 
 *calling updateGeometryInfo*
coordsAreEqual True
statesAreEqual True
Updating surface coords
 *calling updateGeometryInfo*
     ERASING RADI-JK
FD jacobian calcs with dvgeovsp took 5.482940673828125 seconds in total
updating the vsp model took 4.96399450302124 seconds
evaluating the new points took 0.5003964900970459 seconds
communication took 0.0013697147369384766 seconds
setting aero problem
  *calling setAeroProblem*
-> Alpha... 0.000000 
 *calling updateGeometryInfo*
coordsAreEqual True
forces jacvec updatesMade: False
Solving linear in mphys_adflow
Solving ADjoint Transpose with PETSc...
Solving ADjoint Transpose with PETSc time (s) =     0.04
 Norm of error =        NaN    Iterations =    0
 ------------------------------------------------
 PETSc solver converged after     0 iterations.
 ------------------------------------------------
```

joanibal commented 11 months ago

So we could add a residual call to the derivative routines of the mesh-warping component (roughly as sketched below), but my gut says this issue will pop up again. I still believe it is better to just stop erasing radi-j-k.
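For reference, the alternative being discussed would look roughly like the following inside the warper's reverse-mode routine. This is a hedged sketch with placeholder attribute names (`self.solver`, `self.ap`), not the actual mphys_adflow code:

```python
# Hypothetical sketch of the alternative fix: force a residual evaluation
# before taking reverse-mode products in the mesh-warping component.
# `self.solver` and `self.ap` are placeholders for however the component
# stores the ADflow instance and the AeroProblem.
def compute_jacvec_product(self, inputs, d_inputs, d_outputs, mode):
    if mode == "rev":
        # Repopulate intermediate data (radi/radj/radk, wall distances, ...)
        # that the preceding geometry update may have erased.
        self.solver.getResidual(self.ap)
        # ... existing reverse-mode jac-vec products follow here ...
```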

anilyil commented 11 months ago

Very interesting. I think this may also be relevant for https://github.com/mdolab/adflow/pull/308. The current code in this branch does not have that fix, so the surface mesh is always getting updated...

I now agree that not deleting radi-j-k might be the best way to go here. Deallocating and re-allocating those 6 arrays probably just costs time for no reason.

I say let's do this: we can merge #308 now, and you can test your code again with those fixes. If you are satisfied, I am fine with approving this PR as is.

anilyil commented 11 months ago

Merged that other PR. Can you please merge main (I am not doing it to avoid messing up your local installations) and test again? If you are still satisfied I can merge.

joanibal commented 11 months ago

> Merged that other PR. Can you please merge main (I am not doing it to avoid messing up your local installations) and test again? If you are still satisfied I can merge.

By "test again" do you mean let the CI run again, or check whether the NaN error remains without this PR?

anilyil commented 11 months ago

The CI won't be affected. Can you run your own MPhys derivative tests to make sure things are still correct after merging that last PR? The changes in that PR alter the logic of the geometry/surface updates, so the behavior will be different. I expect your fixes to still work, but I wanted to double-check.

joanibal commented 11 months ago

Yep, it still works; I just tested it.