Closed joanibal closed 11 months ago
Merging #311 (6476180) into main (ca18a7e) will not change coverage. The diff coverage is n/a.
```
@@           Coverage Diff           @@
##             main     #311   +/-   ##
=======================================
  Coverage   41.94%   41.94%
=======================================
  Files          13       13
  Lines        4005     4005
=======================================
  Hits         1680     1680
  Misses       2325     2325
```
The tests were passing before I merged in main, and the error seems like a path or dependency issue:
```
/bin/sh: 1: /home/vsts/work/1/s/tapenade3.10/bin/tapenade: not found
make: *** [Makefile_tapenade:305: ad_forward] Error 127
Tapenade runscript failed
```
@joanibal the test failures are due to the network outage on the UM campus. Once it is restored, we should be able to rerun the tests.
@joanibal thanks for this, did you do any comparison of performance (in terms of memory usage or runtime) with and without this change?
I haven't done any comparison.
radi, radj, and radk have dimensions ie*je*ke, which is about the number of cells, so the memory usage is about 3 × num_cells × 8 bytes. For a million-cell mesh this would be about 24 megabytes. This sounds lower than I would expect, since someone went through the effort of temporarily de-allocating these variables. However, I didn't see much of a change in memory usage for my toy problem.
In regard to wall time, if anything I would expect this to be very slightly faster, since you don't have to reallocate memory during each sensitivity analysis. But I wouldn't expect the difference to be noticeable.
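The back-of-envelope memory estimate above can be checked with a few lines of Python (the mesh size and array count are taken from the comment; the byte count assumes double-precision reals):

```python
# Rough memory estimate for the three spectral-radius arrays (radi, radj, radk),
# each holding one value per cell (dimensions roughly ie*je*ke).
num_cells = 1_000_000   # example million-cell mesh from the comment
bytes_per_real = 8      # double precision, an assumption about the build
num_arrays = 3          # radi, radj, radk

total_bytes = num_arrays * num_cells * bytes_per_real
print(f"{total_bytes / 1e6:.0f} MB")  # -> 24 MB
```

This matches the ~24 MB figure quoted above, which is indeed small next to the memory footprint of the primal solve.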
#308 has not been merged yet, so maybe we should address this all in the same PR?
It would be nice if we could get this merged quickly since the code in the current docker images doesn't work. So I'm for whatever makes that happen.
Thanks for the explanation. These changes seem fine to me on the surface, but my knowledge of the Fortran parts of ADflow is minimal. Do you have any results you could post here that show that this fix works?
Here is a comparison of some of the derivatives with FD:
-----------------
Total Derivatives
-----------------
Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSecCurve_15:Circle_Diameter'
Analytic Magnitude: 2.283213e+02
Fd Magnitude: 2.282604e+02 (fd:None)
Absolute Error (Jan - Jfd) : 6.093151e-02 *
Relative Error (Jan - Jfd) / Jfd : 2.669386e-04 *
Raw Analytic Derivative (Jfor)
[[-228.32132549]]
Raw FD Derivative (Jfd)
[[-228.26039398]]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSecCurve_8:Circle_Diameter'
Analytic Magnitude: 8.221510e+00
Fd Magnitude: 8.733695e+00 (fd:None)
Absolute Error (Jan - Jfd) : 5.121854e-01 *
Relative Error (Jan - Jfd) / Jfd : 5.864476e-02 *
Raw Analytic Derivative (Jfor)
[[-8.22150967]]
Raw FD Derivative (Jfd)
[[-8.73369509]]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Full Model: 'cruise.aero_post.drag' wrt 'geom_dvs.nacelle:XSec_4:TopLAngle'
Analytic Magnitude: 1.471658e-01
Fd Magnitude: 1.460313e-01 (fd:None)
Absolute Error (Jan - Jfd) : 1.134546e-03 *
Relative Error (Jan - Jfd) / Jfd : 7.769199e-03 *
Raw Analytic Derivative (Jfor)
[[0.14716581]]
Raw FD Derivative (Jfd)
[[0.14603126]]
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
I want to review this before we merge. Please bear with me in the next 2-3 days because we have the hackathon going as well.
bumping for visibility.
I agree that the intermediate variables need to be kept up to date too.
However, the way I see it, there are two issues here. The issue isn't just that the radi-j-k variables aren't consistent with the current state; it is that they have been erased. One has to be very careful when using the updateGeometryInfo method, and it doesn't even seem that the memory saved is significant (the max memory usage will be during the primal solve anyway, right?). So I think there are two separate issues:
1) Intermediate variables need to stay consistent with the state and geometric variables.
2) The radi-j-k variables are being quietly erased.
Problem 1 is already addressed (by @A-CGray) by calling the residuals when necessary (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/adflow/mphys/mphys_adflow.py#L1010-L1013) when either the states or mesh coordinates change. Problem 2 is fixed with this PR in my view.
A geometry-only change in the warper, for example, should only require jac-vec products w.r.t. geometry, so the radi-j-k values that are getting re-allocated with wrong values should not affect it.
I don't know if I understand this. My thinking was that geometric changes will affect all outputs in general, some of which will depend on radi-j-k. For example, all the functional jac-vec products are non-zero for this test (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/tests/reg_tests/refs/funcs_euler_upwind_tut_wing.json#L123-L152).
If we were calling getResidual after setting the state anyway, I don't understand how radi-j-k was not being updated. In the wall distance computation, whatever is being de-allocated is also being re-allocated. So I don't understand why the rad values are still out of date after re-allocating them and calling a residual.
I agree that the tiny bit of memory saving is nothing in terms of big-picture memory usage and how we use the code today, but I would at least like to understand why you were facing that bug in the first place.
Found it again. The issue is that in the warping component the residuals are not reevaluated after updating the wall distances (https://github.com/mdolab/adflow/blob/ab4323d7d415621a64cf6c6c87448f6f2c341047/adflow/mphys/mphys_adflow.py#L340-L342). This is not fixed in the following calls because it looks like no update is needed since the states and coordinates match.
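To illustrate the failure mode described above, here is a minimal, hypothetical Python sketch (the class and method names are invented, not ADflow's actual API): a cache that skips recomputation when the states and coordinates match can miss an invalidation performed as a side effect, such as the wall-distance routine erasing radi/radj/radk.

```python
# Hypothetical sketch of the caching pattern described in this thread.
class SolverCache:
    def __init__(self):
        self.coords = None
        self.states = None
        self.rad_valid = True  # are radi/radj/radk consistent?

    def update_wall_distances(self, coords):
        # Side effect: the wall-distance computation de-allocates and
        # re-allocates the rad arrays, leaving stale values behind.
        self.coords = coords
        self.rad_valid = False

    def get_residual(self, coords, states):
        # Skips the recompute when the inputs look unchanged, even though
        # the rad arrays were invalidated as a side effect above.
        if coords == self.coords and states == self.states:
            return  # "no update is needed" -- the bug path
        self.coords, self.states = coords, states
        self.rad_valid = True

cache = SolverCache()
cache.get_residual((1, 2), (3, 4))   # normal update, rad arrays consistent
cache.update_wall_distances((1, 2))  # erases rad arrays; coords unchanged
cache.get_residual((1, 2), (3, 4))   # looks up to date, so nothing recomputed
print(cache.rad_valid)  # -> False: stale rad values would reach the adjoint
```

The key point is that the equality check on states and coordinates cannot see the invalidation, which is why the later derivative pass picks up erased radi-j-k values.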
I've added prints for every time setAeroProblem, getResiduals, or updateGeometryInfo is called, and for every time the radi-j-k arrays are erased.
As you can see, just before we encounter a NaN in the adjoint solve, the radi-j-k arrays are erased during the last derivative pass. This issue wouldn't show up if you only had one function of interest.
*calling setAeroProblem*
-> Alpha... 0.000000
*calling updateGeometryInfo*
coordsAreEqual True
statesAreEqual True
setting aero problem
*calling setAeroProblem*
-> Alpha... 0.000000
*calling updateGeometryInfo*
coordsAreEqual True
forces jacvec updatesMade: False
Solving linear in mphys_adflow
Solving ADjoint Transpose with PETSc...
0 KSP Residual norm 9.9701713597E+01
10 KSP Residual norm 9.3972313250E+01
20 KSP Residual norm 1.4586560139E+01
30 KSP Residual norm 4.1446054106E-01
40 KSP Residual norm 2.0087418658E-02
50 KSP Residual norm 7.1085667748E-04
60 KSP Residual norm 2.7533446067E-05
70 KSP Residual norm 1.5497907585E-06
80 KSP Residual norm 9.8505053824E-08
90 KSP Residual norm 4.1922545475E-09
100 KSP Residual norm 2.0938772532E-10
Solving ADjoint Transpose with PETSc time (s) = 12.82
Norm of error = 8.7413E-10 Iterations = 100
------------------------------------------------
PETSc solver converged after 100 iterations.
------------------------------------------------
*calling setAeroProblem*
-> Alpha... 0.000000
*calling updateGeometryInfo*
coordsAreEqual True
statesAreEqual True
Updating surface coords
*calling updateGeometryInfo*
ERASING RADI-JK
FD jacobian calcs with dvgeovsp took 5.482940673828125 seconds in total
updating the vsp model took 4.96399450302124 seconds
evaluating the new points took 0.5003964900970459 seconds
communication took 0.0013697147369384766 seconds
setting aero problem
*calling setAeroProblem*
-> Alpha... 0.000000
*calling updateGeometryInfo*
coordsAreEqual True
forces jacvec updatesMade: False
Solving linear in mphys_adflow
Solving ADjoint Transpose with PETSc...
Solving ADjoint Transpose with PETSc time (s) = 0.04
Norm of error = NaN Iterations = 0
------------------------------------------------
PETSc solver converged after 0 iterations.
------------------------------------------------
So we could add a residual call to the derivative routines of the mesh warping component, but my gut is that this issue will pop up again. I still believe it is better to just stop erasing radi-j-k.
Very interesting. I think this may also be relevant for https://github.com/mdolab/adflow/pull/308. The current code in this branch does not have that fix, so the surface mesh is always getting updated...
I now agree that not deleting radi-j-k might be the best way here. De-allocating and re-allocating those 6 arrays probably just costs time for no reason.
I say let's do this: we can merge #308 now, and you can test your code again with those fixes. If you are satisfied, I am fine with approving this PR as is.
Merged that other PR. Can you please merge main (I am not doing it to avoid messing up your local installations) and test again? If you are still satisfied I can merge.
By test again do you mean let the CI run again or check if the nan error remains without this PR?
The CI won't be affected. Can you run your own MPhys derivative tests to make sure things are correct again after merging that last PR? The changes in that PR alter the logic of geometry/surface updates, so the behavior will be different. I expect your fixes to still work, but I wanted to double check.
Yep it still works, I just tested it.
Purpose
This PR fixes a subtle memory error caused by #308. After that refactoring of how the volume and state data are set, it is now possible to update the geometry data without updating the state data. This ought to be the way things are. However, the geometry update routines free some data while running to reduce the total memory usage and then reallocate that data once done (see computeWallDistance). The radi, radj, and radk data of each block were being erased and not recomputed before running the reverse-mode jac-vec products, so the radi, radj, and radk values used were sometimes incorrect or NaNs. This was not the case all the time, though, as some optimizations did work for me. I fixed this by changing a flag so that computeWallDistance stops reallocating radi, radj, and radk. I'm not sure what other implications this has...
Expected time until merged
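The shape of the fix can be sketched in a few lines of Python (ADflow's actual routine is Fortran; the function signature, flag name, and data layout here are invented for illustration only): the de-allocate/re-allocate step in the wall-distance routine is guarded by a flag, so the rad arrays survive a geometry-only update.

```python
# Hypothetical sketch, not ADflow's API: guard the temporary de-allocation of
# the spectral-radius arrays behind a flag so they are left untouched.
def compute_wall_distance(blocks, free_spectral_radii=False):
    for block in blocks:
        if free_spectral_radii:
            # Old behavior: temporarily free radi/radj/radk to save memory.
            block["radi"] = block["radj"] = block["radk"] = None
        # ... wall-distance computation would go here ...
        if free_spectral_radii:
            # Re-allocation leaves the arrays with meaningless values until
            # the next residual evaluation recomputes them.
            block["radi"], block["radj"], block["radk"] = [0.0], [0.0], [0.0]

block = {"radi": [1.0], "radj": [2.0], "radk": [3.0]}
compute_wall_distance([block])  # flag off: rad arrays keep their values
print(block["radi"])  # -> [1.0]
```

With the flag off, a geometry-only update no longer leaves stale rad values for a later reverse-mode jac-vec product to pick up.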
A few days since the main branch is broken
Type of change
Testing
This is tricky, since sometimes the error doesn't occur. I suggest running a long optimization with many constraints.
Checklist
- I have run flake8 and black to make sure the Python code adheres to PEP-8 and is consistently formatted
- I have formatted the Fortran code with fprettify, or C/C++ code with clang-format, as applicable