githubdoe / DFTFringe

DFTFringe Telescope Mirror interferometry analysis Program.
GNU General Public License v3.0
166 stars 59 forks source link

github build DFTFringe crashes when calling zapm #129

Closed gr5 closed 8 months ago

gr5 commented 8 months ago

current version of DFTFringe in master (tagged V7.2.0 at the moment - the commit merged in today 2/27/2024) is crashing when built with github. But works fine when I build it on my computer with QT (both release and debug). The good news is that it is consistent and happens with only seconds of usage. I either simulate an annular igram or analyze an annular igram. simulate is faster to get to the crash for me.

It's crashing on this line of code in initgrid() https://github.com/githubdoe/DFTFringe/blob/094eca98b0dae466022827cd26d6a24adae74797/zernikeprocess.cpp#L1513

It never gets to zapm() but crashes when creating the parameters to zapm(). Within armadillo I think.

I need to do other things this afternoon but I'm going to try using armadillo 12.6.1 (Dale and I use I think 12.6.7) and see if it crashes. If it does, then that's good (I guess?) and we can change the build process.

gr5 commented 8 months ago

Actually I'm thinking maybe it crashes somewhere inside of zapm() but the stack gets messed up a bit. I think I'll create a new branch with more debug tracing and have github build that. It could be that this is an important bug to fix and just somehow doesn't crash when I build it.

I'll try armadillo 12.6.1 first and if that crashes then I'll try to step through the code.
If it doesn't crash then I'll create a debugging branch.

githubdoe commented 8 months ago

That use to crash there for me when I used the older Armadillo. IIRC. Are you using the latest one?

gr5 commented 8 months ago

Well it's the same version I'm using on my computer which doesn't crash at all. It just came out a few months ago. There are indeed newer versions as one came out about 3 weeks ago.

gr5 commented 8 months ago

Unfortunately I'm going to be pretty busy this week. I do want to work on this as I love puzzles so I will probably find the time at some point. Certainly this weekend I should have a few hours.

I have a few deadlines coming up this week.

gr5 commented 8 months ago

I made some progress. It is crashing on line 68 below.

https://github.com/githubdoe/DFTFringe/blob/094eca98b0dae466022827cd26d6a24adae74797/zapm.cpp#L51-L74

According to this page: https://arma.sourceforge.net/docs.html

(search for 10.5) the values of matrix J should be initialized to zero but it seems like when I debug using QT they are not and instead are something like 1E-300. I only finally got a few hours to debug. Will continue hopefully tomorrow.

githubdoe commented 8 months ago

Looks to me that J does not need any initialization. It's values are set in that little for loop and not dependent on any previous value.

Trying to think why this works for Mike, me and you, but not the build. I don't know anything about the build process however but there has to be some difference there. Perhaps in how it uses or makes Armadillo mat's.

atsju commented 8 months ago

@gr5 just some random thoughts, did you try to mix libarmadillo.dll from github builds (the master one + #130) with your local build ? Does it change something ? sorry, libarmadillo.dll is not distributed. It's builtin directly.

As you spotted that the crash happens in eig_sym, can you reproduce with a custom (small) matrix or with a minimal code ? If so, we could open a bug to Armadillo. Even if the matrix is big, you could use https://arma.sourceforge.net/docs.html#save_load_mat and load it at beginning to see if we can easily reproduce with always same values.

gr5 commented 8 months ago

@githubdoe - not all values are initialized - it's a 14X14 matrix and the 14 diagonal values are initialized and then in that loop another 13X2 or 26 values are initialized just bordering the diagonal. And then that's it. Most of the matrix is uninitialized. Next I'll print out all the values before it calls eig_sym(). I only had a few hours last night. I'm not sure what's on the schedule today but hope to work on this.

@atsju , making a smaller matrix is a good idea. I'm going to see the values and maybe duplicate them or maybe zero them out. Supposedly starting version 10.5 (came out in 2008) of armadillo zeros the matrix by default but it's not doing that in QT debug mode and instead all the values are like 1E-315 but all slightly different values.

I specifically added fill::zeroes as a 3rd parameter when creating the matrix and if I remember right they still weren't zero. Confusing.

atsju commented 8 months ago

according to this site https://baseconvert.com/ieee-754-floating-point the smallest double representation in IEEE754 is around 0.5E-323 so I'm not too shocked. But on the other hand, double is supposed to handle true 0. How are you printing the matrix ? Maybe the matrix is zeroed out but the print is failing it ?

gr5 commented 8 months ago

Another thing I don't understand: armadillo uses an underlying math package. Either LAPACK, BLAS, or MKL. And by the way there are multi-threaded speedups that we could be using maybe? I'm not sure - maybe we are.

Anyway, we are using lapack. Maybe it matters what version of lapack is being used.

I doubt it. I think the issue is the un initialized values in the array. Will prove it soon hopefully.

atsju commented 8 months ago

This is interesting. On the online build, I use neither LAPACK or BLAS or anything. Maybe we rely on something that we dont have ? Check the build here : https://github.com/githubdoe/DFTFringe/actions/runs/8138263602/job/22238662884

image

If you tell me how you build armadillo locally, I can fix the build process. I suppose we never had bigger problem because it works in "degraded" mode and we did never degrade enough to fail.

atsju commented 8 months ago

OK I think I can confirm the issue comes from build process: This comes from Armadillo documentation: image

I will modify the build process to add Lapack as soon as I have some time. Probably during the week. @gr5 no need to spend more time on this issue for you. If you can mail me your armadillo+lapack build process in short, it would probably help me and speed the thing up.

Edit: It would also be great if you describe how to test. This ways I can let you know when the build process is indeed working.

gr5 commented 8 months ago

to test

To test, go into mirror config and make sure "annulus" box is checked. Set mirror diameter to 50% which should set the diameter to match half the mirror diameter. Then go to "simulations" "igram", click "ok" (doesn't matter what settings are). crash every time.

the problem

Almost 90% sure the problem is the missing libraries lapack.dll and libblas.dl. We don't need libarmadillo.dll but we need at least one of the other 2.

I've never built these but there is a document by Franz in the dftfringe code root folder: https://github.com/githubdoe/DFTFringe/blob/master/How%20to%20Build%20DFTFringe.pdf

githubdoe commented 8 months ago

For my build I don't use any armadillo .dll or external build of it. All that was needed were the includes which then get used and built by the compiler as needed.

atsju commented 8 months ago

For my build I don't use any armadillo .dll or external build of it. All that was needed were the includes which then get used and built by the compiler as needed.

Yes correct. what is missing here is lapack+blas. And Armadillo must be configured correctly to use it.

While the online build indeed has LAPACK, the Armadillo build is not configured to use it. Thus the include files are not making use of it and it fails for functions depending on lapack like eig_sym. I will fix it.

githubdoe commented 8 months ago

Thanks all. You guys are too fast for me. I had just realized after my last message that I use lpack dll and was looking to confirm it. Then for what ever reason. Qt creator disappeared from my windows main screen and needed to go find it. Then a doctor called and needed to talk to me. ..... Finally I'm back and yep I use an LPAck that I compile. I can provide more info if needed.

gr5 commented 8 months ago

By the way, before I posted here today I did try printing out all the values in that matrix and I was wrong - they are all initialized to zero as they should be in any version of Armadillo since 2008. I was a little confused by the QT debugger and thought some of the values were non-zero. It's really just that certain armadillo functions (which we have never used until annulus features) need to link up to lapack, specifically the eig_sym() function.

atsju commented 8 months ago

I compile Lapack too. The problem I have is how to compile armadillo so that it knows to use Lapack.

Maybe my lapack build is bad, maybe my armadillo build is bad, maybe both. I'm investigating but if you guys know how to build reliably, documenting it would help. The document done Franz seems incomplete but maybe I missed something.

githubdoe commented 8 months ago

I use the instructions in the .pdf "how to build DFTFringe" They work just fine for me. Of course you need to be a little fimilar with CMake.

gr5 commented 8 months ago

I compile Lapack too.

Oh! Well it's not in the release package. Maybe it's as simple as that. Include liblapack.dll in the release package. And libblas.dll as well.

atsju commented 8 months ago

Oh! Well it's not in the release package. Maybe it's as simple as that. Include liblapack.dll in the release package. And libblas.dll as well.

Nope That's not enough.

I have part of the answer now. I missed this from documentation image So at least armadillo is configured correctly

image

But I get errors on armadillo compilation now image

working on it...

atsju commented 8 months ago

I think I'm done for today.

Can one of you please ELI5 how you build lapack and armadillo. Remember we not all use the same tools and github action is like doing it on a fresh computer each time and only using command line. You probably think it's easy (and it probably is on your setup) but there are plenty of little undocumented things because you don't remember you need it first time or because the IDE abstracts everything for you.

I can figure out how to do everything in command line but as of now, I'm not able to do it at all. I would really appreciate if you share your experience but with all details. Capture intermediate build outputs or make a video so that I can compare. I'm interrested in both lapack and armadillo. Thanks a lot.

githubdoe commented 8 months ago

For me there is nothing to do for Armadillo other than to copy the source to my machine and make sure the include path is in the Qt project.

Lpack is made by using the windows CMake program and telling it the path to the source and your compiler as described in the how to build dftfringe pdf. It then creates the make file for me that I then use "Make" on IIRC. So I can not help you with any command line way to do it.

gr5 commented 8 months ago

I've never built any libraries for DFTF. I just use Dale's version of those dlls. I got them from his old installers. And then in the QT project file there is this:

LIBS += D:\lapack\build64\bin\liblapack.dll
LIBS += D:\lapack\build64\bin\libblas.dll

Which I believe tells the linker to somehow link calls to those functions to those dlls. Which doesn't make sense from when I learned how to build things in windows. In the "old" days we had a ".lib" file that you linked against that was paired up with a dll. Now the lib file is somehow inside the dll? Maybe? I don't know.

Did you look at these instructions here? https://github.com/githubdoe/DFTFringe/blob/master/How%20to%20Build%20DFTFringe.pdf

githubdoe commented 8 months ago

Yes a .dll is an advanced from of a .lib. One that lets several programs use the single copy held in local memory of the operating system instead of each program have it's own copy.

atsju commented 8 months ago

For me there is nothing to do for Armadillo other than to copy the source to my machine and make sure the include path is in the Qt project.

That's not the correct approach as you do not configure Armadillo. However I had a look and default everything should work so I might go this way to solve fast. I understand better the Armadillo configure/build process now.

I've never built any libraries for DFTF. I just use Dale's version of those dlls. I got them from his old installers

You must probably use the latest things from online build

Now the lib file is somehow inside the dll? Maybe? I don't know.

For Lapack, I indeed built .a instead .dll and thus they are linked statically into DFTFringe. Which is not the problem here. The problem is that my Armadillo build is configured wrong and Armadillo will suppose that it cannot rely on Lapack and Blas (even if it's available as .a or DLL). To build DLL instead .a I need to set BUILD_SHARED_LIBS https://icl.utk.edu/lapack-for-windows/lapack/index.html#build I might fix this later but as I said, it should just work same with .a.

Did you look at these instructions here?

Of course. As every instructions I have found they are more or less incomplete. Also, there are plenty of different ways to build the things that will work more or less correctly. That's why I insisted on the online build. To have something reliable and reproductible.

Conclusion: I'm making progress and will let you know when it works. #120 is incredibly important IMHO. Building DFTFringe on a fresh computer should take 2h without support, not be a motivation/performance test for futur contributors 😉

gr5 commented 8 months ago

Yes a .dll is an advanced from of a .lib. One that lets several programs use the single copy held in local memory of the operating system instead of each program have it's own copy.

I know this. I've written maybe 30 dlls from scratch. Using the old method and the "newer" (from 1990s?) "activeX" method (later called ".net") where they have to be registered. And written code to use dlls. I've dealt with the confusion where my code is using the wrong dll. But if my memory is correct, I always had to have a lib file to build the programs that used the dlls. Barring that I had to write extra code that manually loads the dll and retrieves call points so my program knows how to call those functions. At least this is how I remember it. I may remember it a bit wrong. But I'm pretty sure I've never been able to link to a dll with a linker. I always had to have a lib file.

atsju commented 8 months ago

You know what that means image