MoZeWei closed this issue 4 years ago.

I am trying to improve the performance of move_p. When I use move_p_v4( ), which is the version of move_p( ) compiled when V4_ACCELERATION is defined, some data are calculated incorrectly, and after some number of steps (44 in my case) NaNs show up in the interpolator data. This corrupts everything downstream and gives wrong answers in the subsequent calculations. I tried to use gdb to trace the bug, but the cause is not obvious and is hard to find. What can I do?
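A minimal NaN trap can help localize the first bad value before reaching for gdb. The sketch below scans the interpolator array each step; the ex, ey, ez fields follow VPIC conventions, but the stand-in struct here is an assumption, not VPIC's real type:

```cpp
#include <cmath>
#include <cstdio>
#include <cstdlib>

// Minimal stand-in for the interpolator fields checked here; the real
// VPIC interpolator_t has more members, so this layout is an assumption.
struct interpolator_t { float ex, ey, ez; };

// Scan the interpolator array each step and abort on the first NaN so a
// debugger or core dump captures the earliest bad state.
static void check_interpolators( const interpolator_t * fi, int n, int step ) {
  for( int i = 0; i < n; i++ ) {
    if( std::isnan( fi[i].ex ) || std::isnan( fi[i].ey ) || std::isnan( fi[i].ez ) ) {
      std::fprintf( stderr, "step %d: first NaN at interpolator %d\n", step, i );
      std::abort();
    }
  }
}
```

Calling something like this at the top of every time step pins down the step and cell where corruption begins, which gives gdb a concrete place to break.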
Thanks for reporting this issue. First, I have a few questions.
1. What version of VPIC are you using? GitHub master or devel branch? Or a modification of either the master or devel branch?
2. What hardware are you running on where you experience the problem?
3. Can you tell us anything about the nature of the problem input deck? Do you have an input deck that you can share with us that reproduces the problem?
4. Related to 3) above, why are you trying to improve the performance of move_p? Is the nature of the problem you are trying to solve such that move_p dominates the performance?
5. At what scale are you running when this problem happens? Single node or multiple nodes? If multiple nodes, are you running at small scale or large scale?
6. What compiler are you using to build VPIC?
Second, there are some things you can try to debug the problem.
Try running your problem using the v4_portable version. When running with the v4_portable version, do you still get NaNs? If not, then you can selectively replace the v4 intrinsics wrapper functions in the v4 header file you are using with the portable implementation of the wrapper until the NaNs go away. That should help identify the intrinsics wrapper that is causing the problem.
If you still get NaNs when running with the portable version, then try running without v4 support at all. Hopefully, you will not get NaNs in this scenario. If you do, we will need to dig deeper.
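To make the bisection in the first suggestion concrete, here is a rough sketch of swapping one SSE wrapper body for its portable equivalent. The minimal v4float stand-in below only approximates VPIC's real class, and rsqrt is just an example wrapper; treat both as assumptions:

```cpp
#include <cmath>
#include <xmmintrin.h>

// Minimal stand-in for VPIC's v4float (the real class has more members
// and operators); this union layout is an assumption.
union v4float {
  __m128 v;
  float  f[4];
};

// Bisection idea: keep the wrapper's signature, but swap the SSE body
// for a portable scalar loop one wrapper at a time until the NaNs stop.
inline v4float rsqrt( const v4float &a ) {
  v4float b;
  // Suspect SSE implementation, commented out while bisecting:
  // b.v = _mm_rsqrt_ps( a.v );
  // Portable replacement:
  for( int j = 0; j < 4; j++ ) b.f[j] = 1.0f / std::sqrt( a.f[j] );
  return b;
}
```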
Hope this helps,
Dave
Thanks for paying attention to this. Replies to your questions are below.

1. The modified move_p and advance_p_pipeline code is cloned from the master branch. The source I have was originally written by KJB in 2004 and has been changed by several people since, but it still works as long as I use the move_p that does not call any functions defined in v4_sse.hxx. (It seems there are many versions of the VPIC code applied to various problems, and mine is one of them.)
2. and 5. I run this simulation on multiple nodes at large scale; each node has dual E5-4640 CPUs and 64 GB of RAM.
3. I do have a sigma.cxx that contains the input deck. How should I share it with you?
4. Our simulation spends a lot of time in move_p( ), and I want to apply it to multiple particles at a time instead of one.
6. I am using mpich-3.2.1 with icc-14.0.2 (gcc 4.8.0 compatibility mode).

I hope this description helps; if anything is unclear or incorrect, please ask. I will try your suggestions and report the results as soon as possible. Thanks a lot.
I found that v4_portable.h does not use any SSE intrinsics, and SSE is what I need to accelerate move_p( ). I also passed local_pm to move_p as a parameter everywhere it is called, as is done in advance_p( ), to satisfy the alignment requirement of SSE, but it still fails and produces NaNs, so I am confused again. I also noticed behavior that differs from the original program: while running, many calls to move_p( ) return 1, and I get warnings like "pipeline has run out of mover storage ...", which I never saw in the original, correctly working program. Interestingly, this happens many times before the first NaN appears in advance_p( ). Could this be caused by a memory-alignment mistake? I am new to HPC, and I sincerely hope we can discuss this and work it out. Thanks.
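For reference, the pattern being described looks roughly like this excerpt from advance_p-style code. This is only a sketch, not standalone code: DECLARE_ALIGNED_ARRAY is VPIC's macro, but the exact arguments, the mover field names, and the move_p parameter list vary across VPIC versions and are assumptions here:

```cpp
// Excerpt-style sketch of passing an aligned local mover into move_p,
// mirroring the pattern used in advance_p (details assumed; check them
// against your VPIC version).
DECLARE_ALIGNED_ARRAY( particle_mover_t, 16, local_pm, 1 );

// ... inside the particle loop, when a particle must cross a cell face:
local_pm->dispx = ux;        // remaining displacement (field names assumed)
local_pm->dispy = uy;
local_pm->dispz = uz;
local_pm->i     = p - p0;    // index of the particle being moved
if( move_p( p0, local_pm, a0, g, qsp ) ) {
  // Particle struck a domain boundary; stash it in the mover list if
  // storage remains. When nm hits max_nm you get the "pipeline has run
  // out of mover storage" warning you observed.
  if( nm < max_nm ) pm[nm++] = local_pm[0];
  else              itmp++;
}
```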
Thanks for reaching out.
You can add the file as a gist (https://gist.github.com/)
Thanks again for reaching out. It would be useful to see your input deck so we can see what sort of problem you are running that results in lots of calls to move_p. Vectorizing move_p over particles could be tricky for a variety of reasons. There are various efforts in progress to address performance issues like this, but nothing that can be shared at this time. What percentage of your run time is spent in move_p?

The hardware you mentioned supports 256-bit AVX, which would give you a vector length of 8. I'm not sure you would get much benefit from the extra vector length, since move_p with v4 already supports a vector length of 4, although the effective vector length is more like 3.

We have added support for additional intrinsics in the current GitHub master version, i.e. AVX, AVX2, and AVX-512. It might be useful to sync your version of VPIC up with the latest version on GitHub. If the pedigree of your current version is old enough, it might go back to what we called the "v407" version, which is no longer supported. There were some minor API changes between the older v407 variants and the current GitHub version.
Are you also able to provide log file output for a single node run? It would be interesting to see your profiling timing statistics to see where your type of runs are spending most of their time.
Thanks,
Dave
Thanks, but gist is blocked in mainland China. I will add files via gist later if needed.
Hi, I have now tried v4_portable.h, but NaNs still show up, so I don't think this is caused by memory alignment. I tracked some particles through move_p and they seem OK.
OK, problem solved. I found that adding -mfpmath=sse at compile time was causing the register-related errors. Thank you guys.
Glad you found the problem. Do you get much of a speedup compared to using the old V4 approach which vectorizes over the vector components?
Thanks,
Dave
Yes, about 10% at small scale. By the way, may I ask a question? How do you check the correctness of results after you change parts of the program? I am looking for a fast and reliable way to do it.
I consider the non-vectorized scalar version in GitHub to be my "correctness" standard that has been validated by VPIC users to give correct physics results. I then run my problem in the scalar version and print out the particle coordinates, including the cell index, for some number of particles. Often I choose the number to be a thousand or so. Then, I do the same with my "changed" version and compare the values of the particle coordinates. They should be nearly identical for the first few time steps.

You have to be careful to choose a problem that does not do any random number generation within a time step, because if the order of processing your particles changes, then the order in which random numbers get assigned can change. If the order in which particles get processed changes between your changed version and the reference version, you will probably need to sort your particle coordinate output to make it easy to compare. I generally print out text files and diff them with a visual diff tool like xxdiff or meld.

I usually have a branch I am working on with some name like devel and have another branch derived from it called devel_cout where I keep my print statements. That way, I don't have to worry about cluttering up my devel branch with a bunch of print statements that I will later have to clean up before merging my devel branch back into a release candidate.

I have found this to be a good strategy for testing changes that basically fit into the category of optimizations, because optimizations should not change the answer other than as a result of the usual issues of finite precision math and non-determinate order of operations. For good, stable numerical algorithms, those limitations should not cause significant differences for small numbers of time steps, but bugs will.
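A minimal dump routine along those lines might look like the following sketch; the particle_t stand-in only approximates VPIC's real struct, so treat its layout as an assumption:

```cpp
#include <cstdio>

// Minimal stand-in for the fields printed here; VPIC's particle_t also
// carries momentum and charge/weight, so this layout is an assumption.
struct particle_t { float dx, dy, dz; int i; };

// Write the first n particle coordinates (plus cell index) to a text
// file, one per line, so scalar and vectorized runs can be diffed with
// a visual tool like xxdiff or meld.
static void dump_particles( const particle_t * p, int n, int step ) {
  char fname[64];
  std::snprintf( fname, sizeof( fname ), "particles_step%04d.txt", step );
  std::FILE * f = std::fopen( fname, "w" );
  if( !f ) return;
  for( int j = 0; j < n; j++ )
    std::fprintf( f, "%d %d %.9e %.9e %.9e\n",
                  j, p[j].i, p[j].dx, p[j].dy, p[j].dz );
  std::fclose( f );
}
```

If the processing order differs between the two versions, sorting the output lines before diffing, as described above, keeps the comparison meaningful.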
There are other approaches that can be considered when developing new physics algorithms, as opposed to just optimizing an existing implementation that is presumed correct. One approach is the method of manufactured solutions, where you pick an analytic solution, derive the source terms it implies, and verify that the code reproduces it at the expected order of convergence. I have gotten a lot of benefit out of that approach in the past, but it often requires developing some infrastructure for your application.
Finally, the physics results, and the degree to which physical quantities such as energy are conserved, are the final arbiters of correctness. So, in summary, I have gotten good results with my first approach when I am just optimizing and not changing the actual solution. I also believe there can be an art to designing good test problems when they are needed.
It seems that your example should fit into the optimization category.
Hope that helps,
Dave
Thanks for such a great suggestion; I still have a few doubts, though. Your answer is brilliant and I learned a lot from it. I hope you can address those doubts; that would help a lot. Thanks.
Best wishes.