paesanilab / MBX

MBX is an energy and force calculator for data-driven many-body simulations.

VTune report and Performance optimization [WIP] #29

Closed (chemphys closed this issue 6 years ago)

chemphys commented 7 years ago

I will proceed to run VTune and see the bottlenecks of the computation.

chemphys commented 7 years ago

Hey @dgasmith ,

I got VTune to work. I am following the steps here, and for now I ran a hotspots collection. I guess now I can generate the reports.

Let me know how you want me to send you the reports, and which kind of reports you want. I will keep playing with it.

See you!

chemphys commented 6 years ago

Hey @dgasmith ,

When we were talking about looking into the scheduling, did you mean using dynamic, static, or guided, or adding chunk sizes? I think I found a straightforward tutorial with examples of everything. I will play around with it. Thanks for the advice yesterday!

dgasmith commented 6 years ago

That looks like a great starting tutorial! I'll keep that link for myself, to be honest :)

When it comes to scheduling, it really depends on your system. Static is the best, but it really requires you to know that each thread is going to take the exact same amount of time. I typically start out with guided these days, as my compute is rarely that even.
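For reference, a minimal sketch of the three schedule kinds on a toy loop (the array and loop body are just placeholders, not MBX code):

#include <cmath>

void schedule_examples(double *x, int n) {
    // static: iterations split into equal blocks up front; lowest overhead
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) x[i] = std::sqrt(x[i]);

    // dynamic: threads grab chunks (here 16 iterations) as they finish
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) x[i] = std::sqrt(x[i]);

    // guided: like dynamic, but the chunk size shrinks as the loop progresses
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < n; i++) x[i] = std::sqrt(x[i]);
}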

When it comes to VTune, the first thing is to verify your most costly sections and work on those at a fairly high level (including threading them). We can dive deeper into the vector and cache reports at a later time.

chemphys commented 6 years ago

@dgasmith , I have been trying to make the parallel version work, but there is something odd. When you run with 1 core, the code runs fine. However, running with more than one yields a dipole divergence, which means that the values of the field are not set properly.

If you look here you will see that I added the flag and reorganized the structure to do as much work as possible with one core. All the modifications I have made are in that loop. I create variables inside the loop (in principle private to each thread), I fill them, and then I merge them together using the atomic pragma in here. With a single core this loop works, but with more than one it doesn't. Can you have a look at this and see if there is something conceptually wrong that you can identify? Thanks!
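To be clear about the pattern I mean, here is a minimal sketch of it (the names and the loop are placeholders, not the actual MBX loop): thread-private buffers filled inside the parallel loop and merged into the shared array with an atomic update.

#include <cstddef>
#include <vector>

void accumulate_field(std::vector<double> &Efd_shared, std::size_t nsites) {
    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < nsites; i++) {
        // Declared inside the loop body, so private to each thread
        std::vector<double> Efd_local(Efd_shared.size(), 0.0);
        // ... fill Efd_local with the contributions of site i ...

        // Merge the private buffer into the shared one
        for (std::size_t j = 0; j < Efd_shared.size(); j++) {
            #pragma omp atomic
            Efd_shared[j] += Efd_local[j];
        }
    }
}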

dgasmith commented 6 years ago

Sure, your biggest issue is that your "Field" object has internal data that is causing race conditions. You will need one of these objects per thread, as the first part of those calls zeroes all of the data.

Something like:

// Needs <omp.h>, <memory>, and <vector>
int nthreads = 1;
#ifdef _OPENMP
    // omp_get_max_threads(): omp_get_num_threads() returns 1 outside a parallel region
    nthreads = omp_get_max_threads();
#endif
std::vector<std::unique_ptr<Field>> field_pool;
std::vector<std::vector<double>> Efd_2_pool;
for (int i = 0; i < nthreads; i++) {
    field_pool.push_back(std::make_unique<Field>(...));
    Efd_2_pool.push_back(std::vector<double>(max_size, 0.0));
}

Then inside the for loop, pick out a single object per thread:

// Get thread info
int rank = 0;
#ifdef _OPENMP
    rank = omp_get_thread_num();
#endif
// Grab a raw pointer: a unique_ptr cannot be copied out of the pool
Field *local_field = field_pool[rank].get();
local_field->DoEfdWoA(...);
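
And roughly how it could all sit together, with a serial reduction of the per-thread buffers afterwards instead of atomics inside the hot loop (nmon, Efd_2, and the DoEfdWoA arguments are placeholders, not the actual MBX code):

#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
#endif
for (size_t i = 0; i < nmon; i++) {
    int rank = 0;
#ifdef _OPENMP
    rank = omp_get_thread_num();
#endif
    // Each thread only ever touches its own Field object and scratch buffer
    field_pool[rank]->DoEfdWoA(...);
    // ... accumulate the contributions of monomer i into Efd_2_pool[rank] ...
}

// After the loop, merge the per-thread buffers into the shared array
for (int t = 0; t < nthreads; t++)
    for (size_t j = 0; j < Efd_2.size(); j++)
        Efd_2[j] += Efd_2_pool[t][j];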

Three more notes:

chemphys commented 6 years ago

Damn, I totally missed the field class... Thanks! I will fix this. I will keep pushing to the pull request.

Thanks a lot!!!

chemphys commented 6 years ago

Hey @dgasmith , I put here a few updates.

- I have renamed the classes and functions so they are more descriptive.
- I rewrote the function definitions of the ElectricFieldHolder class (the old name was Field) so they take pointers instead of doubles and I pass by reference. I don't know if that is what you had in mind. If not, I can make them functions and return a tuple as you suggested.
- I ifdef'ed some of the OMP pragmas, and I am looking for the other ones that might still be there.
- @bella-le completed the implementation of the other 8 ions in the code. We can now run energy calculations of all the alkali metal ions and halide ions in water. The ion-ion potentials are still not implemented, but they will be the next step once we have the performance solved.
- Together with @bella-le , we will start seeing if we can manually vectorize a polynomial file. We will start with toy systems such as simple loops with arithmetic operations, and then we will try to write a script that converts the polynomial files into vectorized files. Since this is not crucial, we will take our time to learn about manual vectorization.
- The parallel implementation of the most expensive part of the code is done and working. I am having a problem with the Intel compiler license on my machine, but Brendan is working on figuring out what is going on. The tests I have run use the g++ compiler. I will see how the speedup is in that concrete part of the code and make some scaling plots. I will compile the code on the UCSD supercomputer, which has up to 16 cores/node. That will give a good estimate of the scaling.

For now, I will focus on obtaining information about the performance. The electrostatics function can still be improved a little bit.

Let me know if there is anything you think I should do!

dgasmith commented 6 years ago

Sounds good to me. One word of caution: if those 16 cores/node are split over two physical chips, you will have some interesting NUMA issues with your scaling that you likely have not encountered yet.
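
If it is two chips, a common first step is first-touch initialization. A minimal sketch, assuming the shared arrays are plain heap buffers and not the actual MBX data structures:

#include <cstddef>
#include <memory>

void numa_aware_init(std::size_t n) {
    // new double[n] leaves the elements uninitialized, so the pages are only
    // mapped when first written; writing them in a parallel loop with the same
    // schedule as the compute loop places each page on the NUMA domain of the
    // thread that will later use it.
    std::unique_ptr<double[]> efd(new double[n]);
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; i++)
        efd[i] = 0.0;
    // ... run the compute loops over efd with the same static schedule ...
}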

chemphys commented 6 years ago

I think that TSCC has 16 real cores per chip, but I am not sure. I will take that into account. I will post the scaling here when it is done! Thanks!

chemphys commented 6 years ago

Hello @dgasmith , @fpaesani and @agoetz , here is a little update on the parallelization of the code.

As you all know, I have been working on improving the performance of the software @dgasmith and I are developing. After making the electrostatics as fast as we can for now on one core, we proceeded to add parallelization in the most expensive region of the code, which is the iterative part in which the dipole electric field is computed. Concretely, the function that takes most of the time in the electrostatics is the one that loops over the different monomers and gets the contributions to the dipole electric field (this one). Thus, I parallelized that function with the help of @dgasmith .

In order to see how good the speedup is for this function, I added some timestamps in the code and ran it for 3 different sizes (256, 1024 and 4096 water molecules). The plot attached shows the speedup of this part of the electrostatics function (intermolecular dipole electric field) as a function of the number of cores used for the three system sizes.
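
The timestamps are just wall-clock measurements around the call. A minimal sketch with std::chrono, where ComputeDipoleField is a stand-in name and not the actual MBX function:

#include <chrono>
#include <iostream>

void ComputeDipoleField();  // stand-in for the parallelized dipole field call

void time_dipole_field() {
    auto t0 = std::chrono::steady_clock::now();
    ComputeDipoleField();
    auto t1 = std::chrono::steady_clock::now();
    // Wall-clock seconds, printed once per run so speedups can be computed per core count
    std::cout << "dipole field: "
              << std::chrono::duration<double>(t1 - t0).count() << " s\n";
}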

[Plot: speedup of the intermolecular dipole electric field vs. number of cores for the three system sizes]

I don't know if we were expecting something better, but to me it seems pretty good. I also attach the data of this plot in dat files. Each of the three files (look at the file name for the system size) contains 5 columns:

dipole_electric_field_data.zip

I also attach the timings of the different contributions of the different parts of the electrostatics for one thread. Each file (look at the file name for the system size) contains the contributions corresponding to:

electrostatics_timing.zip

As you can see, the intermolecular dipole electric field takes practically all the time.

The tests were done on my local machine, which has 8 real cores, compiled with Intel 2017.

dgasmith commented 6 years ago

Cool, looks pretty good for 8 cores. In general I am pretty happy with 90%+ efficiency. We need to check higher core and chip counts; it would be good to do as well on 2x 16+ core chips.

chemphys commented 6 years ago

I will look at the architecture of Stampede2 (TACC). It has 64 cores, but I have no idea how many chips. I have to look in the documentation.

chemphys commented 6 years ago

I finally made the code compile on Stampede2. I was having some problems, but with the help of the people there it is done. I have been looking at the documentation and it seems that Stampede2 has 68 cores on the same socket (if you can double check page 3 of that document to confirm, I would appreciate it!).

These are the speedups I obtain:

[Plot: speedups on Stampede2 as a function of the number of cores, comparing schedules]

As you can see, the dynamic schedule seems to perform better. However, something interesting: for 256 water molecules I was getting a reasonable speedup on my machine with 8 cores, but on Stampede2 it doesn't seem to do as well. The code is compiled in the same way, and the input files are exactly the same. We can discuss this later in the meeting.
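
As a side note, to compare schedules without recompiling I could use schedule(runtime) and pick the schedule through the OMP_SCHEDULE environment variable. A sketch, with nmon standing in for the real loop bound:

#include <cstddef>

void dipole_field_loop(std::size_t nmon) {
    // schedule(runtime) reads the kind and chunk size from OMP_SCHEDULE,
    // e.g. OMP_SCHEDULE="dynamic,4" or OMP_SCHEDULE="static"
    #pragma omp parallel for schedule(runtime)
    for (std::size_t i = 0; i < nmon; i++) {
        // ... dipole electric field contributions for monomer i ...
    }
}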

See you!

dgasmith commented 6 years ago

Ah, the KNL architecture is quite different from Xeon, so your mileage will vary. See here for some docs; the pdf you linked is quite informative as well. I was hoping to optimize for a NUMA Xeon machine first, but we can work on KNL as well. This will mostly come down to minimizing data movement, which is a good thing to do regardless of architecture.

chemphys commented 6 years ago

Yeah, I already saw that pdf. The documentation on the TACC website is a little clearer to me. Maybe we can do the optimization thinking of the Xeon machines first, since all the other supercomputers we have available (I think) all have Xeon nodes: tscc (see the Hardware info at the end), comet, the DOD machines ...

chemphys commented 6 years ago

Closing this issue. I think we are where we want in performance for now.