wannier-developers / wannier90

Official repository of the Wannier90 code
http://www.wannier.org
GNU General Public License v2.0

Issue with disentanglement #192

Closed giovannipizzi closed 6 years ago

giovannipizzi commented 6 years ago

In the current develop, there is a problem with disentanglement with the attached (self-contained) inputs: strange_example.tar.gz

When running in serial, or in parallel with 1 CPU, we get the following (tested with gfortran 5.4 on Ubuntu 16.04, but the same behavior seems to be reproduced on a Mac with gfortran):

                   Extraction of optimally-connected subspace                  
                   ------------------------------------------                  
 +---------------------------------------------------------------------+<-- DIS
 |  Iter     Omega_I(i-1)      Omega_I(i)      Delta (frac.)    Time   |<-- DIS
 +---------------------------------------------------------------------+<-- DIS
       1      13.86935244      14.17982587      -2.190E-02      0.00    <-- DIS
       2      13.86935244      14.17982587      -2.190E-02      0.01    <-- DIS
       3      13.86935244      14.17982587      -2.190E-02      0.01    <-- DIS
       4      13.86935244      14.17982587      -2.190E-02      0.01    <-- DIS
       5      13.86935244      14.17982587      -2.190E-02      0.02    <-- DIS
       6      13.86935244      14.17982587      -2.190E-02      0.02    <-- DIS
       7      13.86935244      14.17982587      -2.190E-02      0.03    <-- DIS
       8      13.86935244      14.17982587      -2.190E-02      0.03    <-- DIS
       9      13.86935244      14.17982587      -2.190E-02      0.03    <-- DIS
      10      13.86935244      14.17982587      -2.190E-02      0.04    <-- DIS
      11      13.86935244      14.17982587      -2.190E-02      0.04    <-- DIS
      12      13.86935244      14.17982587      -2.190E-02      0.05    <-- DIS
      13      13.86935244      14.17982587      -2.190E-02      0.05    <-- DIS
      14      13.86935244      14.17982587      -2.190E-02      0.05    <-- DIS
      15      13.86935244      14.17982587      -2.190E-02      0.06    <-- DIS
      16      13.86935244      14.17982587      -2.190E-02      0.06    <-- DIS
      17      13.86935244      14.17982587      -2.190E-02      0.06    <-- DIS
      18      13.86935244      14.17982587      -2.190E-02      0.07    <-- DIS
[...]

Note that if run in parallel with e.g. 2 CPUs, the code 'hangs' after

                   Extraction of optimally-connected subspace    

Instead, in v2.1 it works and one would get:

                   Extraction of optimally-connected subspace                  
                   ------------------------------------------                  
 +---------------------------------------------------------------------+<-- DIS
 |  Iter     Omega_I(i-1)      Omega_I(i)      Delta (frac.)    Time   |<-- DIS
 +---------------------------------------------------------------------+<-- DIS
       1      13.86935244      13.57561115       2.164E-02      0.16    <-- DIS
       2      13.57127531      13.56862337       1.954E-04      0.17    <-- DIS
       3      13.56816150      13.56789982       1.929E-05      0.18    <-- DIS
       4      13.56784542      13.56781480       2.257E-06      0.19    <-- DIS
       5      13.56780834      13.56780471       2.675E-07      0.20    <-- DIS
       6      13.56780394      13.56780351       3.184E-08      0.21    <-- DIS
       7      13.56780342      13.56780337       3.792E-09      0.22    <-- DIS
       8      13.56780335      13.56780335       4.522E-10      0.23    <-- DIS
       9      13.56780335      13.56780335       5.397E-11      0.24    <-- DIS
      10      13.56780335      13.56780335       6.457E-12      0.25    <-- DIS
      11      13.56780335      13.56780335       7.774E-13      0.26    <-- DIS

             <<<      Delta < 1.000E-10  over  3 iterations     >>>
             <<< Disentanglement convergence criteria satisfied >>>

Note that in develop the Omega_I increases instead of decreasing.
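The difference between the two runs can also be checked mechanically. The following is a minimal sketch, not part of Wannier90: `check_omega_decreasing` is a hypothetical helper name, and it assumes the `<-- DIS` column layout shown in the tables above (iteration, Omega_I(i-1), Omega_I(i), Delta, time). The fractional change is recomputed as Omega_I(i-1)/Omega_I(i) - 1, which is an inference from the printed numbers (it reproduces -2.190E-02 and 2.164E-02 for the first rows of the two tables), not something taken from the source code.

```shell
#!/bin/bash
# Hypothetical helper (not part of Wannier90): flag disentanglement
# iterations in which Omega_I grew instead of shrinking.
# Assumed column layout of a data row:
#   Iter  Omega_I(i-1)  Omega_I(i)  Delta  Time  <-- DIS
check_omega_decreasing() {
    awk '
        # Only rows tagged DIS whose first field is an iteration number;
        # this skips the table header and the separator lines.
        /<-- DIS/ && $1 ~ /^[0-9]+$/ {
            # Fractional change, inferred from the printed tables.
            delta = $2 / $3 - 1
            if ($3 > $2) {
                printf "iter %d: Omega_I increased (Delta = %.3E)\n", $1, delta
                bad = 1
            }
        }
        # Non-zero exit status if any iteration increased Omega_I.
        END { exit (bad ? 1 : 0) }
    ' "$1"
}
```

Called as `check_omega_decreasing strange_example/gaas.wout`, this should report every iteration of the develop run and return a non-zero status, while the v2.1 output should produce no messages and a zero status.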

giovannipizzi commented 6 years ago

I'm going to use git bisect to check when the problem was introduced.

giovannipizzi commented 6 years ago

I've run the git bisect. For future reference: I created a strange_example folder in the top-level folder, containing the files attached above, and I also created a test_revision.sh with the following content:

#!/bin/bash

## Note: code 125 means the current code cannot be tested
#make -j clean
make -j wannier --silent || exit 125
cd strange_example || exit 125
rm -f gaas.wout || exit 125
wannier90.x gaas
cd .. || exit 125
grep "Maximum number of disentanglement iterations reached" strange_example/gaas.wout > /dev/null
if [ "$?" == "0" ]
then
    echo "$(git rev-parse HEAD): BAD" ; exit 1
else
    echo "$(git rev-parse HEAD): GOOD" ; exit 0
fi

Finally, I ran the git bisect using:

git bisect start
git bisect bad develop
git bisect good v2.1
git bisect run ./test_revision.sh

giovannipizzi commented 6 years ago

The final result is

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
9125408b41c8fd4915097cfcbbade4044979e149
548d18eec8500b1f3f659c868b83bc9d1c279fa7
We cannot bisect more!
bisect run cannot continue any more

More than one commit is listed because the problem lies in one of these commits, but they could not be compiled, so the test script exited with code 125 and bisect skipped them. Quick link to the two commits:

Knowing the bug is in one of these two commits, we can try to check out the most recent of them (it's https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7), see if it's easy to make it compile, and work on this.
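The 'skip'ped commits in the output above come from the exit-code convention that `git bisect run` applies to the test script: 0 marks the revision as good, 125 marks it as untestable (skipped), any other code from 1 to 127 marks it as bad, and anything else aborts the bisection. A small sketch of that mapping, where `bisect_verdict` is a hypothetical name used only for illustration:

```shell
#!/bin/bash
# Hypothetical helper: translate the exit code of a `git bisect run`
# script into the verdict git derives from it.
bisect_verdict() {
    local code
    code=$1
    if [ "$code" -eq 0 ]; then
        echo good                      # revision is fine
    elif [ "$code" -eq 125 ]; then
        echo skip                      # revision cannot be tested
    elif [ "$code" -ge 1 ] && [ "$code" -le 127 ]; then
        echo bad                       # revision shows the problem
    else
        echo abort                     # bisection stops
    fi
}
```

In test_revision.sh a failed `make` exits with 125, so a revision whose build fails is skipped rather than blamed, which is how a bisection can end with several candidate commits.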

giovannipizzi commented 6 years ago

Small correction: https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7 compiles correctly so one can use that to test.

giovannipizzi commented 6 years ago

I also think that the following commit has an issue: https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7#diff-a43c4b0311fff511fe0fcd43b64f8da8R858

I think that this should be zcopy and not dcopy as this is dealing with complex numbers. @mostofi @jryates do you agree?

However, this is not the source of the problem (the same problem persists even after fixing this).

As an additional comment: the git bisect above was compiled in parallel (when available) and run with 1 CPU. The same result is obtained when compiling always in serial.
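Since the 2-CPU run hangs instead of failing, a test script for that case would need a time limit. A minimal sketch, assuming GNU coreutils `timeout` is available (it exits with status 124 when the limit is reached); `run_or_flag_hang` is a hypothetical helper, and the `mpirun -np 2 wannier90.x gaas` invocation in the comment is an assumption about how the parallel run is launched:

```shell
#!/bin/bash
# Hypothetical helper: run a command under a time limit (in seconds)
# and treat a timeout as a failure, so that a hanging run can be
# reported as "bad" by a bisect-style test script.
run_or_flag_hang() {
    local limit rc
    limit=$1
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # 124 is the status `timeout` returns when the limit was hit.
        echo "HANG: '$*' did not finish within ${limit}s" >&2
        return 1
    fi
    return "$rc"
}

# Assumed usage inside a test script (command line is an assumption):
#   run_or_flag_hang 60 mpirun -np 2 wannier90.x gaas || exit 1
```

The time limit has to be chosen well above the duration of a healthy run, otherwise slow but correct revisions would be flagged as hanging.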

giovannipizzi commented 6 years ago

Additional debug with the help of GDB: in parallel, the code gets stuck here:

         if ( num_wann.gt.ndimfroz(nkp) ) then  
            call comms_gatherv(u_matrix_opt_loc,num_bands*num_wann*counts(my_node_id),&
                 u_matrix_opt,num_bands*num_wann*counts,num_bands*num_wann*displs)
            call comms_bcast(u_matrix_opt(1,1,1),num_bands*num_wann*num_kpts)    
         endif

which makes sense: the condition depends on the specific kpoint, i.e. on the specific node, so the if statement can be true on one node and false on another. The nodes where it is false never enter the collective calls, so the other nodes wait for them forever. I'm not sure I understand why this code block is inside an if...

giovannipizzi commented 6 years ago

OK, I think the code block above is the problem. Always executing the statement seems to fix the bug. Also, the dcopy vs zcopy bug has already been fixed in develop. I'm opening a PR for this.