Closed: giovannipizzi closed this 6 years ago
I'm going to use git bisect to check when the problem was introduced.
I've run the git bisect. For future reference: in the top-level folder I created a strange_example folder containing the files attached above, and a test_revision.sh script with the following content:
#!/bin/bash
## Note: code 125 means the current code cannot be tested
#make -j clean
make -j wannier --silent || exit 125
cd strange_example || exit 125
rm -f gaas.wout || exit 125
wannier90.x gaas
cd .. || exit 125
if grep -q "Maximum number of disentanglement iterations reached" strange_example/gaas.wout
then
    echo "$(git rev-parse HEAD): BAD" ; exit 1
else
    echo "$(git rev-parse HEAD): GOOD" ; exit 0
fi
and finally ran git bisect using:
git bisect start
git bisect bad develop
git bisect good v2.1
git bisect run ./test_revision.sh
The final result is
There are only 'skip'ped commits left to test.
The first bad commit could be any of:
9125408b41c8fd4915097cfcbbade4044979e149
548d18eec8500b1f3f659c868b83bc9d1c279fa7
We cannot bisect more!
bisect run cannot continue any more
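As background for the 'skip'ped commits in the output above: git bisect run decides what to do with each revision purely from the exit code of the test script. The following is a minimal sketch of that mapping; the function name bisect_verdict is made up for illustration and is not part of git.

```python
def bisect_verdict(exit_code):
    """Return how `git bisect run` treats a given exit code of the test script.

    Hypothetical helper for illustration only; it just restates the rules
    from the git-bisect documentation.
    """
    if exit_code == 0:
        return "good"   # revision is marked good
    if exit_code == 125:
        return "skip"   # revision cannot be tested (e.g. it does not build)
    if 1 <= exit_code <= 127:
        return "bad"    # revision is marked bad
    return "abort"      # any other code stops the bisect run


# In test_revision.sh, a failing `make` leads to `exit 125`, i.e. a skip,
# while finding the error message in gaas.wout leads to `exit 1`, i.e. bad.
for code in (0, 1, 125, 128):
    print(code, bisect_verdict(code))
```

When every remaining candidate is skipped, git can only narrow the first bad commit down to a range, which is what the message above reports.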
More than one commit is listed because the problem was introduced in one of them, but those commits weren't compiling, so they were skipped... Quick link to the two commits:
comms_gatherv_cmplx
it's quite small, I'll check if there is a bug there, but I hope not.
Knowing the bug is in one of these two commits, we can try to check out the most recent of them (it's https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7), see if it's easy to make it compile, and work on this.
Small correction: https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7 compiles correctly so one can use that to test.
I also think that the following commit has an issue:
https://github.com/wannier-developers/wannier90/commit/548d18eec8500b1f3f659c868b83bc9d1c279fa7#diff-a43c4b0311fff511fe0fcd43b64f8da8R858
I think that this should be zcopy and not dcopy, as this is dealing with complex numbers.
@mostofi @jryates do you agree?
However, this is not the source of the problem (the same problem persists even after fixing this).
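To make the dcopy vs zcopy concern concrete, here is a small pure-Python sketch (not the wannier90 code). It assumes that the element count passed to the copy routine is the number of complex values, and it models a double-precision complex array as interleaved (real, imaginary) doubles, which is how Fortran stores it. The helper names are made up for illustration.

```python
from array import array


def _flatten(zs):
    """Store complex values as interleaved (re, im) doubles."""
    flat = array("d")
    for z in zs:
        flat.extend((z.real, z.imag))
    return flat


def _unflatten(flat):
    """Rebuild complex values from interleaved (re, im) doubles."""
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def copy_as_real(zs):
    """Mimic a dcopy-style copy: n counts doubles, so only n doubles move."""
    src = _flatten(zs)
    dst = array("d", [0.0]) * len(src)
    n = len(zs)              # assumed: caller passes the number of complex values
    dst[:n] = src[:n]        # copies only the first half of the data
    return _unflatten(dst)


def copy_as_complex(zs):
    """Mimic a zcopy-style copy: n counts complex values, i.e. 2*n doubles."""
    src = _flatten(zs)
    dst = array("d", [0.0]) * len(src)
    n = len(zs)
    dst[:2 * n] = src[:2 * n]
    return _unflatten(dst)


data = [1 + 2j, 3 + 4j]
print(copy_as_real(data))     # second value is left at zero
print(copy_as_complex(data))  # both values are copied
```

Under these assumptions, the real-valued copy silently transfers only half of the array, which is why a complex-aware routine is needed here.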
As an additional comment: for the git bisect above, the code was compiled in parallel (when available) and run with 1 CPU. The same result is obtained when always compiling in serial.
Additional debug with the help of GDB: in parallel, the code gets stuck here:
if ( num_wann.gt.ndimfroz(nkp) ) then
call comms_gatherv(u_matrix_opt_loc,num_bands*num_wann*counts(my_node_id),&
u_matrix_opt,num_bands*num_wann*counts,num_bands*num_wann*displs)
call comms_bcast(u_matrix_opt(1,1,1),num_bands*num_wann*num_kpts)
endif
which makes sense, because the if condition can evaluate differently at each k-point, i.e. on each node: some nodes then enter the communication calls while others skip them, and the ones that entered wait indefinitely. I'm not sure I understand why this code block is inside an if...
OK, I think the above is the problem. Always executing the statements (i.e. outside the if) seems to fix the bug.
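The hang can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the wannier90 code: threads stand in for MPI ranks, a threading.Barrier stands in for a collective call such as comms_gatherv, and a timeout is used so that the demo terminates (a real collective would wait forever). The function name run_ranks is made up for illustration.

```python
import threading


def run_ranks(enter_collective, timeout=0.5):
    """Simulate ranks that each decide locally whether to enter a collective.

    enter_collective: one bool per simulated rank (the local 'if' result).
    Returns one outcome per rank: 'ok', 'skipped' or 'stuck'.
    """
    n = len(enter_collective)
    barrier = threading.Barrier(n)   # completes only if all n ranks arrive
    results = [None] * n

    def rank(i):
        if not enter_collective[i]:
            results[i] = "skipped"   # this rank never reaches the collective
            return
        try:
            barrier.wait(timeout=timeout)
            results[i] = "ok"
        except threading.BrokenBarrierError:
            results[i] = "stuck"     # with real MPI this would hang forever

    threads = [threading.Thread(target=rank, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_ranks([True]))          # a single rank cannot hang
print(run_ranks([True, True]))    # all ranks agree: the collective completes
print(run_ranks([True, False]))   # ranks disagree: the entering rank is stuck
```

In the simulation, a run with one rank never gets stuck, while two ranks get stuck as soon as their conditions differ, which matches the hang being seen only with 2 or more CPUs. Making every rank call the collective unconditionally removes the mismatch.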
Also, the bug of dcopy vs zcopy has already been fixed in develop.
I'm opening a PR for this.
In the current develop, there is a problem with disentanglement with the attached (self-contained) inputs: strange_example.tar.gz
We get, when running in serial or in parallel with 1 CPU (tested with gfortran 5.4 on Ubuntu 16.04, but the same behavior seems to be reproduced on a Mac with gfortran):
Note that if run in parallel with e.g. 2 CPUs, the code 'hangs' after
Instead, in v2.1 it works and one would get:
Note that in develop the Omega_I increases instead of decreasing.