optados-developers / optados

Official Repository of the Optados code
http://www.optados.org

DOS changes with number of processors #35

Closed ajm143 closed 2 years ago

ajm143 commented 3 years ago

Bug chasing (RJN and AJM 15/12/20)

ome vs dome: both exhibit problems, but they don't give identical errors.

The up-spin channel changes with the number of processes for both adaptive and linear broadening.

Possible locations:

- CASTEP itself
- read in and allocate to different k-point arrays
- do the DOS calculations
- merge
- write out

Clues:

- 140 k-points: same answer on 1, 4, 5, 7 and 10 processes; all different (and from each other) on 3, 6 and 12.
- 56 k-points: same answer on 1 and 4; different on 10.
- Happens also when spin up = spin down (in the 1-core case); spin up <> spin down in the multicore case.

ajm143 commented 3 years ago

Can reproduce this bug on the Si2__DOS example using ifort debug in TCM. Ho hum, it's real...

ajm143 commented 3 years ago

Found it! The bug is a particularly nasty array-shape passing error in electronic.f90: the band_gradient array gets mangled on send to any node that has a different number of k-points from the master node.

Description

This happens whenever nkpoints mod nodes /= 0. The master then has one more k-point than some of the slaves. However, the master reads into a band_gradient array whose k-point dimension is num_kpoints_on_node(0), not num_kpoints_on_node(slave_node). Hence it never writes to the final k-point in band_gradient, but then sends the array with length nbands*3*nspins*num_kpoints_on_node(inode) rather than nbands*3*nspins*num_kpoints_on_node(0). This causes the is=1 data to be misaligned (the spins come after the k-points in the array's dimension order).
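For what it's worth, here is a minimal standalone sketch of the mechanism, not the optados code itself: nk_master, nk_slave and the (nbands, 3, k-points, spins) dimension ordering are assumptions based on the description above. It fills an oversized buffer for only nk_slave k-points, then reinterprets the first nbands*3*nspins*nk_slave flattened elements with the slave's shape:

```fortran
program shape_mismatch_demo
  ! Sketch only (hypothetical names/shapes): shows how sending a contiguous
  ! slice of length nbands*3*nspins*nk_slave from a buffer whose k-point
  ! dimension is nk_master (= nk_slave + 1) misaligns the spin blocks.
  implicit none
  integer, parameter :: nbands = 2, nspins = 2
  integer, parameter :: nk_master = 3, nk_slave = 2
  real, allocatable :: buffer(:,:,:,:), received(:,:,:,:), flat(:)
  integer :: ib, ix, ik, is, n

  allocate(buffer(nbands, 3, nk_master, nspins))
  buffer = -1.0                      ! the final k-point slot is never written
  do is = 1, nspins
    do ik = 1, nk_slave              ! only nk_slave k-points hold real data
      do ix = 1, 3
        do ib = 1, nbands
          buffer(ib, ix, ik, is) = real(1000*is + 100*ik + 10*ix + ib)
        end do
      end do
    end do
  end do

  ! "Send" the first nbands*3*nspins*nk_slave contiguous elements, as the
  ! buggy call effectively does, and reinterpret them with the slave's shape.
  n = nbands*3*nspins*nk_slave
  allocate(flat(size(buffer)))
  flat = reshape(buffer, [size(buffer)])
  allocate(received(nbands, 3, nk_slave, nspins))
  received = reshape(flat(1:n), [nbands, 3, nk_slave, nspins])

  ! The second spin block in the oversized buffer starts nbands*3 elements
  ! later than the receiver expects, so the received spin-2 data is either
  ! the unwritten -1 padding or values belonging to the wrong spin/k-point.
  print *, 'expected spin-2, k=1 value:', buffer(1, 1, 1, 2)
  print *, 'received spin-2, k=1 value:', received(1, 1, 1, 2)
end program shape_mismatch_demo
```

Under this assumed ordering the spin blocks shift by nbands*3 elements per missing k-point; exactly which spin/k-point slots end up wrong in optados depends on the real dimension ordering of band_gradient.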

Proposed Solution

Dimension the k-point axis of all of the band_gradient arrays as num_kpoints_on_node(0) rather than num_kpoints_on_node(my_node_id). This does increase the memory requirements on some slave nodes, but they should be able to handle it, since the master can.
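A minimal sketch of that allocation change, as a standalone toy program; the identifiers follow the discussion above and the actual declarations in electronic.f90 may differ:

```fortran
program allocation_sketch
  ! Sketch only: every node sizes the k-point axis of band_gradient by the
  ! master's count, num_kpoints_on_node(0), so the buffer shape (and hence the
  ! flattened send/receive length) is identical on the master and every slave.
  implicit none
  integer, parameter :: nbands = 4, nspins = 2, num_nodes = 3
  integer :: num_kpoints_on_node(0:num_nodes-1)
  integer :: my_node_id
  real, allocatable :: band_gradient(:,:,:,:)

  ! Hypothetical distribution of 7 k-points over 3 nodes: the master gets one more.
  num_kpoints_on_node = [3, 2, 2]
  my_node_id = 1                     ! pretend we are a slave node

  ! Before (buggy): allocate(band_gradient(nbands, 3, num_kpoints_on_node(my_node_id), nspins))
  ! After (proposed): size by the master's k-point count on every node.
  allocate(band_gradient(nbands, 3, num_kpoints_on_node(0), nspins))

  print *, 'node', my_node_id, ': k-point dimension =', size(band_gradient, 3)
end program allocation_sketch
```

Since the master has at most one more k-point than any slave, the extra memory on a slave is one k-point's worth of band gradients per array.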

Issues

Presumably any other array that is read in the same way will suffer from the same problem.