libAtoms / QUIP

libAtoms/QUIP molecular dynamics framework: https://libatoms.github.io

gap_fit MPI Segmentation fault #636

Closed: MES-physics closed this issue 5 months ago

MES-physics commented 5 months ago

Dear developers, please tell me what the usual cause of this is. I get the same type of error with both mpirun and srun when trying to start gap_fit training. Last year I used 4 nodes with 64 tasks per node and it worked. I used the two-step process as before: sparsification first, then the MPI run (sketched below). The input file for the MPI run is attached. Thanks.
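For reference, the two-step workflow referred to above is roughly the following (a sketch only; the config file names are placeholders, and both configs hold essentially the same GAP parameters, differing in sparsify_only_no_fit):

```
# step 1: sparsification only, no fit (run serially)
gap_fit config_file=gap_fit_sparsify.cfg              # config sets sparsify_only_no_fit=T

# step 2: the MPI fit itself
mpirun -np 256 gap_fit config_file=gap_fit_mpi.cfg    # config sets sparsify_only_no_fit=F
```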

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Here is the output from my MPI run:

libAtoms::Hello World: 2024-04-08 14:53:24
libAtoms::Hello World: git version  https://github.com/libAtoms/QUIP.git,v0.9.14-9-gcaff15489-dirty
libAtoms::Hello World: QUIP_ARCH    linux_x86_64_gfortran_openmpi+openmp
libAtoms::Hello World: compiled on  Mar 27 2024 at 11:08:38
libAtoms::Hello World: MPI parallelisation with 256 processes
libAtoms::Hello World: OpenMP parallelisation with 8 threads
libAtoms::Hello World: OMP_STACKSIZE=1G
libAtoms::Hello World: MPI run with the same seed on each process
libAtoms::Hello World: Random Seed = 837779713
libAtoms::Hello World: global verbosity = 0

Calls to system_timer will do nothing by default
[gap_fit_mpi.txt](https://github.com/libAtoms/QUIP/files/14910528/gap_fit_mpi.txt)
albapa commented 5 months ago

@Sideboard has added some code that eliminates the need for the two-step process!

albapa commented 5 months ago

Which step fails? The sparsification or the fit?

MES-physics commented 5 months ago

The fit. Sparsification worked fine.

MES-physics commented 5 months ago

Yes I know about the change to one step, but haven't figured out how to use it yet.

albapa commented 5 months ago

> The fit. Sparsification worked fine.

The mistake must be in the command line; at the very least, gap_fit should print the command line back. Using the config_file mechanism is highly recommended.

> Yes I know about the change to one step, but haven't figured out how to use it yet.

As far as I know it's as simple as submitting a single MPI job.
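Something along these lines should be enough (a sketch; the config file simply holds the same key=value arguments you would otherwise pass on the command line, and the path, process count, and file name are placeholders):

```
mpirun -np 128 /path/to/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit config_file=my_gap_fit.cfg
```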

MES-physics commented 5 months ago

It does print the command line back. I'll look further.

MES-physics commented 5 months ago

Now I have tried the all-in-one MPI fit and got the same segmentation fault. Here is a sample from the *.err file, including the command-line feedback. I am using the exact command line I used last year, with only the data file changed, trying to run on 2 nodes. Thanks for any advice.

#0  0x14d56297bc1f in ???
#1  0x14d563cfedf7 in ???
#2  0xbb5509 in ???
#3  0xbb5044 in ???
#4  0xbb4ce5 in ???
#5  0xa9e0d8 in ???
#6  0x41aa80 in ???
#7  0x40a292 in ???
#8  0x409b1e in ???
#9  0x14d5625c7492 in ???
#10  0x409b5d in ???
#11  0xffffffffffffffff in ???
./gap_fitMPI.sh: line 63: 1568279 Segmentation fault      /home/QUIPMPI/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit
  atoms_filename=Cours3909_238.xyz at_file=Cours3909_238.xyz
  gap = {distance_2b n_sparse=15 theta_uniform=1.0 sparse_method=uniform covariance_type=ard_se cutoff=6.0 delta=2.0 :
    angle_3b n_sparse=200 theta_uniform=1.0 sparse_method=uniform covariance_type=ard_se cutoff=2.5 delta=0.05 :
    soap n_max=12 l_max=4 atom_sigma=0.5 zeta=4.0 cutoff=6.0 cutoff_transition_width=1.0 central_weight=1.0
      n_sparse=9000 delta=0.2 covariance_type=dot_product sparse_method=cur_points radial_decay=-0.5}
  default_sigma={0.001 0.01 0.05 0.0} default_kernel_regularisation={0.001 0.01 0.05 0.0}
  energy_parameter_name=energy force_parameter_name=forces virial_parameter_name=virial
  do_copy_at_file=F sparse_jitter=1.0e-8 sparsify_only_no_fit=F sparse_separate_file=T openmp_chunk_size=10000
  gp_file=Cours238.xml core_ip_args={IP Glue} core_param_file=r6_innercut.xml
  config_type_sigma={Liquid:0.050:0.5:0.5:0.0: Liquid_Interface:0.050:0.5:0.5:0.0: Amorphous_Bulk:0.005:0.2:0.2:0.0:
    Amorphous_Surfaces:0.005:0.2:0.2:0.0: Surfaces:0.002:0.1:0.2:0.0: Dimer:0.002:0.1:0.2:0.0:
    Fullerenes:0.002:0.1:0.2:0.0: Defects:0.001:0.01:0.05:0.0: Crystalline_Bulk:0.001:0.01:0.05:0.0:
    Nanotubes:0.001:0.01:0.05:0.0: Graphite:0.001:0.01:0.05:0.0: Diamond:0.001:0.01:0.05:0.0:
    Graphene:0.001:0.01:0.05:0.0: Graphite_Layer_Sep:0.001:0.01:0.05:0.0: Single_Atom:0.0001:0.001:0.05:0.0}
#0  0x1554e2f3ac1f in ???
#1  0x1554e42bddf7 in ???
#2  0xbb5509 in ???
#3  0xbb5044 in ???
#4  0xbb4ce5 in ???
#5  0xa9e0d8 in ???
#6  0x41aa80 in ???
#7  0x40a292 in ???
#8  0x409b1e in ???
#9  0x1554e2b86492 in ???
#10  0x409b5d in ???
#11  0xffffffffffffffff in ???
./gap_fitMPI.sh: line 63: 1568307 Segmentation fault      (core dumped) /home/QUIPMPI/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit
  atoms_filename=Cours3909_238.xyz at_file=Cours3909_238.xyz
  gap = {distance_2b n_sparse=15 theta_uniform=1.0 sparse_method=uniform covariance_type=ard_se cutoff=6.0 delta=2.0 :
    angle_3b n_sparse=200 theta_uniform=1.0 sparse_method=uniform covariance_type=ard_se cutoff=2.5 delta=0.05 :
    soap n_max=12 l_max=4 atom_sigma=0.5 zeta=4.0 cutoff=6.0 cutoff_transition_width=1.0 central_weight=1.0
      n_sparse=9000 delta=0.2 covariance_type=dot_product sparse_method=cur_points radial_decay=-0.5}
  default_sigma={0.001 0.01 0.05 0.0} default_kernel_regularisation={0.001 0.01 0.05 0.0}
  energy_parameter_name=energy force_parameter_name=forces virial_parameter_name=virial
  do_copy_at_file=F sparse_jitter=1.0e-8 sparsify_only_no_fit=F sparse_separate_file=T openmp_chunk_size=10000
  gp_file=Cours238.xml core_ip_args={IP Glue} core_param_file=r6_innercut.xml
  config_type_sigma={Liquid:0.050:0.5:0.5:0.0: Liquid_Interface:0.050:0.5:0.5:0.0: Amorphous_Bulk:0.005:0.2:0.2:0.0:
    Amorphous_Surfaces:0.005:0.2:0.2:0.0: Surfaces:0.002:0.1:0.2:0.0: Dimer:0.002:0.1:0.2:0.0:
    Fullerenes:0.002:0.1:0.2:0.0: Defects:0.001:0.01:0.05:0.0: Crystalline_Bulk:0.001:0.01:0.05:0.0:
    Nanotubes:0.001:0.01:0.05:0.0: Graphite:0.001:0.01:0.05:0.0: Diamond:0.001:0.01:0.05:0.0:
    Graphene:0.001:0.01:0.05:0.0: Graphite_Layer_Sep:0.001:0.01:0.05:0.0: Single_Atom:0.0001:0.001:0.05:0.0}

Here is the *.out file.

libAtoms::Hello World: 2024-04-09 22:25:43
libAtoms::Hello World: git version  https://github.com/libAtoms/QUIP.git,v0.9.14-9-gcaff15489-dirty
libAtoms::Hello World: QUIP_ARCH    linux_x86_64_gfortran_openmpi+openmp
libAtoms::Hello World: compiled on  Mar 27 2024 at 11:08:38
libAtoms::Hello World: MPI parallelisation with 128 processes
libAtoms::Hello World: OpenMP parallelisation with 64 threads
libAtoms::Hello World: OMP_STACKSIZE=1G
libAtoms::Hello World: MPI run with the same seed on each process
libAtoms::Hello World: Random Seed = 1745256961
libAtoms::Hello World: global verbosity = 0

Calls to system_timer will do nothing by default
albapa commented 5 months ago

The standard output does not show the command line being parsed; I still suspect a problem there, otherwise it would fail at a later stage.

MES-physics commented 5 months ago

OK, I have now gone back to square one, using Deringer's command line from the 2017 paper and putting it in a config_file. Attached are the config_file (changing only the input .xyz file and the .xml name), the error file, and the out file. The same type of segmentation fault occurred. My previous trial used a different .xyz file and command line but gave the same error. Please help!

Here is my run command:
mpirun -np 128 /home/myname/QUIPMPI/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit config_file=CDerConfig

CDerConfig.txt

CDerAmorphParam-1759026out.txt

CDerAmorphParam-1759026err.txt

bernstei commented 5 months ago

Does the crash generate a core file? If not, is it because your shell is limiting it, and can you turn that limit off? That way we may at least be able to figure out where it's crashing.
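If the shell limit is the problem, a quick way to check and lift it from inside the batch script is something like this (a sketch; where the core file ends up, and whether MPI-launched ranks inherit the limit, depends on your system):

```
ulimit -c             # show the current core file size limit (often 0)
ulimit -c unlimited   # allow core dumps for processes started from this shell
```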

albapa commented 5 months ago

The error file makes it look as if it is crashing in the ScaLAPACK initialisation. Could it be an incompatibility between the MPI and ScaLAPACK libraries?

MES-physics commented 5 months ago

I don't know how to turn the limit off; I guess I can look it up. And if it is an incompatibility between ScaLAPACK and MPI, how do I ask my admin to fix it? Thanks! I have these in the SLURM script:

ulimit -s unlimited
export PYTHONUNBUFFERED=TRUE
export OMP_STACKSIZE=1G
export OMP_DYNAMIC=false
export OMP_PROC_BIND=true
export OMP_NUM_THREADS=64

I don't see any other output files than the ones I posted above. Thanks!

bernstei commented 5 months ago

That ulimit call should do it, but it's possible that it isn't propagated to the executables MPI actually runs. But if @albapa is right about the ScaLAPACK init, it may well be a library compatibility problem. How did you compile?

MES-physics commented 5 months ago

I used the linux_x86_64_gfortran_openmpi+openmp architecture and these modules, which I also load to run:

module load gnu10 openmpi openblas
module load netcdf-c netcdf-fortran
module load scalapack

MES-physics commented 5 months ago

Also, in the make config step I did add netcdf-c support.

bernstei commented 5 months ago

Did you enable scalapack in "make config"? How does it know where to get your scalapack libraries? Did you add them to the math libraries when you ran "make config".

Can you upload your Makefile.inc?

Can you post the output of ldd path_to_your_gap_fit_executable?

MES-physics commented 5 months ago

Did you enable scalapack in "make config"? YES I'm sure I did that. Add scalapack libraries to math libraries? Not sure, maybe not, probably hit the default on that, and don't know how to do it.

ldd output (uh oh, I see some things not found, but I did put -lopenblas and netcdf into the make config questions, as my admin told me):

ldd /home/myname/QUIPMPI/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit
    linux-vdso.so.1 (0x00007ffdd2359000)
    libnetcdf.so.19 => not found
    libopenblas.so.0 => not found
    libmpi_usempif08.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_usempif08.so.40 (0x00007f65a2a7c000)
    libmpi_usempi_ignore_tkr.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007f65a286d000)
    libmpi_mpifh.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_mpifh.so.40 (0x00007f65a25ff000)
    libmpi.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi.so.40 (0x00007f65a22c7000)
    libgfortran.so.5 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgfortran.so.5 (0x00007f65a1e0f000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f65a1a8d000)
    libmvec.so.1 => /lib64/libmvec.so.1 (0x00007f65a1862000)
    libgomp.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgomp.so.1 (0x00007f65a1623000)
    libgcc_s.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgcc_s.so.1 (0x00007f65a140b000)
    libquadmath.so.0 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libquadmath.so.0 (0x00007f65a11c4000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f65a0fa4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f65a0bdf000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f65a2cbd000)
    libopen-rte.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libopen-rte.so.40 (0x00007f65a0925000)
    libopen-pal.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libopen-pal.so.40 (0x00007f65a0674000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f65a0470000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f65a0268000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f65a0064000)
    libz.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/zlib-1.2.11-2y/lib/libz.so.1 (0x00007f659fe4d000)
    libhwloc.so.15 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/hwloc-2.7.0-f7/lib/libhwloc.so.15 (0x00007f659fbf2000)
    libevent_core-2.1.so.6 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libevent-2.1.8-nd/lib/libevent_core-2.1.so.6 (0x00007f659f9bd000)
    libevent_pthreads-2.1.so.6 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libevent-2.1.8-nd/lib/libevent_pthreads-2.1.so.6 (0x00007f659f7ba000)
    libpciaccess.so.0 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libpciaccess-0.16-ro/lib/libpciaccess.so.0 (0x00007f659f5b1000)
    libxml2.so.2 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libxml2-2.9.12-lq/lib/libxml2.so.2 (0x00007f659f245000)
    libcrypto.so.1.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openssl-1.1.1m-ly/lib/libcrypto.so.1.1 (0x00007f659ed5c000)
    liblzma.so.5 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/xz-5.2.4-ok/lib/liblzma.so.5 (0x00007f659eb35000)
    libiconv.so.2 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libiconv-1.16-a7/lib/libiconv.so.2 (0x00007f659e839000)

Makefile.inc.txt

bernstei commented 5 months ago

There should be no way those things could be missing when it actually runs. Did you have all the same modules loaded when you ran the ldd as are loaded inside the running job?

scalapack must be linked statically, which is unfortunate, since it means you can't tell which one it's using from the ldd output.

Unfortunately, there's an infinite number of ways to set up mpi, scalapack, and lapack, and they need to be consistent.

If you run make clean and then make again to recreate the executable, saving all the output, there should be a link line for the gap_fit executable that includes all the libraries. You should consult whoever set up scalapack and the environment modules; they should be able to look at that link line (and possibly the value of $LD_LIBRARY_PATH when you run make) and tell you whether it's correct.
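For example, something along these lines would capture it (a sketch; build.log is just a placeholder name):

```
make clean
make > build.log 2>&1            # keep the full build output
grep gap_fit build.log | tail    # the last matches should include the gap_fit link line
echo $LD_LIBRARY_PATH            # record the library search path in effect during the build
```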

MES-physics commented 5 months ago

OK, sorry, here is the ldd output after loading the modules. I will follow the rest of your advice, thank you so much!

ldd /home/myname/QUIPMPI/QUIP/build/linux_x86_64_gfortran_openmpi+openmp/gap_fit
    linux-vdso.so.1 (0x00007f8086727000)
    libnetcdf.so.19 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0-openmpi-4.1.2/netcdf-c-4.8.1-l3/lib/libnetcdf.so.19 (0x00007f8086132000)
    libopenblas.so.0 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openblas-0.3.20-iq/lib/libopenblas.so.0 (0x00007f8083899000)
    libmpi_usempif08.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_usempif08.so.40 (0x00007f8083658000)
    libmpi_usempi_ignore_tkr.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_usempi_ignore_tkr.so.40 (0x00007f8083449000)
    libmpi_mpifh.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_mpifh.so.40 (0x00007f80831db000)
    libmpi.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi.so.40 (0x00007f8082ea3000)
    libgfortran.so.5 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgfortran.so.5 (0x00007f80829eb000)
    libm.so.6 => /lib64/libm.so.6 (0x00007f8082669000)
    libmvec.so.1 => /lib64/libmvec.so.1 (0x00007f808243e000)
    libgomp.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgomp.so.1 (0x00007f80821ff000)
    libgcc_s.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libgcc_s.so.1 (0x00007f8081fe7000)
    libquadmath.so.0 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libquadmath.so.0 (0x00007f8081da0000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f8081b80000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f80817bb000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f80864fb000)
    libpnetcdf.so.4 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0-openmpi-4.1.2/parallel-netcdf-1.12.1-t6/lib/libpnetcdf.so.4 (0x00007f8081001000)
    libhdf5_hl.so.200 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0-openmpi-4.1.2/hdf5-1.12.1-z2/lib/libhdf5_hl.so.200 (0x00007f8080de0000)
    libhdf5.so.200 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0-openmpi-4.1.2/hdf5-1.12.1-z2/lib/libhdf5.so.200 (0x00007f80807cd000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f80805c9000)
    libz.so.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/zlib-1.2.11-2y/lib/libz.so.1 (0x00007f80803b2000)
    libcurl.so.4 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/curl-7.80.0-rm/lib/libcurl.so.4 (0x00007f8080121000)
    libopen-rte.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libopen-rte.so.40 (0x00007f807fe67000)
    libopen-pal.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libopen-pal.so.40 (0x00007f807fbb6000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f807f9ae000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f807f7aa000)
    libhwloc.so.15 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/hwloc-2.7.0-f7/lib/libhwloc.so.15 (0x00007f807f54f000)
    libevent_core-2.1.so.6 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libevent-2.1.8-nd/lib/libevent_core-2.1.so.6 (0x00007f807f31a000)
    libevent_pthreads-2.1.so.6 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libevent-2.1.8-nd/lib/libevent_pthreads-2.1.so.6 (0x00007f807f117000)
    libmpi_cxx.so.40 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openmpi-4.1.2-4a/lib/libmpi_cxx.so.40 (0x00007f807eefb000)
    libstdc++.so.6 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-9.3.0/gcc-10.3.0-ya/lib64/libstdc++.so.6 (0x00007f807eb27000)
    libssl.so.1.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openssl-1.1.1m-ly/lib/libssl.so.1.1 (0x00007f807e893000)
    libcrypto.so.1.1 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/openssl-1.1.1m-ly/lib/libcrypto.so.1.1 (0x00007f807e3aa000)
    libpciaccess.so.0 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libpciaccess-0.16-ro/lib/libpciaccess.so.0 (0x00007f807e1a1000)
    libxml2.so.2 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libxml2-2.9.12-lq/lib/libxml2.so.2 (0x00007f807de35000)
    liblzma.so.5 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/xz-5.2.4-ok/lib/liblzma.so.5 (0x00007f807dc0e000)
    libiconv.so.2 => /opt/sw/spack/apps/linux-rhel8-x86_64_v2/gcc-10.3.0/libiconv-1.16-a7/lib/libiconv.so.2 (0x00007f807d912000)

MES-physics commented 5 months ago

OK, thanks very much. We reinstalled with oneAPI and MKL instead. It turned out that the ScaLAPACK modules we have do not work on AMD nodes. Fitting a potential works now.
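For anyone who lands here with a similar mismatch: with MKL, ScaLAPACK and BLACS come from MKL itself, so the math-library link options given to make config end up looking roughly like the line below (a sketch based on the standard MKL link-line advisor output for LP64 with Intel MPI; adjust it for your compiler, MPI, and threading model):

```
-lmkl_scalapack_lp64 -lmkl_blacs_intelmpi_lp64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 -lpthread -lm -ldl
```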