Closed — zhf-0 closed this issue 2 years ago
It appears to me that you are just running out of memory (your system stack or physical memory). With the crashing setting you are factorizing a kernel matrix of dimension up to 60*120 = 7200. What kind of machine do you have? How many cores do you physically have? How much memory does your system have? Also keep in mind the following:
BTW, -nrun is the total number of samples per task; they are not all randomly generated. By default, half of the nrun samples are generated sequentially by searching the acquisition function.
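As a rough back-of-envelope (a sketch only; it ignores LCM-specific workspace and per-process buffers, so treat it as a lower bound):

```python
# Rough lower bound on the memory needed for the dense covariance matrix
# that gets factorized; workspace and extra buffers are not counted.
nrun, ntask = 60, 120                  # samples per task, number of tasks
n = nrun * ntask                       # kernel matrix dimension (7200 here)
bytes_per_entry = 8                    # double precision
gib = n * n * bytes_per_entry / 2**30
print(f"{n}x{n} kernel matrix ~ {gib:.2f} GiB per copy")
# Several copies may be live at once (e.g. one per model restart / spawned
# process), so multiply by the number of concurrent restarts.
```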
I am using a PC to train the model: an AMD 3900X (12 cores, 24 threads) with 48 GB of RAM. Thank you for your suggestions; I am considering using a cluster to re-train the model.
Great. In case you don't know, the newer gptune interfaces are available in the latest gptune commit. You can look at the readme page, or the user guide for more information: https://github.com/gptune/GPTune/blob/master/Doc/GPTune_UsersGuide.pdf
Thank you for the notification. I was busy transferring my data and re-building the gptune lib on the cluster, so I only just saw this message. Since the interfaces have changed, is it OK to run my old Python scripts with the new GPTune?
BTW, building gptune from scratch is not an easy task, especially on a cluster that cannot access the internet: the required gcc version is high and the dependencies are painful to deal with. Maybe you could consider packaging all the Python packages and C/C++ libraries with conda. conda is a package management tool that can create virtual environments just like virtualenv in Python, but such an environment can also manage C/C++ libraries and binaries. Those features are very suitable for a package like gptune that mixes Python and C/C++. Compared to docker, conda is very lightweight to run and control. With the help of conda, gptune could provide out-of-the-box functionality and be installed with a single command such as conda install gptune.
Yes, you need to make a few minor changes to your old Python script. Specifically, assuming your old script is located at DIR, then in DIR you need to create a folder .gptune containing a file meta.json that defines the application name and the machine/software information; you can copy one from the many example folders GPTune provides and modify it for your purpose. You also need to create a folder gptune_db (an empty folder should be good enough; it will be used to store your function samples). Then in your Python script you can call (machine, processor, nodes, cores) = GetMachineConfiguration() to read the .gptune/meta.json file.
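For example, a minimal sketch of the top of the modified script (the meta.json fields themselves should be copied from one of the GPTune example folders rather than invented here):

```python
# Minimal sketch, assuming the newer GPTune interface described above.
# Expected layout next to this script:
#   DIR/.gptune/meta.json   -- application name + machine/software info
#                              (copy one from a GPTune example folder)
#   DIR/gptune_db/          -- empty folder; function samples are stored here
from gptune import GetMachineConfiguration

# Reads .gptune/meta.json and returns the machine description.
(machine, processor, nodes, cores) = GetMachineConfiguration()
print(f"machine={machine} processor={processor} nodes={nodes} cores={cores}")
```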
The suggestion about installation is very helpful. We are aware of the difficulty of installing GPTune with all dependencies correctly. Moving forward, we are thinking of using something like CK or Spack for easier installation. conda is also an option; we will keep you posted on this.
Sorry to bother you, but I encountered some problems when using GPTune on the cluster. To keep the code version the same between my PC and the cluster, I installed the old version of GPTune on the cluster. I installed openBLAS-0.3.10, openmpi-4.0.4 and scalapack-2.1.0 following the script config_cleanlinux.sh from the current version, but installed the C part of GPTune following the old manual. The whole compilation process is fine. When I test the python demo.py command, the error messages are
Traceback (most recent call last):
File "demo.py", line 32, in <module>
from gptune import GPTune
File "/public/home/li20N04/testsoft/GPTune/GPTune/gptune.py", line 29, in <module>
from model import *
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 182, in <module>
from lcm import LCM
File "/public/home/li20N04/testsoft/GPTune/GPTune/lcm.py", line 34, in <module>
cliblcm = ctypes.cdll.LoadLibrary(ROOTDIR + '/lib_gptuneclcm.so')
File "/public/home/li20N04/software/miniconda/install/envs/work/lib/python3.8/ctypes/__init__.py", line 459, in LoadLibrary
return self._dlltype(name)
File "/public/home/li20N04/software/miniconda/install/envs/work/lib/python3.8/ctypes/__init__.py", line 381, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libmpi_usempi.so.40: cannot open shared object file: No such file or directory
The libraries in $path/to/openmpi/lib are listed below (I have already added the openmpi lib path to LD_LIBRARY_PATH):
libmpi.a libmpi_usempif08.so libompitrace.so.40 mpi_ext.mod
libmpi.la libmpi_usempif08.so.40 libompitrace.so.40.20.0 mpi_f08_callbacks.mod
libmpi_mpifh.a libmpi_usempif08.so.40.21.1 libopen-pal.a mpi_f08_ext.mod
libmpi_mpifh.la libmpi_usempi_ignore_tkr.a libopen-pal.la mpi_f08_interfaces_callbacks.mod
libmpi_mpifh.so libmpi_usempi_ignore_tkr.la libopen-pal.so mpi_f08_interfaces.mod
libmpi_mpifh.so.40 libmpi_usempi_ignore_tkr.so libopen-pal.so.40 mpi_f08.mod
libmpi_mpifh.so.40.20.2 libmpi_usempi_ignore_tkr.so.40 libopen-pal.so.40.20.4 mpi_f08_types.mod
libmpi.so libmpi_usempi_ignore_tkr.so.40.20.0 libopen-rte.a mpi.mod
libmpi.so.40 libmpi_usempi.so.40 libopen-rte.la openmpi
libmpi.so.40.20.4 libompitrace.a libopen-rte.so pkgconfig
libmpi_usempif08.a libompitrace.la libopen-rte.so.40 pmpi_f08_interfaces.mod
libmpi_usempif08.la libompitrace.so libopen-rte.so.40.20.4
There is no libmpi_usempi.so.40 at all. I thought it might be a soft link to libmpi_usempi_ignore_tkr.so.40.20.0, just like libmpi_usempi_ignore_tkr.so and libmpi_usempi_ignore_tkr.so.40, which are soft links to the same file. So I used the command
ln -s libmpi_usempi_ignore_tkr.so.40.20.0 libmpi_usempi.so.40
to create libmpi_usempi.so.40.
Running python demo.py again, the error messages are
[c3858:220351] *** An error occurred in MPI_Init_thread
[c3858:220351] *** reported by process [1025966082,0]
[c3858:220351] *** on a NULL communicator
[c3858:220351] *** Unknown error
[c3858:220351] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c3858:220351] *** and potentially your MPI job)
Traceback (most recent call last):
File "demo.py", line 244, in <module>
main()
File "demo.py", line 205, in main
(data, modeler, stats) = gt.MLA(NS=NS, Igiven=giventask, NI=NI, NS1=int(NS/2))
File "/public/home/li20N04/testsoft/GPTune/GPTune/gptune.py", line 224, in MLA
modelers[o].train(data = tmpdata, **kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 188, in train
self.train_mpi(data, i_am_manager = True, restart_iters=list(range(kwargs['model_restarts'])), **kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 235, in train_mpi
res = list(map(fun, restart_iters))
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 234, in fun
return kern.train_kernel(X = data.P, Y = data.O, computer = self.computer, kwargs = kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/lcm.py", line 201, in train_kernel
mpi_comm = computer.spawn(__file__, nproc=mpi_size, nthreads=kwargs['model_threads'], npernode=npernode, kwargs = kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/computer.py", line 198, in spawn
comm = MPI.COMM_SELF.Spawn(sys.executable, args=executable, maxprocs=nproc,info=info)#, info=mpi_info).Merge()# process_rank = comm.Get_rank()
File "mpi4py/MPI/Comm.pyx", line 1931, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
Is my first step, creating a soft link, wrong?
I don't think you need to create a soft link; something else is wrong there. To quickly double check, what does ldd lib_gptuneclcm.so give you? From the results, do those libmpi* libraries resolve to $path/to/openmpi/lib? If yes, then your build is fine and you can look at runtime issues instead. One more thing to make sure: does "which mpirun" also show the correct MPI version? You might just be using a wrong MPI version to run the test.
Thank you for your suggestions. I recompiled openmpi, scalapack and GPTune, and checked the output of ldd *.so to make sure all shared libraries with the libmpi prefix point to the right locations. which mpirun in the PBS script returns the right path. But when I run python demo.py, the error is the same:
[c3858:121105] *** An error occurred in MPI_Init_thread
[c3858:121105] *** reported by process [3110010882,0]
[c3858:121105] *** on a NULL communicator
[c3858:121105] *** Unknown error
[c3858:121105] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[c3858:121105] *** and potentially your MPI job)
Traceback (most recent call last):
File "demo.py", line 244, in <module>
main()
File "demo.py", line 205, in main
(data, modeler, stats) = gt.MLA(NS=NS, Igiven=giventask, NI=NI, NS1=int(NS/2))
File "/public/home/li20N04/testsoft/GPTune/GPTune/gptune.py", line 224, in MLA
modelers[o].train(data = tmpdata, **kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 188, in train
self.train_mpi(data, i_am_manager = True, restart_iters=list(range(kwargs['model_restarts'])), **kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 235, in train_mpi
res = list(map(fun, restart_iters))
File "/public/home/li20N04/testsoft/GPTune/GPTune/model.py", line 234, in fun
return kern.train_kernel(X = data.P, Y = data.O, computer = self.computer, kwargs = kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/lcm.py", line 201, in train_kernel
mpi_comm = computer.spawn(__file__, nproc=mpi_size, nthreads=kwargs['model_threads'], npernode=npernode, kwargs = kwargs)
File "/public/home/li20N04/testsoft/GPTune/GPTune/computer.py", line 198, in spawn
comm = MPI.COMM_SELF.Spawn(sys.executable, args=executable, maxprocs=nproc,info=info)#, info=mpi_info).Merge()# process_rank = comm.Get_rank()
File "mpi4py/MPI/Comm.pyx", line 1931, in mpi4py.MPI.Intracomm.Spawn
mpi4py.MPI.Exception: MPI_ERR_OTHER: known error not in list
Another thing to make sure is that your mpi4py is also using the correct MPI version. You can try ldd $GPTuneROOT/mpi4py/build/lib.linux-x86_64-3.7/mpi4py/MPI.cpython-37m-x86_64-linux-gnu.so to see if it's linked against the correct MPI lib.
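Another quick runtime check, using only standard mpi4py calls (nothing GPTune-specific), is to ask mpi4py which MPI library it actually loads:

```python
# Run this with the same python you use for demo.py; the reported library
# string should mention Open MPI v4.0.4 if the right MPI is being picked up.
from mpi4py import MPI

print(MPI.Get_library_version())  # e.g. "Open MPI v4.0.4, ..."
print(MPI.Get_version())          # MPI standard version, e.g. (3, 1)
```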
If yes, it seems that your build is OK. Still, at runtime it shouldn't be looking for libmpi_usempi.so.40, as I believe that comes from a different MPI version than 4.0.4. There are a few more things to make sure of at runtime. In particular, make sure the openmpi and mpi4py paths precede the existing ones, so that the correct ones are picked up first:
export LD_LIBRARY_PATH=$path/to/openmpi/lib:$LD_LIBRARY_PATH
export PATH=$path/to/openmpi/bin:$PATH
export PYTHONPATH=your_python_path/lib/python3.7/site-packages:$PYTHONPATH
export PYTHONPATH=$GPTUNEROOT/mpi4py/:$PYTHONPATH
If this still doesn't work, I would suggest just using our docker image. See https://gptune.lbl.gov/documentation/gptune-tutorial-ecp2021/ page 71 for the instructions. If your application uses only one compute node, you can copy your application into the docker image with "docker cp" and proceed from there.
Sorry to bother you again. I have upgraded the RAM from 48 GB to 128 GB on my PC, and I also installed the single-node GPTune on the cluster successfully with the help of docker. Following your suggestions, I ran the command
python MLA_loaddata.py -nodes 1 -cores 8 -nrun 60
# or in the cluster
mpirun -n 1 python MLA_loaddata.py -nodes 1 -cores 8 -nrun 60
Nevertheless, the error messages are the same:
[desktop:1824777] *** Process received signal ***
[desktop:1824777] Signal: Segmentation fault (11)
[desktop:1824777] Signal code: Address not mapped (1)
[desktop:1824777] Failing at address: (nil)
...
Both errors (PC and cluster) happened at the same position: the program had just finished the 20th MLA iteration and failed at the start of the 21st iteration.
I checked the memory usage during execution; 128 GB of RAM should be enough, and each node in the cluster has 256 GB.
BTW, does GPTune support online learning? Theoretically, GPTune is based on Gaussian processes, and a Gaussian process follows from Bayesian statistics, so it should support online learning naturally. So I loaded the pkl file from the trained model, trying to improve the model's generality with a new training set. After modifying the MLA_loaddata.py file, I realized it's not as easy as I thought. Is there a solution for this?
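To illustrate what I mean by online learning, here is a toy sketch in plain numpy (not GPTune's API): conceptually, adding new samples should just mean conditioning the same GP on the enlarged data set.

```python
# Toy illustration only: a fixed RBF kernel and plain numpy, NOT GPTune's API.
# "Online" updating here is nothing more than re-conditioning on old + new data.
import numpy as np

def rbf(A, B, ls=0.3):
    # Squared-exponential kernel between row vectors of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def gp_predict(X, y, Xs, noise=1e-6):
    # Posterior mean of a zero-mean GP at test points Xs.
    K = rbf(X, X) + noise * np.eye(len(X))
    alpha = np.linalg.solve(K, y)
    return rbf(Xs, X) @ alpha

rng = np.random.default_rng(0)
X_old = rng.uniform(size=(20, 1)); y_old = np.sin(6 * X_old[:, 0])
X_new = rng.uniform(size=(10, 1)); y_new = np.sin(6 * X_new[:, 0])
Xs = np.linspace(0, 1, 5)[:, None]

mean_old = gp_predict(X_old, y_old, Xs)
# "Online" update: stack the new samples and condition again.
mean_all = gp_predict(np.vstack([X_old, X_new]), np.hstack([y_old, y_new]), Xs)
print(mean_old, mean_all, sep="\n")
```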
Is it possible for you to share some files so that I can reproduce this error? It does sound like some sort of bug rather than a memory crash. You can send them to my email liuyangzhuan@lbl.gov if not here.
GPTune supports online learning naturally, but I don't know how you modified your script. I will have to see it to understand what you are trying to do.
I'm not sure why you are using the older version of GPTune; I can help you set up the problem using the latest version as well.
The error doesn't happen when the training set and nrun are small. Now my training set includes 93 matrices and the average file size is 200 MB. When -nrun = 40, everything is fine and it takes 45 hours to train, so it is not easy to reproduce the error. Following your suggestions, I have two ideas:
1. adjust the training set and nrun to reproduce the error with less effort;
2. as for the online learning part, I will send the script to you after adding comments and deleting unrelated code.
Hi, when I was using gptune to train a model with 120 tasks, the program received a signal and exited. The command used to run the program is
The error message is
I tried to change the number of cores from 8 to 12, but the error message is the same. However, when I change -nrun 60 to -nrun 40, which means each task will sample 40 times randomly, the program works fine. It is very hard to debug since the error only occurs after running for a long time. The libraries I used to build gptune are
Thank you very much.