Sougata18 opened this issue 1 year ago:

I am trying a Veros run with MPI + JAX. While the model run is going on, no output files are being generated and the progress is not showing in the stdout file. Is the model run stuck?
Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here? Also, could you post the output of pip freeze?
> Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?
I gave the run again. It's now working and generating output files as well.

Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? Because when I am not using MPI, it shows the model progress as well as the expected time of completion.
> Also, could you post the output of pip freeze?
backports.ssl-match-hostname==3.5.0.1 blivet==0.61.15.72 Brlapi==0.6.0 chardet==2.2.1 configobj==4.7.2 configshell-fb==1.1.23 coverage==3.6b3 cupshelpers==1.0 decorator==3.4.0 di==0.3 dnspython==1.12.0 enum34==1.0.4 ethtool==0.8 fail2ban==0.11.2 firstboot==19.5 fros==1.0 futures==3.1.1 gssapi==1.2.0 idna==2.4 iniparse==0.4 initial-setup==0.3.9.43 ipaddress==1.0.16 IPy==0.75 javapackages==1.0.0 kitchen==1.1.1 kmod==0.1 langtable==0.0.31 lxml==3.2.1 mysql-connector-python==1.1.6 netaddr==0.7.5 netifaces==0.10.4 nose==1.3.7 ntplib==0.3.2 numpy==1.16.6 ofed-le-utils==1.0.3 pandas==0.24.2 perf==0.1 policycoreutils-default-encoding==0.1 pyasn1==0.1.9 pyasn1-modules==0.0.8 pycups==1.9.63 pycurl==7.19.0 pygobject==3.22.0 pygpgme==0.3 pygraphviz==1.6.dev0 pyinotify==0.9.4 pykickstart==1.99.66.19 pyliblzma==0.5.3 pyparsing==1.5.6 pyparted==3.9 pysmbc==1.0.13 python-augeas==0.5.0 python-dateutil==2.8.1 python-ldap==2.4.15 python-linux-procfs==0.4.9 python-meh==0.25.2 python-nss==0.16.0 python-yubico==1.2.3 pytoml==0.1.14 pytz==2016.10 pyudev==0.15 pyusb==1.0.0b1 pyxattr==0.5.1 PyYAML==3.10 qrcode==5.0.1 registries==0.1 requests==2.6.0 rtslib-fb==2.1.63 schedutils==0.4 scikit-learn==0.20.4 scipy==1.2.3 seobject==0.1 sepolicy==1.1 setroubleshoot==1.1 six==1.9.0 sklearn==0.0 slip==0.4.0 slip.dbus==0.4.0 SSSDConfig==1.16.2 subprocess32==3.2.6 targetcli-fb===2.1.fb46 torch==1.4.0 urlgrabber==3.10 urllib3==1.10.2 urwid==1.1.1 yum-langpacks==0.4.2 yum-metadata-parser==1.1.4
What does this line mean: "export OMP_NUM_THREADS=1"? When I set the number to 8, it shows an MPI error.
> Another question I had: how do I get the expected time of completion or the progress of the model run while using MPI? Because when I am not using MPI, it shows the model progress as well as the expected time of completion.
Veros should print progress updates even when using MPI. They may just be drowned out by the trace output. You can switch back to normal verbosity if things are working now.
> What does this line mean: "export OMP_NUM_THREADS=1"? When I set the number to 8, it shows an MPI error.
This sets the number of threads used by some packages we rely on (like the SciPy solvers). Since you are using MPI for multiprocessing you shouldn't use more than 1 thread per processor.
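For illustration, here is a minimal sketch of how that setting interacts with MPI ranks (this assumes mpi4py is importable in the run environment; the variable has to be set before NumPy/SciPy load their threaded backends, which is why it is normally exported in the job script):

```python
import os

# Must happen before NumPy/SciPy initialize their threaded backends; in practice
# this is done with "export OMP_NUM_THREADS=1" in the job script.
os.environ.setdefault("OMP_NUM_THREADS", "1")

from mpi4py import MPI

comm = MPI.COMM_WORLD
if comm.Get_rank() == 0:
    # With one MPI rank per core, more than 1 thread per rank oversubscribes the
    # cores, which commonly shows up as MPI slowdowns or errors.
    print(f"{comm.Get_size()} MPI ranks x "
          f"{os.environ['OMP_NUM_THREADS']} OpenMP thread(s) per rank")
```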
Thanks!
I had one doubt regarding the model run status:

Current iteration: 3706 (0.71/1.00y | 42.9% | 4.78h/(model year) | 1.4h left)

What does "4.78h/(model year)" mean?
Every simulated year takes 4.78 hours of real time.
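As a quick sanity check with the numbers from the progress line above, the remaining wall time follows directly from that rate:

```python
# Values taken from the progress line above.
hours_per_model_year = 4.78
completed_years = 0.71
total_years = 1.00

remaining_hours = (total_years - completed_years) * hours_per_model_year
print(f"~{remaining_hours:.1f}h left")  # ~1.4h, matching the reported estimate
```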
Here I'm trying to use 256 cores (16 nodes and 16 tasks per node), but an error keeps occurring: ValueError: processes do not divide domain evenly in x-direction
360 (number of grid cells) isn't divisible by 16 (number of processors).
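To illustrate, here is a small sketch with a hypothetical helper (the zonal size of 360 comes from your error; the meridional size of 160 and the alternative decomposition are made-up examples):

```python
def check_decomposition(nx, ny, npx, npy):
    """Raise if npx x npy processes do not evenly divide an nx x ny grid."""
    if nx % npx != 0:
        raise ValueError(f"{npx} processes do not divide {nx} cells in x-direction")
    if ny % npy != 0:
        raise ValueError(f"{npy} processes do not divide {ny} cells in y-direction")
    return nx // npx, ny // npy

# 360 zonal cells cannot be split over 16 processes (360 % 16 == 8), but e.g.
# 24 x 10 processes (240 cores total) gives 15 x 16 cells per rank.
print(check_decomposition(360, 160, 24, 10))
```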
My run got stuck for some reason, but the restart file was there, containing around 1.5 years of data. When I started the run again, it should have continued from the point where the restart file was last written, right? But the model is running from the initial time, i.e., the 0th year.
I used this code for the model run:
With veros resubmit, you can only restart from completed runs. If frequent crashes are a problem, I would recommend shortening the length of a run to 1 year or so and scheduling more of them.
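A rough sketch of that advice, with made-up numbers (10 target model years, one model year per job, 360-day calendar assumed):

```python
# All numbers here are assumptions for illustration only.
SECONDS_PER_MODEL_YEAR = 360 * 86400  # 360-day model calendar assumed
target_model_years = 10

# One model year per resubmitted run: a crash costs at most one run's worth of
# work, and every completed run leaves a restart file for the next one.
num_runs = target_model_years
length_per_run = SECONDS_PER_MODEL_YEAR

print(f"{num_runs} chained runs of {length_per_run} s (~1 model year) each")
```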
This MPI error keeps occurring. Can you please check?
I can't debug this without further information. Please dump a logfile with --loglevel trace and ideally also export MPI4JAX_DEBUG=1.

I also suggest you get in contact with your cluster support about how MPI should be called; for example, whether --mpi=pmi2 is the correct flag.
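One thing that can help that conversation is to see which MPI library your Python stack is actually built against. A minimal sketch (assuming mpi4py, mpi4jax, and jax are importable in the run environment and expose the usual version attributes):

```python
# Print the versions of the MPI-related Python packages and the MPI library they
# are linked against; this helps cluster support decide how MPI should be
# launched (e.g. whether --mpi=pmi2 is appropriate).
import mpi4py
from mpi4py import MPI

print("mpi4py", mpi4py.__version__)
print("MPI library:", MPI.Get_library_version().strip())

import jax
import mpi4jax

print("jax", jax.__version__)
print("mpi4jax", mpi4jax.__version__)
```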
Sure! I'll ask the admin once.
Here are both the output and error files.
output_and_error.zip
Thanks, this is useful. Looks like this may be a problem on our end with mismatched MPI calls. I'll keep looking.
Actually, it looks to me like the MPI calls are correctly matched; it's just that one rank (r33) stops responding for some reason. Unfortunately, this will be almost impossible for me to debug. I suggest you talk to your cluster support about it. In the meantime, here are some things you could try as a workaround:
- Make sure all processes run on the same kind of node (it looks like some of your nodes are called cn and some are called gpu; mixing different architectures may be a cause here).

Hope that helps.
Thanks! I will try these and let you know.