team-ocean / veros

The versatile ocean simulator, in pure Python, powered by JAX.
https://veros.readthedocs.io
MIT License

No output files generating while using mpirun with JAX #477

Open Sougata18 opened 1 year ago

Sougata18 commented 1 year ago

I am trying a veros run with MPI + JAX. While the run is going on, no output files are being generated and no progress is shown in the stdout file (screenshot attached). Is the model run stuck?

dionhaefner commented 1 year ago

Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?
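
Something along these lines should work (the setup file name and process count are placeholders; adjust them to your job):

```bash
# same run command as before, just with trace-level logging, captured in a logfile
mpirun -np 16 veros run my_setup.py --loglevel trace > run_trace.log 2>&1
```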

dionhaefner commented 1 year ago

Also, could you post the output of pip freeze?

Sougata18 commented 1 year ago

> Yes, unfortunately it looks that way. Could you try to add --loglevel trace to see where it gets stuck, then post the output here?

I gave the run another try. It's now working and generating output files as well (screenshot attached).

Another question I had: how do I get the expected time of completion, or the progress of the model run, while using MPI? When I am not using MPI it shows the model progress as well as the expected time of completion.

Sougata18 commented 1 year ago

> Also, could you post the output of pip freeze?

backports.ssl-match-hostname==3.5.0.1 blivet==0.61.15.72 Brlapi==0.6.0 chardet==2.2.1 configobj==4.7.2 configshell-fb==1.1.23 coverage==3.6b3 cupshelpers==1.0 decorator==3.4.0 di==0.3 dnspython==1.12.0 enum34==1.0.4 ethtool==0.8 fail2ban==0.11.2 firstboot==19.5 fros==1.0 futures==3.1.1 gssapi==1.2.0 idna==2.4 iniparse==0.4 initial-setup==0.3.9.43 ipaddress==1.0.16 IPy==0.75 javapackages==1.0.0 kitchen==1.1.1 kmod==0.1 langtable==0.0.31 lxml==3.2.1 mysql-connector-python==1.1.6 netaddr==0.7.5 netifaces==0.10.4 nose==1.3.7 ntplib==0.3.2 numpy==1.16.6 ofed-le-utils==1.0.3 pandas==0.24.2 perf==0.1 policycoreutils-default-encoding==0.1 pyasn1==0.1.9 pyasn1-modules==0.0.8 pycups==1.9.63 pycurl==7.19.0 pygobject==3.22.0 pygpgme==0.3 pygraphviz==1.6.dev0 pyinotify==0.9.4 pykickstart==1.99.66.19 pyliblzma==0.5.3 pyparsing==1.5.6 pyparted==3.9 pysmbc==1.0.13 python-augeas==0.5.0 python-dateutil==2.8.1 python-ldap==2.4.15 python-linux-procfs==0.4.9 python-meh==0.25.2 python-nss==0.16.0 python-yubico==1.2.3 pytoml==0.1.14 pytz==2016.10 pyudev==0.15 pyusb==1.0.0b1 pyxattr==0.5.1 PyYAML==3.10 qrcode==5.0.1 registries==0.1 requests==2.6.0 rtslib-fb==2.1.63 schedutils==0.4 scikit-learn==0.20.4 scipy==1.2.3 seobject==0.1 sepolicy==1.1 setroubleshoot==1.1 six==1.9.0 sklearn==0.0 slip==0.4.0 slip.dbus==0.4.0 SSSDConfig==1.16.2 subprocess32==3.2.6 targetcli-fb===2.1.fb46 torch==1.4.0 urlgrabber==3.10 urllib3==1.10.2 urwid==1.1.1 yum-langpacks==0.4.2 yum-metadata-parser==1.1.4

Sougata18 commented 1 year ago

What does this line mean: "export OMP_NUM_THREADS=1"? When I set the number to 8, it shows some MPI error.

dionhaefner commented 1 year ago

> Another question I had: how do I get the expected time of completion, or the progress of the model run, while using MPI? When I am not using MPI it shows the model progress as well as the expected time of completion.

Veros should print progress updates even when using MPI. They may just be drowned out by the trace output. You can switch back to normal verbosity if things are working now.

> What does this line mean: "export OMP_NUM_THREADS=1"? When I set the number to 8, it shows some MPI error.

This sets the number of threads used by some packages we rely on (like the SciPy solvers). Since you are using MPI for multiprocessing you shouldn't use more than 1 thread per processor.
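
In a typical job script that would look something like this (just a sketch; adjust the process count and run command to what you are already using):

```bash
# one thread per MPI process, so the threaded libraries don't oversubscribe the cores
export OMP_NUM_THREADS=1

# MPI already launches one process per core, so all cores stay busy
mpirun -np 16 veros run my_setup.py
```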

Sougata18 commented 1 year ago

Thanks! I had one doubt regarding the model run status:
`Current iteration: 3706 (0.71/1.00y | 42.9% | 4.78h/(model year) | 1.4h left)`
What does "4.78h/(model year)" mean?

dionhaefner commented 1 year ago

Every simulated year takes 4.78 hours of real time.
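
That's also where the remaining-time estimate comes from: you have 1.00 − 0.71 = 0.29 model years left, and 0.29 × 4.78 h ≈ 1.4 h, which matches the "1.4h left" in the status line.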

Sougata18 commented 1 year ago

Here I'm trying to use 256 cores (16 nodes with 16 tasks per node; screenshot attached), but this error persists:

`ValueError: processes do not divide domain evenly in x-direction`

dionhaefner commented 1 year ago

360 (the number of grid cells in the x-direction) isn't divisible by 16 (the number of processes assigned along x).
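
You need a decomposition where the per-direction process counts divide the grid. As a sketch, assuming your grid has 360 cells in x and that you pass the process grid through `veros run`'s `-n` option (adjust the setup name and the y-count to your case):

```bash
# 240 processes arranged as 24 along x and 10 along y: 360 / 24 = 15 cells per process in x
mpirun -np 240 veros run my_setup.py -n 24 10
```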

Sougata18 commented 1 year ago

My run got stuck for some reason, but the restart file was there with around 1.5 years of data. When I launched the run again, it should have started from the point where the restart file was last written, right? But the model is running from the initial time, i.e. the 0th year (screenshot attached).

I used this command for the model run (screenshot attached).

dionhaefner commented 1 year ago

With veros resubmit, you can only restart from completed runs. If frequent crashes are a problem, I would recommend shortening each run to 1 year or so and scheduling more of them.
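
Roughly along these lines (the identifier, run length, and commands below are placeholders; double-check the exact flags against `veros resubmit --help`):

```bash
# chain 10 runs of about 1 model year each (31104000 s, assuming a 360-day model year);
# each run starts from the restart file written by the previous, completed run
veros resubmit -i my_run -n 10 -l 31104000 \
    -c "mpirun -np 240 veros run my_setup.py -n 24 10" \
    --callback "sbatch veros_batch.sh"
```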

Sougata18 commented 1 year ago

This MPI error persists (screenshot attached). Can you please check?

dionhaefner commented 1 year ago

I can't debug this without further information. Please dump a logfile with --loglevel trace and ideally also export MPI4JAX_DEBUG=1.

I also suggest you get in contact with your cluster support about how MPI should be called. For example, whether --mpi=pmi2 is the correct flag.
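
For reference, that could look roughly like this in your job script (only a sketch; adapt the launcher and the redirect to whatever your cluster expects):

```bash
# extra debug output from mpi4jax, plus trace-level logging from veros, captured in a logfile
export MPI4JAX_DEBUG=1
srun --mpi=pmi2 veros run my_setup.py --loglevel trace > run_trace.log 2>&1
```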

Sougata18 commented 1 year ago

Sure! I'll ask the admin once. Here's both the output and error files.
output_and_error.zip

dionhaefner commented 1 year ago

Thanks, this is useful. Looks like this may be a problem on our end with mismatched MPI calls. I'll keep looking.

dionhaefner commented 1 year ago

Actually, it looks to me like the MPI calls are correctly matched; it's just that one rank (r33) stops responding for some reason. Unfortunately, this will be almost impossible for me to debug. I suggest you talk to your cluster support about it. In the meantime, here are some things you could try as a workaround:

Hope that helps.

Sougata18 commented 1 year ago

Thanks! I will try these and let you know.