underworldcode / UWGeodynamics

UWGeodynamics 2.11 - passive tracers and checkpoint problem #252

Closed: totaibi closed this issue 2 years ago

totaibi commented 2 years ago

Hello UWGeodynamics team,

I have installed UWGeodynamics on our HPC with the latest version (2.11). Our code runs on earlier versions of UWGeodynamics; however, we get error messages with passive tracers and checkpoint_interval, as follows:

The command used for passive tracers:

#passive tracers

npoints = 2048 # This is the number of points used to define the surface

xx = np.linspace(GEO.nd(Model.minCoord[0]), GEO.nd(Model.maxCoord[0]), npoints)
yy = np.linspace(GEO.nd(Model.minCoord[1]), GEO.nd(Model.maxCoord[1]), npoints)
zz = 0.

xx,yy,zz = np.meshgrid(xx,yy,zz)
coords = np.ndarray((xx.size, 3))

coords[:,0] = xx.ravel()
coords[:,1] = yy.ravel()
coords[:,2] = zz.ravel()
surface_tracers = Model.add_passive_tracers(name="Surface", vertices=coords)

coords[:,2] = GEO.nd(-20.*u.km)
BD_tracers = Model.add_passive_tracers(name="BD", vertices=coords)

coords[:,2] = GEO.nd(-35.*u.km)
moho_tracers = Model.add_passive_tracers(name="Moho", vertices=coords)

coords[:,2] = GEO.nd(-90.*u.km)
litho_tracers = Model.add_passive_tracers(name="Litho", vertices=coords)

surface_tracers.add_tracked_field(Model.strainRateField,
                              name="surface_strainRate",
                              units=1.0/u.second, dataType="float")

BD_tracers.add_tracked_field(Model.strainRateField,
                              name="BD_strainRate",
                              units=1.0/u.second, dataType="float")

moho_tracers.add_tracked_field(Model.strainRateField,
                              name="moho_strainRate",
                              units=1.0/u.second, dataType="float")

litho_tracers.add_tracked_field(Model.strainRateField,
                              name="LM_strainRate",
                              units=1.0/u.second, dataType="float")

The corresponding error:

Traceback (most recent call last):
  File "Harrat.py", line 447, in <module>
    surface_tracers.add_tracked_field(Model.strainRateField,
AttributeError: 'NoneType' object has no attribute 'add_tracked_field'

The command used for checkpoint_interval:

Model.run_for(40.0 * u.megayear, checkpoint_interval=1. * u.megayear)

The corresponding error:

Traceback (most recent call last):
  File "Harrat.py", line 529, in <module>
    Model.run_for(40.0 * u.megayear, checkpoint_interval=1. * u.megayear)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 1613, in run_for
    checkpointer = _CheckpointFunction(
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2335, in __init__
    self.checkpoint_all()
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2415, in checkpoint_all
    self.checkpoint_swarms(variables, checkpointID, time, outputDir)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2566, in checkpoint_swarms
    string += _swarmvarschema(handle, field)
  File "/opt/venv/lib/python3.8/site-packages/underworld/utils/_utils.py", line 387, in _swarmvarschema
    h5f = h5py.File(name=varfilename, mode="r")
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 444, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 199, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 100, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

Do you have any suggestions?

Thanks, Thamer

bknight1 commented 2 years ago

Hi @totaibi,

I believe the issue is arising due to recent changes in how passive tracers are created and then accessed in UWGeodynamics.

Instead of:

surface_tracers = Model.add_passive_tracers(name="Surface", vertices=coords)

it should be:

Model.add_passive_tracers(name="Surface", vertices=coords)

Then to track different fields it should be:

Model.Surface_tracers.add_tracked_field(Model.pressureField,
                                       name="tracers_press",
                                       units=u.megapascal,
                                       dataType="float")

This is outlined in the newest documentation (https://github.com/underworldcode/UWGeodynamics/blob/485ad28cee2b9373a2c9673587e49a96c2e14150/docs/readthedocs/src/UserGuide.rst)

So your code will look like this:

Model.add_passive_tracers(name="Surface", vertices=coords)

coords[:,2] = GEO.nd(-20.*u.km)
Model.add_passive_tracers(name="BD", vertices=coords)

coords[:,2] = GEO.nd(-35.*u.km)
Model.add_passive_tracers(name="Moho", vertices=coords)

coords[:,2] = GEO.nd(-90.*u.km)
Model.add_passive_tracers(name="Litho", vertices=coords)

Model.Surface_tracers.add_tracked_field(Model.strainRateField,
                              name="surface_strainRate",
                              units=1.0/u.second, dataType="float")

Model.BD_tracers.add_tracked_field(Model.strainRateField,
                              name="BD_strainRate",
                              units=1.0/u.second, dataType="float")

Model.Moho_tracers.add_tracked_field(Model.strainRateField,
                              name="moho_strainRate",
                              units=1.0/u.second, dataType="float")

Model.Litho_tracers.add_tracked_field(Model.strainRateField,
                              name="LM_strainRate",
                              units=1.0/u.second, dataType="float")
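To illustrate why the original assignment fails, here is a minimal sketch. It assumes, as the traceback and the change described above suggest, that add_passive_tracers now returns None and attaches the tracers to the model as `<name>_tracers`. ToyModel and ToyTracers are hypothetical stand-ins written for this example; they are not UWGeodynamics classes.

```python
class ToyTracers:
    """Hypothetical stand-in for a passive tracer swarm."""

    def __init__(self, name):
        self.name = name
        self.tracked_fields = []

    def add_tracked_field(self, field, name, units=None, dataType="float"):
        # Only record the field name; a real swarm would allocate storage.
        self.tracked_fields.append(name)


class ToyModel:
    """Hypothetical stand-in for the Model object (not UWGeodynamics code)."""

    def add_passive_tracers(self, name, vertices=None):
        # Assumed behaviour: register the tracers as `<name>_tracers` ...
        setattr(self, name + "_tracers", ToyTracers(name))
        # ... and return nothing, so the caller receives None.


model = ToyModel()
returned = model.add_passive_tracers(name="Surface")

# Old pattern: the return value is None, so the method call fails.
try:
    returned.add_tracked_field(None, name="surface_strainRate")
except AttributeError as err:
    old_pattern_error = str(err)

# New pattern: reach the tracers through the attribute on the model.
model.Surface_tracers.add_tracked_field(None, name="surface_strainRate")
```

With this toy model, the old pattern raises the same kind of AttributeError on a NoneType object as in the report, while the attribute-based call succeeds.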

Let me know how you get on.

Cheers

totaibi commented 2 years ago

Hi @bknight1,

Thank you for your support. The passive tracers problem has been resolved based on your comment. I'm still getting the same error message for the checkpoint:

Traceback (most recent call last):
  File "Harrat.py", line 527, in <module>
    Model.run_for(40.0 * u.megayear, checkpoint_interval=1. * u.megayear)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 1613, in run_for
    checkpointer = _CheckpointFunction(
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2335, in __init__
    self.checkpoint_all()
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2416, in checkpoint_all
    self.checkpoint_tracers(tracers, checkpointID, time, outputDir)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2612, in checkpoint_tracers
    item.save(outputDir, checkpointID, time)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_utils.py", line 191, in save
    with h5py.File(name=swarm_fpath, mode="r") as h5f:
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 444, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 199, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 100, in h5py.h5f.open
OSError: Unable to open file (file signature not found)

Do you have any suggestions?

Thamer

bknight1 commented 2 years ago

Hi @totaibi

What version of UW and UWGeodynamics are you using? There might be an issue if you're using UW v2.11 and UWGeodynamics v2.10.x

totaibi commented 2 years ago

I installed the latest version of both packages a week ago via Docker.

bknight1 commented 2 years ago

To make sure, you can check using the following steps on the HPC system.

module load singularity
singularity shell uwgeodynamics_latest.sif  # change to your Singularity image name; this opens the container

python3  # this starts an interactive Python shell

import UWGeodynamics as GEO
GEO.__version__  # this prints the UWGeodynamics version
import underworld as UW
UW.__version__  # this prints the Underworld version

totaibi commented 2 years ago

Thanks @bknight1

The UWGeodynamics version: 2.11.0-dev-485ad28(master)

The UW version: 2.11.0b
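The two version strings reported here can be compared programmatically. The sketch below is only an illustration: the helper names are made up for this example, and treating a matching major.minor pair as "compatible" is an assumption based on the earlier remark about UW v2.11 with UWGeodynamics v2.10.x, not a documented rule.

```python
import re


def major_minor(version):
    """Return (major, minor) parsed from the start of a version string."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def same_release_series(uwgeo_version, uw_version):
    """Assumed criterion: both packages share the same major.minor series."""
    return major_minor(uwgeo_version) == major_minor(uw_version)


# Version strings as reported in this thread.
reported_match = same_release_series("2.11.0-dev-485ad28(master)", "2.11.0b")

# The mismatch that was suspected earlier in the thread.
suspected_mismatch = same_release_series("2.10.2", "2.11.0b")
```

Under that assumption, the reported pair falls in the same 2.11 series, so a version mismatch looks unlikely to be the cause.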

bknight1 commented 2 years ago

I am using the same version on my local machine through Docker, and it works fine for running 2D and 3D models and saves the passive tracers without problems. I'd recommend trying the script on a local machine using the same version from Docker to see whether it is an issue with Singularity. If the same issue occurs when running via Docker locally, then the only other cause I can think of is a duplicate variable name.

totaibi commented 2 years ago

Thanks again @bknight1 for your feedback and suggestion. I realized that the file is either corrupted or not in HDF5 format. Do you have any suggestions?

The loaded modules are singularity, python, and HDF5.
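One way to narrow this down without h5py is to check whether a checkpoint file begins with the 8-byte HDF5 format signature. The sketch below uses only the standard library; the function name and the demo file names are made up for this example. It inspects offset 0 only, so an HDF5 file written with a user block would be reported as invalid, and a file that passes can still be corrupted further in.

```python
import os
import tempfile

# The HDF5 format signature found at the start of a file without a user block.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path):
    """Return True if the file at `path` starts with the HDF5 signature."""
    with open(path, "rb") as handle:
        return handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE


# Demo on two throwaway files (neither is a complete HDF5 file).
with tempfile.TemporaryDirectory() as tmp:
    signed = os.path.join(tmp, "signed.h5")
    empty = os.path.join(tmp, "empty.h5")
    with open(signed, "wb") as handle:
        handle.write(HDF5_SIGNATURE + b"\x00" * 8)
    with open(empty, "wb") as handle:
        handle.write(b"")  # e.g. a file that was created but never written
    signed_ok = has_hdf5_signature(signed)
    empty_ok = has_hdf5_signature(empty)
```

Running the check on the file named in the failing h5py call, right after the error, would show whether the file is empty or partly written at the moment it is reopened for reading.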

bknight1 commented 2 years ago

I'm not sure why the file is becoming corrupted @totaibi.

Did you try it on a local machine to see if you could recreate the issue?

totaibi commented 2 years ago

Both UWGeodynamics and UW (the same versions as on the HPC) are working fine on my local machine.

Could we be missing one of the required modules?

One additional note: our HPC nodes mount a Lustre filesystem with the flock option.

bknight1 commented 2 years ago

Okay, sounds like a Singularity issue then. Unfortunately, I don't have much experience with Singularity, so I'll have to pass the issue on to @julesghub or @rbeucher.

totaibi commented 2 years ago

Thanks @bknight1

Here is the error message for the corrupted file:

Traceback (most recent call last):
  File "Harrat.py", line 526, in <module>
    Model.run_for(40.0*u.megayears, checkpoint_interval=1.0*u.megayears)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 1613, in run_for
    checkpointer = _CheckpointFunction(
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2335, in __init__
    self.checkpoint_all()
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2416, in checkpoint_all
    self.checkpoint_tracers(tracers, checkpointID, time, outputDir)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_model.py", line 2612, in checkpoint_tracers
    item.save(outputDir, checkpointID, time)
  File "/opt/venv/lib/python3.8/site-packages/UWGeodynamics/_utils.py", line 191, in save
    with h5py.File(name=swarm_fpath, mode="r") as h5f:
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 444, in __init__
    fid = make_fid(name, mode, userblock_size,
  File "/opt/venv/lib/python3.8/site-packages/h5py/_hl/files.py", line 199, in make_fid
    fid = h5f.open(name, flags, fapl=fapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5f.pyx", line 100, in h5py.h5f.open
OSError: Unable to open file (corrupt object header - incorrect # of messages)

totaibi commented 2 years ago

I want to follow up. The latest version was re-installed on our HPC today; unfortunately, we are still getting the same issue.

During the installation I got the following message, which I thought might guide us to the solution:

info unpack layer: sha256:c549ccf8d472c3bce9ce02e49c62b8f6cbc736ea2b8ba812a1ae9390c69d0b71
warn xattr{etc/gshadow} ignoring ENOTSUP on setxattr "user.rootlesscontainers"
warn xattr{/tmp/build-temp-473332333/rootfs/etc/gshadow} destination filesystem does not support xattrs, further warnings will be suppressed 

Any suggestions?

Thamer

bknight1 commented 2 years ago

I'm not sure what the error is, unfortunately.

I had a thought, though: where are you trying to store the data, i.e. what is the Model.outputDir directory? This may cause an issue if you can't access the path from within Docker.

julesghub commented 2 years ago

This appears to be an h5py issue when writing swarm variables to disk. Good suggestion, @bknight1, about checking whether the outputDir path is accessible.
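A quick way to test the output path from inside the container is a write-then-read probe. The sketch below is an illustration using only the standard library; probe_output_dir is a hypothetical helper, not part of UWGeodynamics. It runs in a single process, so it does not reproduce a parallel HDF5 write, and on a cluster it would need to be run on every node in the job.

```python
import os
import tempfile
import uuid


def probe_output_dir(path):
    """Write a small file into `path`, read it back, and report success."""
    probe = os.path.join(path, "probe-%s.tmp" % uuid.uuid4().hex)
    payload = b"checkpoint probe"
    try:
        os.makedirs(path, exist_ok=True)
        with open(probe, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to push the data to disk
        with open(probe, "rb") as handle:
            return handle.read() == payload
    except OSError:
        return False
    finally:
        # Clean up the probe file if it was created.
        if os.path.exists(probe):
            os.remove(probe)


# Demo against a temporary directory standing in for the output directory.
with tempfile.TemporaryDirectory() as tmp:
    writable = probe_output_dir(os.path.join(tmp, "outputs"))
```

If the probe succeeds on every node but the checkpoint still fails, the path itself is probably not the problem.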

totaibi commented 2 years ago

The path is accessible. We figured out that the problem may be caused by parallel computing, as the code runs fine on a single node. We are using srun at this stage and will keep you posted.

Thanks @bknight1 @julesghub for your continued support

julesghub commented 2 years ago

Hi @totaibi, any news on this issue?

totaibi commented 2 years ago

Hi @julesghub, thanks for following up. The problem is still not solved. The code works fine on a single node, but it fails across nodes or racks with the file signature problem reported earlier.

The HPC support team has no idea what is causing this issue.

totaibi commented 2 years ago

The problem could be with the current version of UWGeodynamics. How can I install an earlier version? I need the name (tag) that I can use with the docker pull command.

julesghub commented 2 years ago

Hi @totaibi, to get a previous version of the UWGeodynamics Docker (Singularity) image, use:

docker pull underworldcode/uwgeodynamics:v2.10.2

You can see all available images at this link: https://hub.docker.com/repository/registry-1.docker.io/underworldcode/uwgeodynamics/tags?page=1&ordering=last_updated&name=v2

I understand you're using Singularity on an HPC? Which HPC? Do you have a link? As previously mentioned, the Singularity + h5py combination seems to be the problem: a filesystem issue that I've not seen before. A likely workaround would be to install the code "bare metal" on the HPC. I can help out with that if you like, but Singularity is preferred.

totaibi commented 2 years ago

Hi @julesghub,

Thank you for following up.

I used an earlier version and got the same issue.

We are using Singularity on the SANAM HPC, which belongs to KACST (a Saudi research institution). Unfortunately, I could not find a link for it. The HPC is in its expansion stage, so I may raise another issue about setting up a bare-metal installation on it.

Thank you again and the rest of UWGeodynamics team

Regards