underworldcode / underworld2

underworld2: A parallel, particle-in-cell, finite element code for Geodynamics.
http://www.underworldcode.org/

Error in running high resolution 3D spherical models using regionalMesh branch in Raijin #408

Closed gthyagi closed 4 years ago

gthyagi commented 5 years ago

Hi @jmansour @julesghub , I have successfully run low resolution 3D spherical models on Raijin. The following image was created at (256, 256, 128) resolution in the (lon, lat, rad) directions. image

Increasing the resolution to (384, 512, 256), (512, 512, 256) or (768, 1024, 256) produces this error:

Error running Underworld - Signal 15 'SIGTERM' (Termination Request).
This is caused by an external call to terminate the code.
This could have happened by a queueing system (e.g. if the code has run longer than allowed),
the code might have been killed on another processor or it may have been killed by the user.

Note: I am loading the h5 files of the indexed swarm coordinates and material variable instead of creating them on the cluster.

Thanks.

julesghub commented 5 years ago

Not much to go on with that general message. Could you send the preceding part of the output? We need to determine at what stage this error occurred e.g. swarm load, solve.

gthyagi commented 5 years ago

@julesghub, I am still performing a few more tests with high resolution models and will provide detailed statistics soon. At present I have submitted a job at (384, 512, 256) resolution requesting 1024 CPUs and 2.0 TB memory. In this case the job exceeded the walltime (4 hrs). The (256, 256, 128) resolution model took 1 hr 14 min, so I made a rough guess that 4 hrs should be sufficient to run the (384, 512, 256) resolution model with 1024 CPUs and 2.0 TB memory.

gthyagi commented 5 years ago

Not much to go on with that general message. Could you send the preceding part of the output? We need to determine at what stage this error occurred e.g. swarm load, solve.

output file contains walltime exceeded message and above general error.

julesghub commented 5 years ago

Right, so it looks like the reload of the swarm is the culprit. It's too slow and is exceeding the walltime. The algorithm to find particles in the "irregular" mesh case is OK when you've generated the particles, but when you load particles in from another mesh it's slow. I suspect this is the big bottleneck for your large runs.

There are some tricks we could do - employ a kd-tree (not yet parallel safe) or transform the coordinates for the search into a regular space. I prefer the kd-tree as it's a general solution, but at present it's not available in parallel. @jmansour, thoughts?
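
For illustration only, here is a minimal sketch of the kd-tree idea (not Underworld's implementation): build a tree over element centroids and query it to shortlist candidate owning elements for each loaded particle. It assumes numpy/scipy, and element_centroids / particle_coords are hypothetical arrays.

import numpy as np
from scipy.spatial import cKDTree

def candidate_owning_elements(element_centroids, particle_coords, k=8):
    # Build a kd-tree over the element centroids (once per mesh), then return
    # the k nearest elements for each particle. A proper point-in-element test
    # is still needed on these candidates; this only narrows the search.
    tree = cKDTree(element_centroids)
    _, nearest = tree.query(particle_coords, k=k)
    return nearest  # shape (n_particles, k)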

gthyagi commented 5 years ago

@julesghub My high resolution model workflow:

  1. First I created and saved the swarm coordinates and material variable.
  2. Then I used a MATLAB script to index the swarm coordinates in serial.
  3. Finally, the swarm coordinate and material variable files are loaded onto the cluster to solve the Stokes system.

The suspected step where time is wasted enormously: loading the swarm coordinates and material variable.

Q: Is there a size limit on the uploaded h5 files of the swarm coordinates and material variable?
gthyagi commented 5 years ago

Right, so it looks like the reload of the swarm is the culprit. It's too slow and is exceeding the walltime. The algorithm to find particles in the "irregular" mesh case is OK when you've generated the particles, but when you load particles in from another mesh it's slow. I suspect this is the big bottleneck for your large runs.

There are some tricks we could do - employ a kd-tree (not yet parallel safe) or transform the coordinates for the search into a regular space. I prefer the kd-tree as it's a general solution, but at present it's not available in parallel. @jmansour, thoughts?

Yes. Some stats on the file sizes I am uploading:

  1. The (384, 512, 256) res model has a swarm size of 23 GB and a matVar size of 3.8 GB
  2. The (512, 512, 256) res model has a swarm size of 30 GB and a matVar size of 10 GB
  3. The (768, 1024, 256) res model has a swarm size of 90 GB and a matVar size of 30 GB
gthyagi commented 5 years ago

@julesghub My high resolution model workflow:

  1. First I created and saved the swarm coordinates and material variable.
  2. Then I used a MATLAB script to index the swarm coordinates in serial.
  3. Finally, the swarm coordinate and material variable files are loaded onto the cluster to solve the Stokes system.

The suspected step where time is wasted enormously: loading the swarm coordinates and material variable.

Q: Is there a size limit on the uploaded h5 files of the swarm coordinates and material variable?

When a file is loaded, is it distributed over all processors or loaded onto a single processor?

gthyagi commented 5 years ago

Right, so it looks like the reload of the swarm is the culprit. It's too slow and is exceeding the walltime. The algorithm to find particles in the "irregular" mesh case is OK when you've generated the particles, but when you load particles in from another mesh it's slow. I suspect this is the big bottleneck for your large runs. There are some tricks we could do - employ a kd-tree (not yet parallel safe) or transform the coordinates for the search into a regular space. I prefer the kd-tree as it's a general solution, but at present it's not available in parallel. @jmansour, thoughts?

Yes. Some stats on the file sizes I am uploading:

  1. The (384, 512, 256) res model has a swarm size of 23 GB and a matVar size of 3.8 GB
  2. The (512, 512, 256) res model has a swarm size of 30 GB and a matVar size of 10 GB
  3. The (768, 1024, 256) res model has a swarm size of 90 GB and a matVar size of 30 GB

Current working case stats: the (256, 256, 128) res model has a swarm size of 3.8 GB and a matVar size of 640 MB.

lmoresi commented 5 years ago

We can develop a parallel-safe kd-tree; it's pretty lightweight.


jmansour commented 5 years ago

So for a 320^3 element model with 40 particles per cell (1.3 billion particles), it takes around 1 minute to load the swarm (on Magnus). That mesh is regular, but more importantly, it is re-reading a checkpoint file that we created, and therefore leverages the ordering of particles for efficiency (each proc starts by reading the particles it expects to own).

Your job is larger, and the particle ordering cannot be considered when re-reading. This therefore reduces to a serial operation, with all processes needing to consider the entire swarm checkpoint file. This operation therefore does not scale, so I'm not surprised you're hitting the walltime limit.

A kd-tree etc. may help, but it will depend on whether this is an IO bottleneck or a CPU bottleneck (the owning-element search).

Can you remind me why you're defining your own particle coordinates to begin with, instead of using a layout object?

gthyagi commented 5 years ago

@jmansour I am using a layout. First I created and saved the swarm coordinates and material variable.

swarm           = uw.swarm.Swarm( mesh=mesh, particleEscape=True )
swarmLayout     = uw.swarm.layouts.PerCellSpaceFillerLayout( swarm=swarm, particlesPerCell=20 )
swarm.populate_using_layout( layout=swarmLayout )
swarm.save(outputPath + 'swarm_'+str(res)+'.h5')

The saved swarm coordinates are then indexed using a MATLAB script; I use MATLAB to create the geometries in 3D, e.g. indexing all particles in the oceanic lithosphere, the continental lithosphere, the UM or the LM. Once this is done I load the swarm and matVar onto Raijin to solve for Stokes.

swarm = uw.swarm.Swarm(mesh, particleEscape=True)
swarm.load('./swarmMatInd_sph/'+'swarm_'+str(res)+'.h5')
materialVariable       = swarm.add_variable("int", 1)
materialVariable.load('./swarmMatInd_sph/'+'matInd_'+str(res)+'.h5')
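
The MATLAB indexing step itself isn't shown in the thread; as a purely illustrative Python sketch of the same idea, one could assign an integer material index to each saved particle from its position. The radius bounds and index values below are made up, not the actual model geometry.

import numpy as np

def index_by_radius(coords, r_660=0.89, r_lith=0.98):
    # coords: (n, 3) particle coordinates read from the saved swarm h5 file.
    # Returns an (n,) int32 array of material indices (values are illustrative).
    r = np.linalg.norm(coords, axis=1)
    mat = np.zeros(len(r), dtype=np.int32)   # 0 = lower mantle
    mat[r >= r_660] = 1                      # 1 = upper mantle
    mat[r >= r_lith] = 2                     # 2 = lithosphere
    return mat
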
gthyagi commented 5 years ago

Now I will submit a job to check the loading time. Updates will be posted soon.

jmansour commented 5 years ago

So is this your process?

  1. Create a swarm checkpoint file via underworld.
  2. Use that within your MATLAB script to generate your matInd file.
  3. Use the swarm file created at step 1 (unmodified) and the material index created at step 2 to generate your runtime swarm.

If that's the case, really we don't even need to reload the swarm, as the programmatic swarm will be identical to the loaded one. Unfortunately we currently have a check which will prevent this, but perhaps an override isn't a bad idea. @julesghub ?

However, in this case, I'm also less certain as to why it's so slow. Can you confirm @gthyagi that the checkpointed swarm file is unchanged when you attempted to reload it?

gthyagi commented 5 years ago

@jmansour Yes, these are the unchanged swarm coords created in step 1 and the corresponding material indices created in step 2 using MATLAB. I am now using these swarm coords and material indices as the starting configuration in my models. Currently I am running instantaneous models (i.e., 2 time steps).

julesghub commented 5 years ago

If that's the case, really we don't even need to reload the swarm, as the programmatic swarm will be identical to the loaded one. Unfortunately we currently have a check which will prevent this, but perhaps an override isn't a bad idea. @julesghub ?

That would work as long as the mesh used in step 1 is the same mesh, i.e. the same resolution/size, as in step 3.

What's the check preventing this? An override sounds good.

jmansour commented 5 years ago

When we reload the swarm, we build an array which specifies how to index into the swarm variable data (i.e. the mapping from local index to file index). If we haven't reloaded the swarm, this doesn't exist.

However in @gthyagi's case, it will be a direct mapping (well, offset to local chunks).

julesghub commented 5 years ago

This line? https://github.com/underworldcode/underworld2/blob/6da9f52268d366ae08533374afebb6f278c04576/underworld/swarm/_swarmvariable.py#L271

jmansour commented 5 years ago

Yep


gthyagi commented 5 years ago

RuntimeError Traceback (most recent call last)
<ipython-input-9-575b8a04017d> in <module>
----> 1 materialVariable.load('./swarmMatInd_sph/'+'matInd_'+str(res)+'.h5')

/usr/local/underworld2/lib/underworld/swarm/_swarmvariable.py in load(self, filename, collective)
    270 
    271         if self.swarm._checkpointMapsToState != self.swarm.stateId:
--> 272             raise RuntimeError("'Swarm' associate with this 'SwarmVariable' does not appear to be in the correct state.\n" \
    273                                "Please ensure that you have loaded the swarm prior to loading any swarm variables.")
    274         gIds = self.swarm._local2globalMap

RuntimeError: 'Swarm' associate with this 'SwarmVariable' does not appear to be in the correct state.
Please ensure that you have loaded the swarm prior to loading any swarm variables.

From this error I understand your discussion more clearly. How is the direct mapping achieved?

gthyagi commented 5 years ago

Now I will submit a job to check the loading time. Updates will be posted soon.

Here, time represents just the loading time of the swarm and material variable. The legends show the resolution of the model in the (lon, lat, r) directions and the requested number of CPUs for that resolution. swarmMat_load_time

jmansour commented 5 years ago

@gthyagi, the mapping is from the file index to the local particle index.

So you might have a swarm checkpoint file describing 1B particles, with the h5 file taking indices from 0->999999999. Each process might own 1M particles, taking local indices 0->999999. The mapping is from the local index to the h5 index; e.g. local particle index 113 needs to refer to h5 index 2345231 for its data.

When we save swarm checkpoint files, each process records a contiguous chunk of data, offset into the array so it doesn't write over other process's data. We also record a hint array to the file which specifies how many particles each process recorded.

Reloading the swarm is a bit more complex, because we make no assumptions about the number of processes previously used. Therefore, generally speaking, each process must re-read the entire file to determine which particles it should own. However, if reload conditions are identical to save conditions (same process count, same resolution, same mesh geometry), each process will own exactly the same chunk of data it saved, and this is where the hint array becomes useful. We can use this at reload time to know exactly where to start reading, and each process should try and read only the particles it saved (no more, no less). This is effectively the direct mapping.
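
As a rough sketch of that direct-mapping idea (not Underworld's actual code), each process can use the per-process particle counts recorded at save time to compute its offset and read only its own contiguous chunk. It assumes h5py built with MPI support and mpi4py; the dataset name "data" and the counts ("hint") array are hypothetical, not the actual file layout.

import numpy as np
from mpi4py import MPI
import h5py

def reload_my_chunk(filename, counts):
    # counts[i] = number of particles process i wrote at save time (the "hint").
    # Only valid if process count, resolution and mesh geometry match the save run.
    rank = MPI.COMM_WORLD.rank
    offset = int(np.sum(counts[:rank]))   # where this rank's chunk starts
    n_local = int(counts[rank])
    with h5py.File(filename, "r", driver="mpio", comm=MPI.COMM_WORLD) as f:
        return f["data"][offset:offset + n_local]   # read only our own particles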

@gthyagi, can you also confirm that the following are also identical between save() and load() operations?

  1. Process count.
  2. Mesh resolution.
  3. Mesh geometry.
jmansour commented 5 years ago

@julesghub, we should add a warning when we revert to brute force swarm reload strategies, as it will make a dramatic difference to reload times.

For very large simulations (>10k processes), swarm save/load may be prohibitively expensive. Another strategy that we could consider:

  1. At save time, generate a second swarm using a layout object.
  2. Map any required swarm variables to new swarm (using kdtree).
  3. Save swarm variables only. The swarm particles themselves will be regenerated programmatically.

This will be a little bit diffuse of course, but probably acceptable if not done too often.
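
A minimal sketch of the step-2 mapping above (assigning each particle of the layout-generated swarm the value of its nearest neighbour in the old swarm, via a kd-tree); this is not Underworld code, and the coordinate/value arrays are hypothetical numpy arrays.

import numpy as np
from scipy.spatial import cKDTree

def remap_nearest(old_coords, old_values, new_coords):
    # For each particle of the new (layout-generated) swarm, copy the value of
    # the nearest particle of the old swarm. This nearest-neighbour transfer is
    # where the slight diffusion mentioned above comes from.
    tree = cKDTree(old_coords)
    _, idx = tree.query(new_coords, k=1)
    return old_values[idx]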

julesghub commented 5 years ago

All ideas sound good to me. The approx swarm reload would be great as an option.

gthyagi commented 5 years ago

@gthyagi, the mapping is from the file index to the local particle index.

So you might have a swarm checkpoint file describing 1B particles, with the h5 file taking indices from 0->999999999. Each process might own 1M particles, taking local indices 0->999999. The mapping is from the local index to the h5 index; e.g. local particle index 113 needs to refer to h5 index 2345231 for its data.

When we save swarm checkpoint files, each process records a contiguous chunk of data, offset into the array so it doesn't write over other process's data. We also record a hint array to the file which specifies how many particles each process recorded.

Reloading the swarm is a bit more complex, because we make no assumptions about the number of processes previously used. Therefore, generally speaking, each process must re-read the entire file to determine which particles it should own. However, if reload conditions are identical to save conditions (same process count, same resolution, same mesh geometry), each process will own exactly the same chunk of data it saved, and this is where the hint array becomes useful. We can use this at reload time to know exactly where to start reading, and each process should try and read only the particles it saved (no more, no less). This is effectively the direct mapping.

@gthyagi, can you also confirm that the following are also identical between save() and load() operations?

  1. Process count.
  2. Mesh resolution.
  3. Mesh geometry.

I would like to make sure: the process count is the same as the requested CPUs, right? If so, then the process count, mesh resolution and mesh geometry are the same between save and load.

gthyagi commented 5 years ago

@gthyagi, the mapping is from the file index to the local particle index.

So you might have a swarm checkpoint file describing 1B particles, with the h5 file taking indices from 0->999999999. Each process might own 1M particles, taking local indices 0->999999. The mapping is from the local index to the h5 index; e.g. local particle index 113 needs to refer to h5 index 2345231 for its data.

When we save swarm checkpoint files, each process records a contiguous chunk of data, offset into the array so it doesn't write over other process's data. We also record a hint array to the file which specifies how many particles each process recorded.

Reloading the swarm is a bit more complex, because we make no assumptions about the number of processes previously used. Therefore, generally speaking, each process must re-read the entire file to determine which particles it should own. However, if reload conditions are identical to save conditions (same process count, same resolution, same mesh geometry), each process will own exactly the same chunk of data it saved, and this is where the hint array becomes useful. We can use this at reload time to know exactly where to start reading, and each process should try and read only the particles it saved (no more, no less). This is effectively the direct mapping.

@gthyagi, can you also confirm that the following are also identical between save() and load() operations?

  1. Process count.
  2. Mesh resolution.
  3. Mesh geometry.

Now it makes more sense why I got proper output (the figure in the issue) when 512 CPUs were used for the (256, 256, 128) resolution model, and no output (in the loading time plot) for the same resolution on 256 CPUs. I requested 512 CPUs to create the swarm particles in all models.

jmansour commented 5 years ago

@gthyagi yep process count == requested CPUs.

Here are results for load times on Magnus. Dashed line is using collective read, solid line is not. You can enable collective read using swarm.load("your_filename",collective=True), but for swarm loads collective is slower (under the test conditions). Swarm variable load is faster with collective.

Anyhow, I'll note that for this job we're using 32x32x32 elements per process, while for your largest job you're closer to 64x64x64, so perhaps your job is a little under-decomposed. Can you try lowering your resolution such that it's closer to 32x32x32 per process at 1000 procs? This way we have a clearer comparison with Magnus.

uw28_load_operations
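
For reference, a usage sketch of the collective flags discussed above, following the result that swarm load was faster without collective and swarm variable load faster with it; filenames are placeholders, and `uw` and `mesh` are assumed as in the earlier snippets.

# Assumes `import underworld as uw` and a `mesh` as in the snippets above.
swarm = uw.swarm.Swarm(mesh, particleEscape=True)
swarm.load('swarm_checkpoint.h5', collective=False)              # non-collective was faster for the swarm (per the test above)

materialVariable = swarm.add_variable("int", 1)
materialVariable.load('matInd_checkpoint.h5', collective=True)   # collective was faster for the swarm variable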

gthyagi commented 5 years ago

@gthyagi yep process count == requested CPUs.

Here are results for load times on Magnus. Dashed line is using collective read, solid line is not. You can enable collective read using swarm.load("your_filename",collective=True), but for swarm loads collective is slower (under the test conditions). Swarm variable load is faster with collective.

Anyhow, I'll note that for this job we're using 32x32x32 elements per process, while for your largest job you're closer to 64x64x64, so perhaps your job is a little under-decomposed. Can you try lowering your resolution such that it's closer to 32x32x32 per process at 1000 procs? This way we have a clearer comparison with Magnus.

uw28_load_operations

I will do this exercise using spherical regional mesh as soon as possible. Stay tuned for further updates.

julesghub commented 4 years ago

Closing due to inactivity, and Raijin is no longer in operation.