Hi @jmansour @julesghub, I have successfully run low-resolution 3D spherical models on Raijin. The following image was created at (256, 256, 128) resolution in the (lon, lat, rad) directions.
Increasing the resolution to (384, 512, 256), (512, 512, 256), or (768, 1024, 256) produces this error:
Note: I am loading the h5 files of the indexed swarm coordinates and material variable instead of creating them on the cluster.
Thanks.
Not much to go on with that general message. Could you send the preceding part of the output? We need to determine at what stage this error occurred e.g. swarm load, solve.
@julesghub, I am still performing a few more tests with high-resolution models and will provide detailed statistics soon. At present I have submitted a job at (384, 512, 256) resolution, requesting 1024 CPUs and 2.0 TB of memory; that job exceeded the walltime (4 hrs). The (256, 256, 128) resolution model took 1 hr 14 min, from which I made a rough guess that 4 hrs should be sufficient to run the (384, 512, 256) resolution model on 1024 CPUs and 2.0 TB of memory.
The output file contains the walltime-exceeded message followed by the general error above.
Right, so it looks like the reload of the swarm is the culprit: it's too slow and is exceeding the walltime. The algorithm that finds particles in the "irregular" mesh case is fine when you've generated the particles, but when you load particles in from another mesh it's slow. I suspect this is the big bottleneck for your large runs.
There are some tricks we could do: employ a kd-tree (not yet parallel safe), or transform the coordinates into a regular space for the search. I prefer the kd-tree as it's a general solution, but at present it's not available in parallel. @jmansour, thoughts?
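For illustration, the kd-tree idea amounts to something like the following (a minimal SciPy sketch, not Underworld's internal search; centroids and coords are assumed NumPy arrays of element centroids and loaded particle coordinates):

import numpy as np
from scipy.spatial import cKDTree

# centroids: (n_elements, dim) element centroids, however obtained
# coords:    (n_particles, dim) particle coordinates read from the h5 file
tree = cKDTree(centroids)                # build once, O(n log n)
_, candidates = tree.query(coords, k=4)  # a few nearest elements per particle
# Only these candidate elements need the exact point-in-element test,
# rather than brute-force scanning every cell of the irregular mesh.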
@julesghub My high-resolution model workflow:
- First I create and save the swarm coordinates and material variable.
- Then I use a MATLAB script to index the swarm coordinates in serial.
- Finally, the swarm coordinate and material variable files are loaded on the cluster to solve the Stokes system.

The suspected step where time is wasted enormously is loading the swarm coordinates and material variable. Q: Is there a size limit on the h5 files of the swarm coordinates and material variable? When a file is loaded, is it distributed over all processors or loaded onto a single processor?
Yes. Some stats on the file sizes I am uploading:
- (384, 512, 256) res model: swarm size 23 GB, matVar size 3.8 GB
- (512, 512, 256) res model: swarm size 30 GB, matVar size 10 GB
- (768, 1024, 256) res model: swarm size 90 GB, matVar size 30 GB
- current working case, (256, 256, 128) res model: swarm size 3.8 GB, matVar size 640 MB
We can develop a parallel-safe kd-tree; it's pretty lightweight.
So for a 320^3 element model with 40 particles per cell (1.3 billion particles), it takes around 1 minute to load the swarm (on Magnus). However, the mesh is regular, but more importantly, it is re-reading a checkpoint file that we created, and therefore leverages the ordering of particles for efficiency (each proc starts by reading the particles it expects to own).
Your job is larger, and the particle ordering cannot be considered when re-reading. This therefore reduces to a serial operation, with all processes needing to consider the entire swarm checkpoint file. This operation therefore does not scale, so I'm not surprised you're hitting the walltime limit.
A kd-tree etc. may help, but it will depend on whether this is an IO bottleneck or a CPU bottleneck (the owning-element search).
Can you remind me why you're defining your own particle coordinates to begin with, instead of using a layout object?
@jmansour I am using a layout. First I created and saved the swarm coordinates and material variable:
swarm = uw.swarm.Swarm( mesh=mesh, particleEscape=True )
swarmLayout = uw.swarm.layouts.PerCellSpaceFillerLayout( swarm=swarm, particlesPerCell=20 )
swarm.populate_using_layout( layout=swarmLayout )
swarm.save(outputPath + 'swarm_'+str(res)+'.h5')
Then the saved swarm coordinates are indexed in serial using a MATLAB script, which I use to create the 3D geometries, e.g. indexing all particles in the oceanic lithosphere, the continental lithosphere, the upper mantle (UM), or the lower mantle (LM); a sketch of this step follows.
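For illustration, the indexing step amounts to something like this (a minimal NumPy/h5py sketch assuming a simple radius-based layering; the radius bounds and the 'data' dataset name are assumptions, and the real MATLAB script builds full 3D geometries):

import h5py
import numpy as np

with h5py.File('swarm_'+str(res)+'.h5', 'r') as f:
    coords = f['data'][:]                         # particle coordinates (dataset name assumed)
r = np.linalg.norm(coords, axis=1)                # radius of each particle
matInd = np.full(len(coords), 3, dtype=np.int32)  # default: lower mantle (hypothetical index)
matInd[r > 0.85] = 2                              # upper mantle (hypothetical bound)
matInd[r > 0.97] = 1                              # lithosphere (hypothetical bound)

Once this indexing is done, I load the swarm and matVar files on Raijin to solve the Stokes system: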
swarm = uw.swarm.Swarm(mesh, particleEscape=True)
swarm.load('./swarmMatInd_sph/'+'swarm_'+str(res)+'.h5')
materialVariable = swarm.add_variable("int", 1)
materialVariable.load('./swarmMatInd_sph/'+'matInd_'+str(res)+'.h5')
Now I will submit a job to check the loading time; updates will be posted soon.
So is this your process?:
1. Generate the swarm with a layout and save() it.
2. Index the saved coordinates with the MATLAB script, producing the matInd file.
3. Reload the swarm and the matInd file for the Stokes solve.

If that's the case, really we don't even need to reload the swarm, as the programmatic swarm will be identical to the loaded one. Unfortunately we currently have a check which will prevent this, but perhaps an override isn't a bad idea. @julesghub?
However, in this case I'm also less certain as to why it's so slow. @gthyagi, can you confirm that the checkpointed swarm file is unchanged when you attempt to reload it?
@jmansour Yes, it is unchanged: the swarm coords are created in step 1 and the corresponding material indices in step 2 using MATLAB. I am using these swarm coords and material indices as the starting configuration in my models. Currently I am running instantaneous models (i.e., 2 time steps).
That would work as long as the mesh used in step 1 is the same mesh, i.e. the same resolution/size, as in step 3.
What's the check preventing this? An override sounds good.
When we reload the swarm, we build an array which specifies how to index into the swarm variable data (i.e., the mapping from local index to file index). If we haven't reloaded the swarm, this doesn't exist.
However, in @gthyagi's case it will be a direct mapping (well, offset to local chunks).
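In that special case the map is trivial; conceptually (a sketch with illustrative names, not Underworld internals, where counts[p] is the number of particles rank p wrote):

import numpy as np
offset = int(np.sum(counts[:rank]))               # particles written by lower ranks
local2global = offset + np.arange(counts[rank])   # local index i -> file index offset + i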
This line?

Yep.
RuntimeError                              Traceback (most recent call last)
<ipython-input-9-575b8a04017d> in <module>
----> 1 materialVariable.load('./swarmMatInd_sph/'+'matInd_'+str(res)+'.h5')

/usr/local/underworld2/lib/underworld/swarm/_swarmvariable.py in load(self, filename, collective)
    270
    271         if self.swarm._checkpointMapsToState != self.swarm.stateId:
--> 272             raise RuntimeError("'Swarm' associate with this 'SwarmVariable' does not appear to be in the correct state.\n" \
    273                                "Please ensure that you have loaded the swarm prior to loading any swarm variables.")
    274         gIds = self.swarm._local2globalMap

RuntimeError: 'Swarm' associate with this 'SwarmVariable' does not appear to be in the correct state.
Please ensure that you have loaded the swarm prior to loading any swarm variables.
From this error I understand your discussion more clearly. How is the direct mapping achieved?
Here, the time represents just the loading time of the swarm and material variable. The legend gives the resolution of the model in the (lon, lat, r) directions and the number of CPUs requested for that resolution.
@gthyagi, the mapping is from the file index to the local particle index.
So you might have a swarm checkpoint file describing 1B particles, with the h5 file taking indices 0->999999999. Each process might then own 1M particles, taking local indices 0->999999. The mapping is from the local index to the h5 index: e.g., local particle index 113 needs to refer to h5 index 2345231 for its data.
When we save swarm checkpoint files, each process records a contiguous chunk of data, offset into the array so it doesn't write over other process's data. We also record a hint array to the file which specifies how many particles each process recorded.
Reloading the swarm is a bit more complex, because we make no assumptions about the number of processes previously used. Therefore, generally speaking, each process must re-read the entire file to determine which particles it should own. However, if reload conditions are identical to save conditions (same process count, same resolution, same mesh geometry), each process will own exactly the same chunk of data it saved, and this is where the hint array becomes useful. We can use this at reload time to know exactly where to start reading, and each process should try and read only the particles it saved (no more, no less). This is effectively the direct mapping.
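A minimal sketch of the fast path, assuming parallel h5py/mpi4py and hypothetical dataset names ('data' for particles, 'proc_counts' for the hint array); it illustrates the idea rather than Underworld's actual implementation:

import numpy as np
import h5py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

with h5py.File('swarm.h5', 'r', driver='mpio', comm=comm) as f:
    counts = f['proc_counts'][:]              # hint array: particles each rank wrote
    if len(counts) == nprocs:                 # same decomposition: direct mapping
        offsets = np.concatenate(([0], np.cumsum(counts)))
        start, end = offsets[rank], offsets[rank + 1]
        my_coords = f['data'][start:end]      # each rank reads only the chunk it saved
    else:                                     # decomposition changed: no shortcut
        my_coords = f['data'][:]              # every rank scans the whole file (slow)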
@gthyagi, can you also confirm that the following are identical between save() and load() operations?
- Process count.
- Mesh resolution.
- Mesh geometry.
@julesghub, we should add a warning when we revert to brute-force swarm reload strategies, as it makes a dramatic difference to reload times.
For very large simulations (>10k processes), swarm save/load may be prohibitively expensive. Another strategy we could consider is an approximate swarm reload (sketched below). This will be a little bit diffuse of course, but probably acceptable if not done too often.
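One plausible form of the approximate reload, as a hedged sketch (saved_coords and saved_mat are hypothetical arrays read from the old checkpoint; the nearest-neighbour transfer is an assumption, not a confirmed design):

import numpy as np
from scipy.spatial import cKDTree
import underworld as uw

# Regenerate the particles with a layout instead of restoring exact positions...
swarm = uw.swarm.Swarm(mesh=mesh, particleEscape=True)
layout = uw.swarm.layouts.PerCellSpaceFillerLayout(swarm=swarm, particlesPerCell=20)
swarm.populate_using_layout(layout=layout)
materialVariable = swarm.add_variable("int", 1)

# ...then copy each new particle's value from the nearest saved particle.
tree = cKDTree(saved_coords)
_, nearest = tree.query(swarm.data)
materialVariable.data[:, 0] = saved_mat[nearest]   # slightly diffuse, but cheap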
All ideas sound good to me. The approx swarm reload would be great as an option.
I would like to make sure: the process count is the same as the requested CPUs, right? If so, the process count, mesh resolution, and mesh geometry are the same between save and load.
Now it makes more sense why I got proper output (the figure in the issue) when 512 CPUs were used for the (256, 256, 128) resolution model, and no output (from the loading-time plot) for the same resolution on 256 CPUs. I requested 512 CPUs to create the swarm particles in all models.
@gthyagi yep, process count == requested CPUs.
Here are the results for load times on Magnus. The dashed line uses collective read, the solid line does not. You can enable collective read using swarm.load("your_filename", collective=True), but for swarm loads collective is slower (under the test conditions); swarm variable loads are faster with collective.
Anyhow, I'll note that for this job we're using 32x32x32 elements per process, while for your largest job you're closer to 64x64x64, so perhaps your job is a little under-decomposed. Can you try lowering your resolution so that it's closer to 32x32x32 per process at 1000 procs? That way we have a clearer comparison with Magnus.
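Putting that advice together with the earlier snippets, the reload might look like this (a sketch; the collective flags follow the Magnus timings above and may differ on other systems):

# Non-collective swarm load, collective variable load, per the timings above.
swarm = uw.swarm.Swarm(mesh, particleEscape=True)
swarm.load('./swarmMatInd_sph/'+'swarm_'+str(res)+'.h5', collective=False)
materialVariable = swarm.add_variable("int", 1)
materialVariable.load('./swarmMatInd_sph/'+'matInd_'+str(res)+'.h5', collective=True)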
I will do this exercise using the spherical regional mesh as soon as possible. Stay tuned for further updates.
Closing due to inactivity, and Raijin is no longer in service.