radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

Gain access to read-only NFS fs on Titan #26

Closed mturilli closed 7 years ago

mpbl commented 7 years ago

We will need to check the performance of reading from NFS vs Lustre for a single earthquake simulation.
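
Something along these lines could be used to get rough numbers. This is only an approximation, since the real read happens inside the Fortran solver, and the `<project>` paths below are placeholders for the Lustre and NFS copies of the same mesh:

```python
import time
from pathlib import Path

def time_read(root, block_size=16 * 1024 * 1024):
    """Read every file under `root` and return (elapsed seconds, bytes read)."""
    start = time.perf_counter()
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file():
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(block_size)
                    if not chunk:
                        break
                    total += len(chunk)
    return time.perf_counter() - start, total

# Hypothetical locations of the same database files on the two file systems.
for label, root in [("Lustre", "/lustre/atlas/<project>/DATABASES_MPI"),
                    ("NFS", "/ccs/proj/<project>/DATABASES_MPI")]:
    elapsed, nbytes = time_read(root)
    print(f"{label}: {nbytes / 1e9:.1f} GB read in {elapsed:.1f} s")
```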

mpbl commented 7 years ago

I have looked at what we have access to.

In terms of input size, we need 62 GB for a model with a resolution of 17 s. As we are currently running two slightly different studies, making this method available to both requires 124 GB.

The issue is that /ccs/proj is only 50 GB, of which 20 GB have already been hijacked by project members, leaving roughly 30 GB free.

I'll need to figure out whether we can get more space.
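
Back-of-the-envelope arithmetic for the numbers above:

```python
# Quick check of the shortfall described above (all sizes in GB).
per_study = 62                      # one model at 17 s resolution
required = per_study * 2            # two concurrent studies -> 124 GB
quota, used = 50, 20                # /ccs/proj quota and what is already taken
available = quota - used            # 30 GB left
print(f"need {required} GB, have {available} GB, short by {required - available} GB")
```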

rmodrak commented 7 years ago

Matthieu is aware of this, but I wanted to post an update in case Wenjie, Youyi or others from Princeton might be following.

The required read-only disk space could be cut significantly by reducing redundancy in the SPECFEM3D_GLOBE database files. There is a lot of information in these files that could be recreated on the fly rather than read in, but this would require tedious modification to specfem3d/read_mesh_databases and related routines.

So it's a question of priority. Solving the read-only disk space issue would allow us to launch Titan simulations in a much cleaner way and have major benefits in terms of fault tolerance, but it would require a lot of tedious work and, in the worst case, may introduce unwanted complexity to the solver.

mpbl commented 7 years ago

@rmodrak I believe this issue mostly belongs on specfem's issue board. Would you mind posting it there to start a discussion among specfem's developers? Thanks.

mpbl commented 7 years ago

OLCF is working on it.

mturilli commented 7 years ago

The required space has been granted; it still needs to be implemented by OLCF.

mpbl commented 7 years ago

Done. I have copied some older meshes there to ensure nobody hijacks the disk space.

mpbl commented 7 years ago

Does not seem to be working well.

| Test case | Time to read input files |
| --- | --- |
| Lustre, 1 event | 4.2 s |
| NFS, 1 event | 56 s |
| NFS, 10 concurrent events | 410 to 468 s |

So I guess we will be forced to continue using this ugly "simultaneous run" feature for now.

It may have an impact on the design of EnTK, as simulations cannot be considered to be completely atomic.

mturilli commented 7 years ago

Wow! This contradicts evidence from another experiment we are collaborating on. Let us know whether we can help coordinate with OLCF.

mpbl commented 7 years ago

Thanks. I am not sure what you mean by coordinating with the OLCF. In your opinion, can this behavior be fixed so it works equally well on ro-NFS?

mturilli commented 7 years ago

I am not sure whether it can be fixed, but I doubt this is what they expect in terms of performance. I would open a ticket reporting the performance you measured and asking whether there is a problem with their NFS or whether they consider those figures normal. Apologies if you have already done this.

Depending on their answer, I would be happy to ask the other group we work with whether they have experienced (and fixed) the same issues and, if needed, to contact OLCF directly.

mpbl commented 7 years ago

I have asked OLCF why we were seeing poor performance when using DVS to read the mesh files. According to their systems people, this is normal and we should not expect any improvement from using it. They recommend sticking with Lustre.

On Summit and other similar machines, we will be able to pre-stage data in NVMe / NVRAM and the issue will fade away.
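
A rough sketch of what that pre-staging could look like, assuming a node-local NVMe mount; the paths are hypothetical placeholders, not the actual Summit layout:

```python
import shutil
from pathlib import Path

def stage_databases(src="/gpfs/<project>/DATABASES_MPI", dst="/mnt/nvme/DATABASES_MPI"):
    """Copy the read-only solver databases to node-local storage once, then reuse them."""
    dst = Path(dst)
    if not dst.exists():
        shutil.copytree(src, dst)   # one copy per node, before the solver starts
    return dst

# The solver would then read its mesh from the staged copy instead of the
# parallel file system, e.g. local_path = stage_databases()
```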

In the meantime, we probably have to keep our dirty fix around, that is, the simultaneous-run trick, even if it partially defeats the purpose of using a pilot job. We can start by grouping the events in batches of 5 to 10. This will let us recover a fraction of the benefits of using a pilot job.
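
A minimal sketch of that grouping, with hypothetical event names and a stand-in for whatever actually submits a simultaneous run:

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def launch_simultaneous_run(group):
    # Stand-in for building the run directory and submitting one solver
    # invocation that covers the whole group (e.g. via aprun on Titan).
    print(f"simultaneous run over {len(group)} events: {', '.join(group)}")

events = [f"event_{i:03d}" for i in range(47)]   # placeholder event list
for group in batches(events, 10):
    launch_simultaneous_run(group)
```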

mturilli commented 7 years ago

Abandoned due to poor performance