radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number: 1639694

How does EnTK distribute mesh files to all simulations? #92

Closed: uvaaland closed this issue 5 years ago

uvaaland commented 5 years ago

Consider the following diagram of the tomography PST model:

https://github.com/radical-collaboration/hpc-workflows-paper-y1/blob/master/figures/use_case_tomo_workflow_pst.pdf

How are the mesh files from the first stage made available to the forward simulations in the second stage?

Looking at the following script that was used to run the forward simulations on Titan, it looks like a symbolic link is created that points to the location of the mesh files (line 95).

https://github.com/radical-collaboration/hpc-workflows/blob/master/scripts/application_seisflow/fwd_sims.py
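For reference, a minimal sketch of what that linking step amounts to, with hypothetical paths standing in for the actual mesh and sandbox locations used in the script:

```python
import os

# Hypothetical locations: a single mesh database on the shared file system
# and the per-task sandbox created for each forward simulation.
shared_mesh_dir = "/lustre/project/tomo/DATABASES_MPI"   # assumed path
task_sandbox    = "/lustre/project/tomo/run_0001"        # assumed path

# Each task gets a symbolic link instead of its own copy, so all forward
# simulations read the same mesh files from the shared file system.
link_path = os.path.join(task_sandbox, "DATABASES_MPI")
if not os.path.islink(link_path):
    os.symlink(shared_mesh_dir, link_path)
```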

But wouldn't multiple tasks reading the same resource become an issue at scale, that is, as the number of tasks accessing it grows? On the other hand, giving each task its own copy of the folder would also be expensive. This has come up in our group meetings, and we wanted to get your input on it.

andre-merzky commented 5 years ago

Reading the same file from a shared file system (e.g., by linking the file into the task sandboxes) should be fairly efficient: the file system will move the data into local caches and optimize the transfers. Manually staging the data is likely slower and only makes sense if it can be done before the tasks start, but since the compute nodes the tasks will land on are not known in advance, that won't work.
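As an illustration of the two options in EnTK terms (a sketch only, assuming the radical.entk Task API's link_input_data/copy_input_data attributes and hypothetical file and executable names):

```python
from radical.entk import Task

# Hypothetical mesh file on the shared file system.
mesh = "/lustre/project/tomo/DATABASES_MPI/proc000000_reg1_database.bin"  # assumed

# Option 1: link the shared file into the task sandbox (shared read access;
# the file system caches the data, no extra copies are made).
t_link = Task()
t_link.executable = "./xspecfem3D"        # assumed executable name
t_link.link_input_data = [mesh]

# Option 2: stage a private copy into each sandbox (one copy per task,
# likely slower and more expensive in storage at scale).
t_copy = Task()
t_copy.executable = "./xspecfem3D"
t_copy.copy_input_data = [mesh]
```

The link variant keeps a single copy of the data on the shared file system; the copy variant stages one copy per task sandbox.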

Concurrently writing data to the same file is a different story - but it sounds like you are only considering shared read access right now, correct?

mturilli commented 5 years ago

The performance of concurrent reads tends to decrease with scale, but the point at which that degradation becomes an issue for an actual workflow, scale, and machine depends on several considerations. For example:

Depending on the observed behavior of the test runs, and barring an obvious performance problem, we tend to answer these and many other considerations by characterizing the performance of the workflow up to the desired scale on the intended machine. In this way we can evaluate the relevant trade-offs and overheads while considering alternative designs.
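For illustration, a rough sketch of such a characterization, timing shared reads at increasing reader counts (the path is a placeholder, and a real characterization would place the readers on separate compute nodes of the intended machine):

```python
import time
from multiprocessing import Pool

MESH_FILE = "/lustre/project/tomo/DATABASES_MPI/mesh.bin"  # placeholder path

def read_once(_):
    """Read the shared file end to end and return the elapsed time."""
    start = time.time()
    with open(MESH_FILE, "rb") as f:
        while f.read(16 * 1024 * 1024):  # read in 16 MB chunks
            pass
    return time.time() - start

if __name__ == "__main__":
    # Increase the number of concurrent readers and watch how the
    # per-reader read time scales on the target file system.
    for n_readers in (1, 2, 4, 8, 16, 32):
        with Pool(n_readers) as pool:
            times = pool.map(read_once, range(n_readers))
        print(f"{n_readers:3d} readers: mean read time {sum(times)/len(times):.2f}s")
```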

AFAIK, we did not characterize the overhead of concurrent reads for your use case, and no obvious performance issues were reported. Did you notice a marked slowdown, or do you have projections showing that this will become an issue at the intended scale? If so, I would be happy to discuss and help with this characterization.

From a practical point of view, I would use this characterization to evaluate alternative scenarios, for example creating a certain number of copies of the mesh and partitioning the set of forward simulations over these copies. Without a characterization, I cannot tell whether this would indeed yield better performance.
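To make that alternative concrete, a toy sketch of partitioning forward simulations over a few mesh copies (the counts and paths are hypothetical):

```python
# Hypothetical setup: 48 forward simulations partitioned over 4 mesh copies,
# so each copy is read by at most 12 tasks instead of all 48.
n_sims   = 48
n_copies = 4

mesh_copies = [f"/lustre/project/tomo/mesh_copy_{i}/DATABASES_MPI"
               for i in range(n_copies)]

# Round-robin assignment: simulation k reads from copy k % n_copies.
assignment = {sim: mesh_copies[sim % n_copies] for sim in range(n_sims)}

for sim, mesh_dir in list(assignment.items())[:5]:
    print(f"simulation {sim:02d} -> {mesh_dir}")
```

Whether the reduced contention pays for the cost of creating and storing the copies is exactly what the characterization would tell us.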

uvaaland commented 5 years ago

Thank you both for these thorough answers. This was not prompted by anything we observed in our runs; it was simply a curiosity.

In the paper you found that the optimal number of concurrent tasks on Titan was 16. Given the resources required by a single task (384 nodes), one could not run more than a few tens of concurrent tasks on Titan; spawning hundreds or thousands of concurrent tasks would not be feasible.
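As a rough back-of-the-envelope check (assuming Titan's roughly 18,688 compute nodes):

```python
# Back-of-the-envelope: with roughly 18,688 compute nodes on Titan and
# 384 nodes per forward simulation, full-machine concurrency tops out
# at a few tens of tasks.
titan_nodes   = 18688
nodes_per_sim = 384
print(titan_nodes // nodes_per_sim)  # -> 48
```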

From what I understand, part of the reason the symbolic-link solution is feasible is that 16 concurrent tasks is not a huge number. If we were talking about hundreds or thousands of tasks, that might change the equation. Am I understanding this correctly?

mturilli commented 5 years ago

I think you understand this correctly. Just keep in mind that this is usually a problem for concurrent writing and much less so for concurrent reading, and that if we want to understand how that equation changes, we have to account for factors like those I listed above. Without a quantitative description of that behavior, we do not know whether the performance difference is so small as to be irrelevant or so large as to represent a bottleneck.