simonsobs-uk / data-centre

This tracks the issues in the baseline design of the SO:UK Data Centre at Blackett
https://souk-data-centre.readthedocs.io
BSD 3-Clause "New" or "Revised" License

MPI_ERR_NO_SUCH_FILE because different physical nodes do not have a shared file system #15

Closed ickc closed 9 months ago

ickc commented 9 months ago

Transferred from a Slack conversation.

People involved: @DanielBThomas, @chervias.


Dan Thomas 8:03 PM Hi both. Kolen, as I mentioned, Carlos has encountered some issues on Blackett that we are hoping you can help with. This is his latest message: "Do you know if, in Blackett, when I launch with 2 MPI jobs (machine_count = 2), the /tmp dir is the same one seen in both jobs? I cannot successfully run the toast tests python -c 'import toast.tests; toast.tests.run()' with MPI as described here https://souk-data-centre.readthedocs.io/en/latest/user/pipeline/3-MPI-applications/1-OpenMPI/. I get an IO error; I suspect a file cannot be accessed, since at least one of the MPI jobs cannot see a directory. But the toast tests work fine in a single job."

Carlos Hervías 8:06 PM This is the output from running the mpi.ini example exactly as described in the instructions. You can see the code failing at the end (attached: mpi-0.out.txt).

8:09 I get this MPI_ERR_NO_SUCH_FILE: no such file or directory, which I suspect is because it's trying to write a file to a directory that is not there. 8:13 This runs successfully in a single job; it only fails for me when I try parallel.

Kolen Cheung 6:28 PM Hi, Carlos. Sorry for the late reply, comments below:

  1. Different HTCondor processes see a different /tmp directory, even when they land on the same physical node (/tmp is actually a symlink to somewhere within scratch, and scratch is unique to each HTCondor process). I will add this to our documentation.
  2. Regarding your MPI_ERR_NO_SUCH_FILE situation, I am able to reproduce your error and will investigate it. (This error is stateful, as I had not seen it when I last set this up.) In the interim, you can run MPI applications on 1 node. This can be done either by changing machine_count to 1 in https://simonsobs-uk.github.io/data-centre/user/pipeline/1-classad/3-classad-parallel/, or by adapting the Vanilla Universe job to call mpirun directly without calling my wrapper script (a sketch of the latter follows below this list). I'll document the latter method in the documentation.
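A minimal sketch of the second workaround, assuming OpenMPI and TOAST are already available in the job environment; the file names, core counts, and script name below are placeholders rather than the documented example:

```
# single-node-mpi.ini (hypothetical name): run MPI on one node in the vanilla universe
universe     = vanilla
executable   = run_mpi.sh
request_cpus = 8              # logical cores; 8 logical = 4 physical here
output       = single-node-mpi.out
error        = single-node-mpi.err
log          = single-node-mpi.log
queue
```

```bash
#!/bin/bash
# run_mpi.sh (hypothetical name): launch MPI directly, one rank per physical core
mpirun -n 4 python -c 'import toast.tests; toast.tests.run()'
```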

Carlos Hervías 2:16 AM Thank you Kolen, so if I set machine_count=1 would that be a single MPI job?

Carlos Hervías 2:22 AM So the only way I can take full advantage of many processors is by running a single MPI job with 64 threads or something?

Kolen Cheung 11:26 AM Machine count is the number of nodes. I think the largest node we have right now has 20 threads, which equals 10 physical cores. Once you request machine count equal to 1 and number of cpus equal to 20, in that job you can use mpirun/mpiexec -n 10 …

11:28 Correction: 10 physical cores per CPU, but it has 2 sockets. So you can request 40 cpus and set -n 20 in MPI.

Dan Thomas 11:33 AM We have eight 64-thread machines; I think 4 can run the vanilla universe, and they can all run the parallel universe.

Kolen Cheung 11:41 AM If you set machine_count=1 and request the parallel universe, it should work. Remember that request_cpus corresponds to the number of logical cores, so requesting 64 of them corresponds to -n 32.
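For reference, the relevant ClassAd lines for that setup would look something like the sketch below; this is an assumption based on the discussion here, and the exact attribute names and wrapper invocation should be checked against the mpi.ini example in the linked documentation:

```
universe      = parallel
machine_count = 1      # stay on a single node so every rank sees the same /tmp and scratch
request_cpus  = 64     # logical cores; 64 logical = 32 physical on the 64-thread machines
# the executable (e.g. the wrapper script from the documentation above) would then
# launch the application with mpirun/mpiexec -n 32
```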

Carlos Hervías 1:17 PM OK thanks. I did some tests in the vanilla universe where I requested 32 cpus and then ran toast with mpiexec -n 8 …, for example (I'm running each job with 4 threads). What is the maximum number of cpus I can ask for right now? I could not ask for more than 32 in my tests; the job would go into the idle queue. So if I wanted to run 32 MPI jobs with 2 threads each, for example, could I do it by requesting 64 cpus?

Kolen Cheung 1:29 PM You may need to change it to the parallel universe in order to use the 64-thread machines. I think you mean an MPI job with 32 processes? When you say 2 threads each, do you mean OMP_NUM_THREADS? Bear in mind that if you request 64 cpus, that corresponds to 32 physical cores, so the recommended setting would be 32 MPI processes with OMP_NUM_THREADS=1 (which is set automatically if you use my wrapper script for the parallel universe).

1:31 I.e. N_MPI_PROCESSES times OMP_NUM_THREADS should equal the total number of physical cores. Otherwise there will be oversubscription and you will find it slower (in most cases, except those that are IO bound).
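As a concrete illustration of that rule of thumb (a sketch only; `my_app` is a placeholder and the node is assumed to have 64 logical / 32 physical cores):

```bash
# 64 requested cpus = 32 physical cores, so keep ranks x threads = 32

# pure MPI: 32 ranks x 1 thread
export OMP_NUM_THREADS=1
mpiexec -n 32 my_app

# hybrid MPI + OpenMP: 16 ranks x 2 threads
export OMP_NUM_THREADS=2
mpiexec -n 16 my_app
```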

Carlos Hervías 1:31 PM ah ok I get it, thanks! I thought the hyperthreading was on, but it is best not to use it

Kolen Cheung 1:40 PM Yes, hyperthreading is enabled, but it is only beneficial in specialized cases, so the recommendation is not to set up your OMP threading to use it unless it is proven to be useful.

Kolen Cheung 10:01 PM Hi, @Carlos Hervías, just to go back to your previous issue about the MPI error: you can safely ignore that. The TOAST 3 tests assume that different MPI processes can see the filesystem at the same path, which is not true on Blackett. I.e. the master process creates a directory at a certain path, and some of the other processes then write to that directory. It also means that any script making that assumption would break, but that should be OK most of the time, as MPI processes shouldn't rely on the filesystem to communicate anyway (so the chance of a script trying to have one process write to a file and another process read it is slim). For scripts that do assume this, it should be a quick and easy fix.
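To make the failing pattern concrete, here is a minimal sketch (not TOAST's actual code) using mpi4py; the path and file names are hypothetical:

```python
# Sketch of the assumption that breaks on Blackett: each HTCondor process gets its
# own /tmp and scratch, so a path created by rank 0 need not exist on other nodes.
import os
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

outdir = "/tmp/shared_output"  # hypothetical path

# Broken across nodes: only rank 0's node has the directory, so ranks on other
# nodes hit MPI_ERR_NO_SUCH_FILE / FileNotFoundError when writing into it.
if rank == 0:
    os.makedirs(outdir, exist_ok=True)
comm.Barrier()

# One possible fix: every rank creates the directory it needs on its own node.
os.makedirs(outdir, exist_ok=True)
with open(os.path.join(outdir, f"rank_{rank}.txt"), "w") as f:
    f.write(f"data from rank {rank}\n")
```

Note that per-rank local output like this would then need to be collected afterwards (e.g. transferred back by HTCondor or gathered over MPI), since the files live on different nodes.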

ickc commented 9 months ago

To reiterate, this results from an assumption that different MPI processes can access the same filesystem at the same path, which is not true at Blackett. Any script or test that makes this assumption needs to be modified to remove it; such an assumption is typically not needed in MPI applications anyway.