radical-collaboration / hpc-workflows

NSF16514 EarthCube Project - Award Number:1639694

EnTK on the TigerGPU cluster #95

Closed: lsawade closed this issue 4 years ago

lsawade commented 5 years ago

Hi,

As @mturilli suggested (here: #82), I'll open a new ticket for the more specific GPU problem.

To warn everyone, I do believe that this is caused by error 40 (the error sits 40cm in front of the screen).

First of all, the necessary locations:

If you decide to compile specfem with GPU support yourself again, I suggest the following sequence of commands from within the specfem directory:

make clean
./configure.tiger.bash.GPU
make -j all

The Slurm submission script that I used as a reference is located here:

/tigress/lsawade/specfem3d_globe/submit_GPU.sh

I gave permissions to access the above directories and am going to keep my hands off of them for now. The session I'm referring to is located here:

/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005

I hope this helps/resolves the issues.

andre-merzky commented 5 years ago

To warn everyone, I do believe that this is caused by error 40 (the error sits 40cm in front of the screen).

Well, tbh, I prefer Error 40 over errors in our stack - they are usually simpler to fix (for us) :-D So, don't worry about that! :-)

Alas, I can't access your sandbox:

$ cd /scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005
-bash: cd: /scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005: Permission denied

You should be able to open it with:

$ chmod a+rX /scratch/gpfs/lsawade/
$ chmod -R a+rX /scratch/gpfs/lsawade/radical.pilot.sandbox/

Thanks!

lsawade commented 5 years ago

That is true! From that point of view Error 40 is always the better error!

I ran the commands you pointed out! I thought I had already run chmod -R 777 on everything, which should have opened everything up. Let me know if that worked!

andre-merzky commented 5 years ago

Thanks. FWIW, your sysadmins probably don't like a brute-force 0777 chmod - but yes, it worked :-)

I see this in the agent_0.err log:

ValueError: Not enough gpus available (4 < 8).

There seems to be a problem with the number of allocated nodes. The config files look correct:

$ grep -e cores -e gpus agent_0.cfg
    "cores": 28,
    "cores_per_node": 28,
    "gpus": 8,
    "gpus_per_node": 4,

You said that you ran solver.py - but that contains:

                'gpus': 6,
                'cpus': 6

which seems to be a different setup. Can you please check what exact res_dict you have been using? I'll then try to reproduce the issue. This seems, unfortunately, to be a bug on our end :-P

andre-merzky commented 5 years ago

PS: please remove the RADICAL_ENTK_VERBOSE line from your script and set RADICAL_LOG_LVL=DEBUG in your environment to obtain full log files. If the logs disturb you when running the script, you can redirect them to a file by setting RADICAL_LOG_TGT=rct.log.
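For reference, a minimal sketch of how to set these from the launch script itself (the variable names are the ones from the comment above; setting them before the radical imports is an assumption about when the loggers are configured):

import os

# Configure radical logging before importing the radical modules,
# so the loggers pick up these settings.
os.environ['RADICAL_LOG_LVL'] = 'DEBUG'    # full debug logging
os.environ['RADICAL_LOG_TGT'] = 'rct.log'  # redirect log output to a file

from radical.entk import Pipeline, Stage, Task, AppManager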

lsawade commented 5 years ago

Ok, I will try that and get back to you!

lsawade commented 5 years ago

Ok, so the resource dictionary is supposed to list the actual number of GPUs and CPUs of the nodes, and then I request a specific amount from that pool at the task level? Am I understanding this correctly?

andre-merzky commented 5 years ago

Ok, so the resource dictionary is supposed to list the actual number of GPUs and CPUs of the nodes, and then I request a specific amount from that pool at the task level? Am I understanding this correctly?

Not quite: the resource dict is supposed to capture the resources your workload needs, and our stack is supposed to figure out how many nodes to request to provide a sufficient amount of resources. Assume you need 4 cores and 16 GPUs - the stack should allocate 4 nodes (4 x 28 = 112 cores, 4 x 4 = 16 GPUs) to satisfy that, even if that means that some (well, many) cores will remain idle.
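To make that concrete with the TigerGPU node geometry from the agent_0.cfg above (28 cores and 4 GPUs per node), a hypothetical resource description asking for 4 cores and 16 GPUs could look like this sketch (the resource label and queue are the ones used elsewhere in this thread; the walltime is made up):

# Hypothetical request: the numbers describe the workload, not the nodes.
res_dict = {
    'resource': 'princeton.tiger_cpu',
    'queue'   : 'gpu',
    'walltime': 30,
    'cpus'    : 4,    # total cores the workload needs
    'gpus'    : 16    # total GPUs the workload needs
}

# With 4 GPUs per node, 16 GPUs require 4 nodes, so 4 x 28 = 112 cores and
# 4 x 4 = 16 GPUs end up being allocated; the surplus cores remain idle.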

lsawade commented 5 years ago

Ok, I see. So when I put

res_dict = {...
             'gpus': 6,
             'cpus': 6
             ...
             }

That is the actual amount I am requesting from the cluster. However, as of now, the stack has issues communicating this?

andre-merzky commented 5 years ago

Yes - thanks for confirming. I'll try to run this then.

lsawade commented 5 years ago

Just as a reference, all files concerning this pipeline can be found in the following repository:

simple_entk_specfem

mturilli commented 5 years ago

I left a couple of comments in two tickets at https://github.com/lsawade/simple_entk_specfem. Looking forward to discussing this later today in our meeting.

andre-merzky commented 5 years ago

@lsawade : can you please post again what radical-stack you are using? You should upgrade to 0.70.0 for all layers (which is the version uploaded to PyPI, so a pip install radical.entk on a fresh virtualenv should give you that stack).

I am asking because I can't seem to reproduce that problem on the current stack :-/
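For completeness, a sketch of how to report the versions of all layers from inside the virtualenv (this assumes each radical package exposes a version attribute, which recent releases do; the radical-stack command line tool prints similar information):

import radical.utils as ru
import radical.saga  as rs
import radical.pilot as rp
import radical.entk  as re

# Print the version of each layer of the RADICAL stack.
for name, mod in [('radical.utils', ru), ('radical.saga',  rs),
                  ('radical.pilot', rp), ('radical.entk',  re)]:
    print('%-15s: %s' % (name, mod.version))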

mturilli commented 5 years ago

Princeton to provide a batch script that runs the executable(s) of the workflow independently of EnTK. This will facilitate debugging of the reported issues.

lsawade commented 5 years ago

Hi @mturilli, for the initial problem I had provided a batch script in the simple_entk_specfem repository. It's called submit_GPU.sh. I don't think I mentioned it explicitly. Sorry for the delay!

lsawade commented 5 years ago

Hi @andre-merzky,

are there any updates? I'm close to finishing my full EnTK pipeline and would like to get it up and running when I'm back in Princeton at the beginning of September. At the moment the pipeline is of course - sadly - dry and untested, because I'm still on a boat and ssh-ing to the server to work is not as nice as I had hoped. But most of my code is unit-tested, so the only barrier to getting it running would be the EnTK end of it.

Sorry that I'm of so little help!

andre-merzky commented 5 years ago

Our task execution script contains these lines:

cd /tigress/amerzky/specfem3d_globe ||  (echo "pre_exec failed"; false) || exit
module load intel/18.0/64/18.0.3.222 ||  (echo "pre_exec failed"; false) || exit
module load intel-mpi/intel/2018.3/64 ||  (echo "pre_exec failed"; false) || exit
module load cudatoolkit/8.0 ||  (echo "pre_exec failed"; false) || exit
ldd /tigress/amerzky/specfem3d_globe/bin/xspecfem3D ||  (echo "pre_exec failed"; false) || exit

The ldd command outputs:

        libmpi_usempi.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi_usempi.so.40 (0x00002ab6f1a95000)
        libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi_mpifh.so.40 (0x00002ab6f1c98000)
        libmpi.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi.so.40 (0x00002ab6f1eeb000)

which is unexpected and leads to a later MPI error when running the application code. This should point to the Intel MPI module instead, IIUC. So we need to figure out why our stack results in a different library resolution.
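As a quick way to reproduce that check outside of our stack, here is a hypothetical diagnostic sketch (not part of EnTK) that runs ldd on the solver binary and reports which MPI flavor the loader resolves; comparing its output inside and outside an RCT-launched task should show where the resolution diverges:

import subprocess

# Path quoted from the pre_exec lines above
BINARY = '/tigress/amerzky/specfem3d_globe/bin/xspecfem3D'

# Run ldd and flag every MPI-related library with its apparent MPI flavor.
out = subprocess.run(['ldd', BINARY], capture_output=True, text=True).stdout
for line in out.splitlines():
    if 'mpi' in line.lower():
        flavor = ('Intel MPI' if 'intel'   in line.lower() else
                  'Open MPI'  if 'openmpi' in line.lower() else
                  'unknown')
        print('%-10s %s' % (flavor, line.strip()))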

lsawade commented 5 years ago

Hi @andre-merzky, it seems like you are using older modules. Could that be an issue? I have been working with the ones defined here: Simple_Specfem_ENTK/submit_GPU.sh.

I fixed up the pipeline and the batch script, and I have already tested the batch script; it runs. You just have to create a directory where specfem runs and into which things are copied, and then adapt the following line in either:

I do not know whether this could be an issue or not. If you are 100% sure that this is not the issue, then forget about the above. Again, thanks for your efforts!

mturilli commented 4 years ago

GPU support will become critical starting from December 1.

andre-merzky commented 4 years ago

AM to contact support and CC Jeroen.

lsawade commented 4 years ago

Hi everyone, so I have tried running a "Hello World" example, without any luck. Maybe there is a simple fix. I followed the EnTK installation instructions from the link provided during the last meeting and ran the following script:

from radical.entk import Pipeline, Stage, Task, AppManager
import os

# ------------------------------------------------------------------------------
# Set default verbosity

if os.environ.get('RADICAL_ENTK_VERBOSE') is None:
    os.environ['RADICAL_ENTK_REPORT'] = 'True'

# Description of how the RabbitMQ process is accessible
# No need to change/set any variables if you installed RabbitMQ as a system
# process. If you are running RabbitMQ under a docker container or another
# VM, set "RMQ_HOSTNAME" and "RMQ_PORT" in the session where you are running
# this script.
hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', 5672))  # env values are strings, the port must be an int

if __name__ == '__main__':

    # Create a Pipeline object
    p = Pipeline()

    # Create a Stage object
    s = Stage()

    # Create a Task object
    t = Task()
    t.name = 'my-first-task'        # Assign a name to the task (optional, do not use ',' or '_')
    t.executable = '/bin/echo'   # Assign executable to the task
    t.arguments = ['Hello World']  # Assign arguments for the task executable
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add Task to the Stage
    s.add_tasks(t)

    # Add Stage to the Pipeline
    p.add_stages(s)

    # Create Application Manager
    appman = AppManager(hostname=hostname, port=port)

    # Create a resource dictionary describing the three mandatory keys:
    # resource, walltime, and cpus
    # resource is 'local.localhost' to execute locally
    res_dict = {

        'resource':  'princeton.tiger_cpu',
        'project' : 'geo',
        'queue'   : 'gpu',
        'schema'   : 'local',
        'walltime': 200,
        'cpus': 1
    }

    # Assign resource request description to the Application Manager
    appman.resource_desc = res_dict

    # Assign the workflow as a set or list of Pipelines to the Application Manager
    # Note: The list order is not guaranteed to be preserved
    appman.workflow = set([p])

    # Run the Application Manager
    appman.run()

But I get the following error message, without having touched the keyboard (I checked several times):

EnTK session: re.session.tigergpu.princeton.edu.lsawade.018327.0006
Creating AppManagerSetting up RabbitMQ system                                 ok
                                                                              ok
Validating and assigning resource manager                                     ok
Setting up RabbitMQ system                                                   n/a
new session: [re.session.tigergpu.princeton.edu.lsawade.018327.0006]           \
database   : [mongodb://specfm:5p3cfm@two.radical-project.org:27017/specfm]   ok
create pilot manager                                                          ok
submit 1 pilot(s)
        [princeton.tiger_cpu:1]
                                                                              ok
closing session re.session.tigergpu.princeton.edu.lsawade.018327.0006          \
close pilot manager                                                            \
wait for 1 pilot(s)
              0                                                               ok
                                                                              ok
session lifetime: 18.8s                                                       ok
wait for 1 pilot(s)
              0                                                          timeout
All components terminated
Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
    self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 535, in wait
    time.sleep(0.1)
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 414, in run
    self._rmgr._submit_resource_request()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
    raise KeyboardInterrupt
KeyboardInterrupt

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "get_started_head_node.py", line 63, in <module>
    appman.run()
  File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 439, in run
    raise KeyboardInterrupt
KeyboardInterrupt

The walltime is set to 200 (minutes), so the error must lie elsewhere.

The exported variables are the following:

# Radical Pilot verbose format
export RADICAL_PILOT_VERBOSE="REPORT"
export RADICAL_LOG_LVL="DEBUG"
export RADICAL_LOG_TGT="radical.log"

# Database resource
export RADICAL_PILOT_DBURL="mongodb://specfm:5p3cfm@two.radical-project.org:27017/specfm"

# RabbitMQ resource
export RMQ_HOSTNAME="two.radical-project.org"
export RMQ_PORT="33267"

Any ideas?

mturilli commented 4 years ago

Tiger proved to be unusable due to queue times. Closing for inactivity.