To warn everyone, I do believe that this is caused by error 40 (the error sits 40cm in front of the screen).
Well, tbh, I prefer Error 40 over errors in our stack - they are usually simpler to fix (for us) :-D So, don't worry about that! :-)
Alas, I can't access your sandbox:
$ cd /scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005
-bash: cd: /scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005: Permission denied
You should be able to open it with:
$ chmod a+rX /scratch/gpfs/lsawade/
$ chmod -R a+rX /scratch/gpfs/lsawade/radical.pilot.sandbox/
Thanks!
That is true! From that point of view Error 40 is always the better error!
I ran the commands you pointed out! I thought I had run chmod -R 777 on everything, and that should have opened everything. Let me know if that worked!
Thanks. FWIW, your sysadmins probably don't like a brute-force 0777 chmod - but yes, it worked :-)
I see this in the agent_0.err log:
ValueError: Not enough gpus available (4 < 8).
There seems to be a problem with the number of allocated nodes. The config files look correct:
$ grep -e cores -e gpus agent_0.cfg
"cores": 28,
"cores_per_node": 28,
"gpus": 8,
"gpus_per_node": 4,
You said that you ran solver.py - but that contains:
'gpus': 6,
'cpus': 6
which seems to be a different setup. Can you please check what exact res_dict you have been using, and I'll try to reproduce that issue. This seems, unfortunately, to be a bug on our end :-P
PS: please remove the RADICAL_ENTK_VERBOSE line from your script, and set RADICAL_LOG_LVL=DEBUG in your environment to obtain full log files. If the logs disturb you when running the script, you can redirect them to a file by setting RADICAL_LOG_TGT=rct.log.
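If you prefer to keep that in the script rather than in the shell environment, a minimal sketch (assuming the variables are set before any radical logger is created) would be:

import os

# Assumption: these variables are read when the radical loggers are created,
# so they must be set before any EnTK/RP object exists.
os.environ['RADICAL_LOG_LVL'] = 'DEBUG'    # full debug logs
os.environ['RADICAL_LOG_TGT'] = 'rct.log'  # write logs to a file instead of stdout

from radical.entk import Pipeline, Stage, Task, AppManager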
Ok, I will try that and get back to you!
Ok, so, the resource dictionary is supposed to have the actual number of gpus and cpus of the nodes and then I request a specific amount from that pool at task level? Am I understanding this correctly?
Not quite: the resource dict is supposed to capture the resources your workload needs, and our stack is supposed to figure out how many nodes to request to have a sufficient amount of resources. Assume you need 4 cores and 16 GPUs - the stack should allocate 4 nodes (4*28 cores, 4*4 GPUs) to satisfy that, even if that means that some (well, many) cores will remain idle.
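To make that arithmetic concrete, a small illustrative sketch (not the actual scheduler code; the per-node numbers are the ones from the agent config above):

import math

def nodes_needed(cpus, gpus, cores_per_node=28, gpus_per_node=4):
    # The request has to satisfy both the core and the GPU count,
    # so take the larger of the two per-resource node counts.
    return max(math.ceil(cpus / cores_per_node),
               math.ceil(gpus / gpus_per_node))

print(nodes_needed(cpus=4, gpus=16))   # 4 nodes -> 4*28 cores, 4*4 GPUs
print(nodes_needed(cpus=6, gpus=6))    # 2 nodes for a 6-core / 6-GPU request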
Ok, I see. So when I put
res_dict = {...
'gpus': 6,
'cpus': 6
...
}
that is the correct amount I am requesting from the cluster. However, as of now, the stack has issues communicating this?
Yes - thanks for confirming then, I'll try to run this.
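For reference, the task-level side then looks roughly like the sketch below (illustrative only: a single process with one core and one GPU, using the cpu_reqs / gpu_reqs dictionaries of the Task API; the executable is a placeholder):

from radical.entk import Task

t = Task()
t.executable = './my_gpu_binary'   # hypothetical executable, for illustration only
# Per-task request: one process, one core, one GPU.
t.cpu_reqs = {'processes'          : 1,
              'process_type'       : None,
              'threads_per_process': 1,
              'thread_type'        : None}
t.gpu_reqs = {'processes'          : 1,
              'process_type'       : None,
              'threads_per_process': 1,
              'thread_type'        : None}

The 'cpus' and 'gpus' entries in the res_dict then only need to cover the sum of what all concurrently running tasks request.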
Just as a reference, all files concerning this pipeline can be found in the following repository:
I left a couple of comments in two tickets at https://github.com/lsawade/simple_entk_specfem. Looking forward to discussing this later today in our meeting.
@lsawade: can you please post again what radical-stack you are using? You should upgrade to 0.70.0 for all layers (which is the version uploaded to PyPI, so a pip install radical.entk should give you that stack on a fresh virtualenv).
I am asking because I can't seem to reproduce that problem on the current stack :-/
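If the radical-stack command is not on your path, a rough Python equivalent (a sketch using pkg_resources; adjust the package list as needed) is:

import pkg_resources

# Print the installed version of each layer of the stack (rough equivalent
# of the radical-stack command line tool).
for pkg in ('radical.utils', 'radical.saga', 'radical.pilot', 'radical.entk'):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, 'not installed')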
Princeton to provide a batch script to run the executable(s) of the workflow, independent from EnTK. This will facilitate the debugging of the reported issues.
Hi @mturilli, for the initial problem I had provided a batch script in the simple_entk_specfem repository. It's called submit_GPU.sh. I don't think I mentioned it explicitly. Sorry for the delay!
Hi @andre-merzky,
are there any updates? I'm close to finishing my full EnTK pipeline and would like to get it up and running when I'm back in Princeton at the beginning of September. At the moment the pipeline is of course - sadly - dry and untested, because I'm still on a boat and ssh-ing to the server to work is not as nice as I hoped. But most of my code is unit-tested, so the only barrier to getting it running would be the EnTK end of it.
Sorry that I'm of so little help!
Our task execution script contains these lines:
cd /tigress/amerzky/specfem3d_globe || (echo "pre_exec failed"; false) || exit
module load intel/18.0/64/18.0.3.222 || (echo "pre_exec failed"; false) || exit
module load intel-mpi/intel/2018.3/64 || (echo "pre_exec failed"; false) || exit
module load cudatoolkit/8.0 || (echo "pre_exec failed"; false) || exit
ldd /tigress/amerzky/specfem3d_globe/bin/xspecfem3D || (echo "pre_exec failed"; false) || exit
The ldd command outputs:
libmpi_usempi.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi_usempi.so.40 (0x00002ab6f1a95000)
libmpi_mpifh.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi_mpifh.so.40 (0x00002ab6f1c98000)
libmpi.so.40 => /usr/local/openmpi/3.1.3/gcc/x86_64/lib64/libmpi.so.40 (0x00002ab6f1eeb000)
which is unexpected and leads to a later MPI error when running the application code. These libraries should point to the Intel MPI module, IIUC. So we need to figure out why our stack results in a different library resolution.
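One thing that might be worth trying on our end (a sketch, not the actual pipeline code: the executable path is the one from the ldd call above, and the module purge is an assumption to rule out a default openmpi module):

from radical.entk import Task

t = Task()
t.executable = '/tigress/amerzky/specfem3d_globe/bin/xspecfem3D'
# Load the same modules as the batch script, starting from a clean module
# environment so that no default openmpi module shadows intel-mpi.
t.pre_exec   = ['module purge',                      # assumption, not in the original script
                'module load intel/18.0/64/18.0.3.222',
                'module load intel-mpi/intel/2018.3/64',
                'module load cudatoolkit/8.0']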
Hi @andre-merzky, it seems like you are using older modules. Could that be an issue? I have been working with the ones defined here: Simple_Specfem_ENTK/submit_GPU.sh.
I fixed up the pipeline and the batch script, and I already tested the batch script and it runs. You just have to create a directory where specfem runs and into which the files are copied, and then adapt the following line so that it points to that directory, either in the pipeline:
...
# Change to your specfem run directory
'cd /home/lsawade/specfem_run',
...
or in the batch script:
...
# Define your specfem run directory
cd /home/lsawade/specfem_run
...
I do not know whether this could be an issue or not. If you are 100% sure that this is not the issue, then forget about the above. Again, thanks for your efforts!
GPU support will become critical starting from December 1
AM to contact support and CC Jeroen.
Hi everyone, so I have tried running a "Hello World" example without any luck. Maybe there is a simple fix. I followed the EnTK installation instructions from the link that was provided during the last meeting and ran the following script:
from radical.entk import Pipeline, Stage, Task, AppManager
import os
# ------------------------------------------------------------------------------
# Set default verbosity
if os.environ.get('RADICAL_ENTK_VERBOSE') is None:
    os.environ['RADICAL_ENTK_REPORT'] = 'True'
# Description of how the RabbitMQ process is accessible
# No need to change/set any variables if you installed RabbitMQ as a system
# process. If you are running RabbitMQ under a docker container or another
# VM, set "RMQ_HOSTNAME" and "RMQ_PORT" in the session where you are running
# this script.
hostname = os.environ.get('RMQ_HOSTNAME', 'localhost')
port = int(os.environ.get('RMQ_PORT', 5672))
if __name__ == '__main__':

    # Create a Pipeline object
    p = Pipeline()

    # Create a Stage object
    s = Stage()

    # Create a Task object
    t = Task()
    t.name       = 'my-first-task'    # Assign a name to the task (optional, do not use ',' or '_')
    t.executable = '/bin/echo'        # Assign executable to the task
    t.arguments  = ['Hello World']    # Assign arguments for the task executable
    t.download_output_data = ['STDOUT', 'STDERR']

    # Add Task to the Stage
    s.add_tasks(t)

    # Add Stage to the Pipeline
    p.add_stages(s)

    # Create Application Manager
    appman = AppManager(hostname=hostname, port=port)

    # Create a dictionary to describe the three mandatory keys:
    # resource, walltime, and cpus
    # resource is 'local.localhost' to execute locally
    res_dict = {
        'resource': 'princeton.tiger_cpu',
        'project' : 'geo',
        'queue'   : 'gpu',
        'schema'  : 'local',
        'walltime': 200,
        'cpus'    : 1
    }

    # Assign resource request description to the Application Manager
    appman.resource_desc = res_dict

    # Assign the workflow as a set or list of Pipelines to the Application Manager
    # Note: The list order is not guaranteed to be preserved
    appman.workflow = set([p])

    # Run the Application Manager
    appman.run()
But I get the following error message without having touched the keyboard (I checked several times):
EnTK session: re.session.tigergpu.princeton.edu.lsawade.018327.0006
Creating AppManagerSetting up RabbitMQ system ok
ok
Validating and assigning resource manager ok
Setting up RabbitMQ system n/a
new session: [re.session.tigergpu.princeton.edu.lsawade.018327.0006] \
database : [mongodb://specfm:5p3cfm@two.radical-project.org:27017/specfm] ok
create pilot manager ok
submit 1 pilot(s)
[princeton.tiger_cpu:1]
ok
closing session re.session.tigergpu.princeton.edu.lsawade.018327.0006 \
close pilot manager \
wait for 1 pilot(s)
0 ok
ok
session lifetime: 18.8s ok
wait for 1 pilot(s)
0 timeout
All components terminated
Traceback (most recent call last):
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 179, in _submit_resource_request
self._pilot.wait([rp.PMGR_ACTIVE, rp.DONE, rp.FAILED, rp.CANCELED])
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/pilot/compute_pilot.py", line 535, in wait
time.sleep(0.1)
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 414, in run
self._rmgr._submit_resource_request()
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/execman/rp/resource_manager.py", line 192, in _submit_resource_request
raise KeyboardInterrupt
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "get_started_head_node.py", line 63, in <module>
appman.run()
File "/home/lsawade/.conda/envs/ve-entk/lib/python3.7/site-packages/radical/entk/appman/appmanager.py", line 439, in run
raise KeyboardInterrupt
KeyboardInterrupt
The walltime is set to 200 minutes, so the error must lie somewhere else.
The exported variables are the following:
# Radical Pilot verbose format
export RADICAL_PILOT_VERBOSE="REPORT"
export RADICAL_LOG_LVL="DEBUG"
export RADICAL_LOG_TGT="radical.log"
# Database resource
export RADICAL_PILOT_DBURL="mongodb://specfm:5p3cfm@two.radical-project.org:27017/specfm"
# RabbitMQ resource
export RMQ_HOSTNAME="two.radical-project.org"
export RMQ_PORT="33267"
Any ideas?
Tiger has proven to be unusable due to queue times. Closing for inactivity.
Hi,
As @mturilli suggested (here: #82), I'll open a new ticket for the more specific GPU problem.
To warn everyone, I do believe that this is caused by error 40 (the error sits 40cm in front of the screen).
First of all, the necessary locations:
- Specfem GPU installation: /tigress/lsawade/specfem3d_globe; the installation runs successfully using slurm.
- The EnTK directory from where I('m trying to) run things: /tigress/lsawade/entk_testing
- solver.py, in which GPU resources are requested. I think this is possibly the source of the error.
If you decide to compile specfem with GPU support by yourself again, I suggest the following sequence of commands from within the specfem directory to compile the code.
The slurm script that I used as a reference is located here.
I gave permissions to access the above directories and am going to keep my hands off of them for now. The session I'm referring to is located here:
/scratch/gpfs/lsawade/radical.pilot.sandbox/re.session.tigercpu.princeton.edu.lsawade.018103.0005
I hope this helps/resolves the issues.