Open Shellfishgene opened 2 years ago
Hi @Shellfishgene , thanks for your interest in the pipeline!
Everything should happen inside the container, but there seems to be an issue with the Singularity container version for GARD+MPI. I will try to look into it asap.
I guess you have no way on your cluster to run the Docker profile?
No Docker on the cluster, I can run it on a workstation though. It's not urgent anyway... Thanks for having a look!
Getting similar problem, different log with singularity:
Failed to create a completion queue (CQ):
Hostname: endeavour2 Requested CQE: 16384 Error: Cannot allocate memory
Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component in this run.
Failed to create a completion queue (CQ):
Hostname: endeavour2 Requested CQE: 16384 Error: Cannot allocate memory
Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.
You job will continue, but Open MPI will ignore the "ud" oob component in this run.
No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.
[ERROR] This analysis requires an MPI environment to run
[endeavour2.hpc.usc.edu:161337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
Hey @Shellfishgene!
Am I understanding it right that this issue occurred when you were running poseidon on your local machine with the singularity profile? Because then I can't seem to recreate it. It runs fine for me with bats_mx1_small.fasta.
Did you try to just run the pipeline again, or with the -resume flag turned on? Also, are you running the latest release of poseidon?
Hi!
I just tried to run the pipeline with profile local and singularity, with the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup stuff. I'm not sure if that should all happen in the container or if I have to deal with configuring that to run on the server/cluster? This is gard.log:
I figured out what the problem was: I forgot to set the local profile in Nextflow and ran it with -profile singularity --cores 4. However, that seems to set ${task.cpus} to 1 for the gard task, and mpirun -np 1 causes the error; it needs to be >1. The error message from mpirun is not exactly clear... With -profile local,singularity it works.
@Shellfishgene ah great, thanks for letting us know!
So it seems that when no "execution" profile is defined, the default core number as defined here: https://github.com/hoelzer/poseidon/blob/master/nextflow.config#L15 is not distributed to the processes.
With -profile local,singularity the default value is passed to the GARD process: https://github.com/hoelzer/poseidon/blob/master/configs/local.config#L14
@fischer-hub maybe we can just add a check to poseidon.nf that task.cpus must be >1?
@hoelzer Yes, probably a good idea. I also ran into some other issues with the gard process when running with -profile slurm,singularity, so we might as well fix all of that together!
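For anyone landing here with the same error, the check suggested above could look roughly like the following shell guard. This is only a sketch under assumptions: the function name require_mpi_procs is hypothetical and not part of poseidon, and the idea is that it would be called with the value of ${task.cpus} before mpirun is started, so that the run stops with a readable message instead of the MPI error in gard.log.

```shell
# Hypothetical helper (not part of poseidon): fail early when the number of
# MPI processes is too small for GARD, which needs more than one process.
require_mpi_procs() {
    # Reject empty or non-numeric input first.
    case "$1" in
        ''|*[!0-9]*)
            echo "ERROR: number of cores must be a positive integer, got '$1'" >&2
            return 2
            ;;
    esac
    # mpirun -np 1 triggers "This analysis requires an MPI environment to run".
    if [ "$1" -lt 2 ]; then
        echo "ERROR: GARD needs more than 1 MPI process, got $1." >&2
        echo "Hint: add an execution profile, e.g. -profile local,singularity" >&2
        return 1
    fi
    return 0
}
```

Inside the gard process script this might be used as `require_mpi_procs ${task.cpus} || exit 1` right before the mpirun call; where exactly the maintainers want the check (in the process script or in poseidon.nf itself) is left open here.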