GARD fails due to MPI setup (?)

Shellfishgene commented 2 years ago

Hi!

I just tried to run the pipeline with profile local and singularity, with the test data bats_mx1_small.fasta. However, GARD fails, apparently due to some MPI setup problem. I'm not sure whether all of that should happen inside the container or whether I have to configure MPI myself to run on the server/cluster. This is gard.log:

libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
libi40iw-i40iw_ucreate_qp: failed to create QP, unsupported QP type: 0x4
--------------------------------------------------------------------------
Failed to create a queue pair (QP):

Hostname: host
Requested max number of outstanding WRs in the SQ:                1
Requested max number of outstanding WRs in the RQ:                2
Requested max number of SGEs in a WR in the SQ:                   1023
Requested max number of SGEs in a WR in the RQ:                   1023
Requested max number of data that can be posted inline to the SQ: 0
Error:    File exists

Check requested attributes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI has detected that there are UD-capable Verbs devices on your
system, but none of them were able to be setup properly.  This may
indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component
in this run.

Hostname: host
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            host
  Device name:           i40iw0
  Device vendor ID:      0x8086
  Device vendor part ID: 14290

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           host
  Local device:         i40iw0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------------------------------
[ERROR] This analysis requires an MPI environment to run

[host:1017209] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[host:1017209] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[host:1017209] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

hoelzer commented 2 years ago

Hi @Shellfishgene , thanks for your interest in the pipeline!

Everything should happen inside the container, but there seems to be an issue with the Singularity container version for GARD+MPI. I will try to look into it asap.

I guess you have no way on your cluster to run the Docker profile?

Shellfishgene commented 2 years ago

No Docker on the cluster, I can run it on a workstation though. It's not urgent anyway... Thanks for having a look!

mchaisso commented 2 years ago

I'm getting a similar problem with Singularity, but with a different log:

Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.

Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component in this run.

Hostname: endeavour2

Failed to create a completion queue (CQ):

Hostname: endeavour2
Requested CQE: 16384
Error: Cannot allocate memory

Check the CQE attribute.

Open MPI has detected that there are UD-capable Verbs devices on your system, but none of them were able to be setup properly. This may indicate a problem on this system.

You job will continue, but Open MPI will ignore the "ud" oob component in this run.

Hostname: endeavour2

No OpenFabrics connection schemes reported that they were able to be used on a specific port. As such, the openib BTL (OpenFabrics support) will be disabled for this port.

Local host: endeavour2
Local device: mlx4_0
Local port: 1
CPCs attempted: udcm

[ERROR] This analysis requires an MPI environment to run

[endeavour2.hpc.usc.edu:161337] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

fischer-hub commented 2 years ago

Hey @Shellfishgene! Do I understand correctly that this issue occurred when you were running poseidon on your local machine with the singularity profile? If so, I can't seem to reproduce it: it runs fine for me with bats_mx1_small.fasta. Did you try running the pipeline again, or with the -resume flag turned on? Also, are you running the latest release of poseidon?

Shellfishgene commented 2 years ago

I figured out what the problem was: I forgot to set the local profile in Nextflow and ran the pipeline with -profile singularity --cores 4. However, that seems to set ${task.cpus} to 1 for the gard task, and mpirun -np 1 causes the error. The value needs to be >1. The error message from mpirun is not exactly clear... With -profile local,singularity it works.
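The mix-up described here (a container profile given without an execution profile) could in principle be caught before launching the pipeline. A minimal sketch, assuming the execution profiles are the two named in this thread (local and slurm); the function name has_execution_profile is hypothetical and not part of poseidon:

```shell
#!/bin/sh
# Hypothetical pre-flight helper, not part of poseidon.
# Succeeds (exit status 0) only if the comma-separated -profile value
# contains an execution profile. The list "local, slurm" is an assumption
# taken from the profiles mentioned in this thread.
has_execution_profile() {
    # Wrap the value in commas so every entry can be matched as ",name,".
    case ",$1," in
        *,local,*|*,slurm,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With this sketch, `has_execution_profile "singularity"` returns a non-zero status, while `has_execution_profile "local,singularity"` succeeds.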

hoelzer commented 2 years ago

@Shellfishgene ah great, thanks for letting us know!

So it seems that when no "execution" profile is defined, the default core number is not distributed to the processes. The default is defined here: https://github.com/hoelzer/poseidon/blob/master/nextflow.config#L15

With -profile local,singularity the default value is passed to the GARD process: https://github.com/hoelzer/poseidon/blob/master/configs/local.config#L14

@fischer-hub maybe we can just add a check to poseidon.nf that task.cpus is >1?
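One way such a check might look, written as a shell guard that would run before mpirun. This is only a sketch of the idea, not the check that was added to poseidon.nf; the function name require_mpi_cpus is hypothetical, and the ">1" requirement is taken from the report above rather than from the GARD documentation:

```shell
#!/bin/sh
# Hypothetical guard, not taken from poseidon.nf.
# The argument is the process count that would be passed to "mpirun -np";
# in the pipeline this would presumably be the value of task.cpus.
require_mpi_cpus() {
    cpus=$1
    # Reject empty or non-numeric input first.
    case "$cpus" in
        ''|*[!0-9]*)
            echo "CPU count must be a positive integer, got '${cpus}'" >&2
            return 2
            ;;
    esac
    # Per the report in this thread, GARD fails with a single MPI process.
    if [ "$cpus" -lt 2 ]; then
        echo "GARD needs more than 1 CPU for mpirun -np, got ${cpus}" >&2
        return 1
    fi
    return 0
}
```

Failing early with an explicit message avoids the misleading "This analysis requires an MPI environment to run" error seen in gard.log.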

fischer-hub commented 2 years ago

@hoelzer Yes, probably a good idea. I also ran into some other issues with the gard process when running with -profile slurm,singularity, so we might as well fix all of that together!
