nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Nextflow ignoring GPU limits #1627

Closed r0f1 closed 4 years ago

r0f1 commented 4 years ago

Hi, consider the following code:

// main.nf
process hello {
    executor "google-lifesciences"
    machineType "n1-highmem-4"
    accelerator 1, type: "nvidia-tesla-t4"     // request one T4 GPU for the job's VM
    container "eu.gcr.io/project/image:latest"
    containerOptions "--gpus all"              // intended to expose the GPU inside the container

    """
    python my_gpu_application.py
    """
}
// nextflow.config
google {
    project = "project"
    region = "europe-west4"

    lifeSciences {
        bootDiskSize = "200 GB"
        debug = true
    }
}
docker {
    enabled = true
    runOptions = "--user='root'"
}

There is actually another process before hello that causes hello to be spawned several times in parallel. If that upstream process spawns only one instance of hello, so that only one hello task runs at a time, everything works fine. Likewise, if I add the directive maxForks 1 to hello, everything works fine. However, if multiple hello tasks run in parallel, my Python script fails with an error saying there is not enough memory available.

How can I ensure that Nextflow schedules exactly one instance of hello per physical machine?
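For reference, a minimal sketch of the maxForks workaround mentioned above; it serialises the hello tasks rather than spreading them over separate machines, so the parallelism is lost:

// main.nf -- sketch of the maxForks workaround (same process as above)
process hello {
    executor "google-lifesciences"
    machineType "n1-highmem-4"
    accelerator 1, type: "nvidia-tesla-t4"
    container "eu.gcr.io/project/image:latest"
    maxForks 1    // run at most one hello task at any time

    """
    python my_gpu_application.py
    """
}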

Version:

      N E X T F L O W
      version 20.04.1 build 5335
      created 03-05-2020 19:37 UTC
      cite doi:10.1038/nbt.3820
      http://nextflow.io
pditommaso commented 4 years ago

Google LS spawns a separate VM for each job, and therefore for each container, so I don't think it's related to that.

Moreover, note that the docker scope and containerOptions are ignored by the LS executor.
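For comparison, a sketch of what the process looks like if you rely on the accelerator directive alone under the LS executor; the containerOptions and docker runOptions from the original config are dropped here since they are ignored anyway. This is an assumption based on the comment above, not a verified configuration:

// main.nf -- sketch: GPU requested only via the accelerator directive
process hello {
    executor "google-lifesciences"
    machineType "n1-highmem-4"
    accelerator 1, type: "nvidia-tesla-t4"     // attaches one T4 GPU to the job's VM
    container "eu.gcr.io/project/image:latest"

    """
    nvidia-smi                    # sanity check: is the GPU visible inside the container?
    python my_gpu_application.py
    """
}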

r0f1 commented 4 years ago

Thank you for your response. I think the reason for the error I am getting is related to your last sentence. I suspect that even though I am able to spawn a VM with a GPU through Google LS, the Docker container is not able to use that GPU. I looked around but could not find an example of someone using a GPU with Google LS. Maybe I will write a short example to verify my hypothesis.

Or is there a way of passing the docker options and the containerOptions to Google LS? Or can you pass these arguments somehow inside the Dockerfile?

pditommaso commented 4 years ago

Maybe @moschetti @hnawar know more

hnawar commented 4 years ago

Hi Florian, I've tried creating a simple process that uses a GPU from Nextflow; I can run nvidia-smi and can see the GPU. I tried to run some simple Python code but ran into errors due to a missing cudatoolkit. I'll try to add that to my Dockerfile and see if I can get it to work. Here is my current Dockerfile:

FROM nvidia/cuda:10.2-base
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3.5 \
        python3-pip \
    && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*
RUN pip3 install numpy matplotlib
RUN pip3 install numba
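A minimal sketch of the kind of test process described above; the image name is a placeholder for wherever the Dockerfile is pushed:

// main.nf -- sketch: verify GPU visibility under the Google LS executor
process check_gpu {
    executor "google-lifesciences"
    machineType "n1-highmem-4"
    accelerator 1, type: "nvidia-tesla-t4"
    container "eu.gcr.io/project/cuda-test:latest"   // placeholder: image built from the Dockerfile above

    """
    nvidia-smi
    """
}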

hnawar commented 4 years ago

Actually, it works now. I just had to use nvidia/cuda:latest instead of 10.2-base. Here is my output, which comes from a modified version of an online mini benchmark of CPU vs GPU:

without GPU: 10.049912896000023
with GPU: 3.557580697999981

r0f1 commented 4 years ago

Thanks for your help. I really appreciate the fast responses and helpful comments. I got it to work now, although honestly I don't know what exactly did the trick. I am listing the changes I made, for future googlers:

- My Dockerfile now uses nvidia/cuda:10.0-cudnn7-devel.
- I am using AlexeyAB's darknet, which I compile inside the Dockerfile; the compilation now takes place in a separate build stage in the same file, and the results are then copied over.
- In my nextflow.config I defined a label and specified machineType = "n1-standard-8" under that label, and in my process definition I specified cpus 8, so that Nextflow is forced to use separate machines.
- The docker options and containerOptions I left unchanged.

Thanks!
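In case it helps future readers, a rough sketch of the label-based setup described above; the label name is illustrative, not copied from the actual pipeline:

// nextflow.config -- sketch: pin the machine type via a process label
process {
    withLabel: gpu_node {
        machineType = "n1-standard-8"
    }
}

// main.nf -- sketch: the process requests all 8 vCPUs of that machine type
process hello {
    executor "google-lifesciences"
    label "gpu_node"
    cpus 8    // fills the n1-standard-8, so each task ends up on its own machine
    accelerator 1, type: "nvidia-tesla-t4"
    container "eu.gcr.io/project/image:latest"

    """
    python my_gpu_application.py
    """
}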

hnawar commented 4 years ago

Just a quick note: I ran the above with the Google Life Sciences executor, not with the local Docker setup. When running locally with Docker, Nextflow will try to run multiple processes on the same machine to maximise CPU usage.