Markdup step TimeOut exiting with IHEC usage

paulstretenowich commented 4 years ago

Hi,

I'm using the pipeline as part of IHEC, with the following versions:

Singularity version 3.2.1-1.el7
IHEC Fork of grape-nf
Nextflow version 19.04.0.5069

The pipeline itself runs well, except for the mergeBam step, which does not always work. The markdup step, however, takes a very long time (I tried allowing up to 3 days) and ends with TIMEOUT. I noticed that the sambamba command is started but is "stuck", using 0% CPU (monitored with htop). I then tried the sambamba command defined in .command.sh both outside the container (it worked) and inside the container (it worked too). I don't know what's happening there. If you need me to add some logs, please tell me.

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

apologies for the late reply.

About mergeBam not always working, I suspect it is related to the open issue #48. I am going to fix it soon.

About markDup, it would be useful to get the process logs and Nextflow logs for the pipeline run. Another test you could do would be to manually launch the .command.run script from within the process folder and see if that works. Also, if you run the pipeline again, does the problem still arise? It looks like weird behavior...
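A minimal sketch of that manual test, assuming GNU coreutils timeout is available on the node. The sleep command is only a stand-in so the snippet runs anywhere; on the cluster it would be replaced by bash .command.run, executed from inside the task's work directory, with a much longer time limit.

```shell
# Bound the manual run so a hang shows up as an exit status instead of a stuck terminal.
# `sleep 5` is a placeholder for: bash .command.run  (run from the task's work directory)
status=0
timeout 1 sleep 5 || status=$?

# GNU timeout exits with 124 when the time limit was reached
if [ "$status" -eq 124 ]; then
    echo "command hit the time limit (hang suspected)"
else
    echo "command finished with exit status $status"
fi
```

An exit status of 124 means the limit was hit; any other value is the wrapped command's own exit status.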

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

Yes, the mergeBam issue is related to #48, which is why I haven't given you much more information about it.

About markDup, if I run .command.run outside of the pipeline I have the same issue. If I run .command.sh it works well. If I re-run the pipeline I also have the same issue; however, sometimes it works without changing anything. Here are the logs from a run: nextflow.log, command.err.txt, command.log, command.run.txt, command.sh.txt

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

I could not find anything useful in the logs.

Did you run .command.run and .command.sh locally or submit them via slurm? I am wondering whether the problem lies in running within a submitted job vs. locally, or whether the .command.run script has some incompatibility with your system.

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

When I use the pipeline I run it with slurm, but when I tried manually it was without slurm. In both cases it was with the singularity image. Running .command.sh worked both inside and outside the container, but when I run .command.run I get the timeout/0% CPU usage issue with or without slurm.

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

could you please try running the pipeline with the included small test dataset and the markdup profile? E.g.:

nextflow run grape-nf -profile markdup -with-singularity

Does the problem occur also in this case?

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

Testing with markdup profile on test dataset worked without issue.

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

thanks, that does not help much.

I just realized you are not using the latest version of Nextflow. Any chance you can run a test using that version?

nextflow -self-update

      N E X T F L O W
      version 19.10.0 build 5170
      created 21-10-2019 15:07 UTC (17:07 CEST)
      cite doi:10.1038/nbt.3820
      http://nextflow.io

If the problem persists, I would then suggest that you run the hanging job via .command.run and inspect the process tree to see what's going on. You could use top or ps to check. In case you need help, please just send me the output of the ps -faux command.
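A rough sketch of such an inspection, assuming a Linux node with the procps version of ps. The sleep process is only a placeholder for the hanging job; the selected columns are one possible choice, not the only useful one.

```shell
# Placeholder for the hanging job; on the cluster this would be: bash .command.run &
sleep 30 &
job_pid=$!
sleep 1  # give the process a moment to start

# STAT column: R = running, S = interruptible sleep, D = uninterruptible sleep (usually I/O wait)
job_state="$(ps -o stat= -p "$job_pid")"
echo "state of PID $job_pid: $job_state"

# Wider snapshot: parent PID, CPU share, elapsed time and full command line
ps -o pid,ppid,stat,pcpu,etime,args -p "$job_pid"

# Clean up the placeholder process
kill "$job_pid"
wait "$job_pid" 2>/dev/null || true
```

This only looks at a single PID. For a real .command.run launch, ps -faux gives the full tree, including the child processes started by the wrapper script.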

Another test would be to run the pipeline after adding trace.enabled = false to your local nextflow.config file and see whether the problem comes from that. I am not sure that's the case, as the test dataset runs without issues.
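For reference, a sketch of that configuration change. The scratch directory is hypothetical and only keeps the snippet self-contained; in practice the line would be added to the nextflow.config in the directory the pipeline is launched from.

```shell
# Hypothetical scratch location; replace with the directory the pipeline is launched from
launch_dir="$(mktemp -d)"
config_file="$launch_dir/nextflow.config"

# Append the setting so any existing configuration is preserved
cat >> "$config_file" <<'EOF'
// Turn off the execution trace report for this test
trace.enabled = false
EOF

cat "$config_file"
```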

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

Updating nextflow seems to solve the issue only when I run it locally; when I'm using slurm I still have the same issue. EDIT: The update solved the issue for 2 samples, but for the other 2 the issue remains even locally.

If it helps, the attached htop screenshot shows what's going on at the markdup step, and here is the corresponding ps-faux.txt output.

Changing the value of trace.enabled to false doesn't change anything...

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

it's a weird issue and it's hard to tell what's the cause. Maybe removing some of the complexity would help. Could you try running it without Singularity (e.g. with environment modules or conda)?

The best option would be to find a minimal dataset for which we can reproduce the issue.

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

Just to update you, I'm installing all the tools required for the pipeline to run and I will test without using singularity as you suggested. I will tell you if that changes anything with the issue.

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

any news regarding this issue?

Best, Emilio

paulstretenowich commented 4 years ago

Hi Emilio,

I moved to another cluster and that specific issue is not happening. It might be related to the infrastructure of the first cluster I tried. I'm waiting for an update of the file system and I will test again hoping it'll work then. I'll keep you posted on that.

On the other cluster where I'm running the pipeline, the only remaining issue is the mergeBam one, which you are fixing.

Thanks, Paul

emi80 commented 4 years ago

Hi Paul,

thanks for the update.

I'm closing this for now. Please feel free to reopen it after the file system update if needed.

Best, Emilio

guigolab / grape-nf

Markdup step TimeOut exiting with IHEC usage #53