hammerlab / ketrew

Keep Track of Experimental Workflows
http://www.hammerlab.org/docs/ketrew/master/index.html
Apache License 2.0

Lwt-async-exn: Unix.ENOMEM error #462

Open arahuja opened 8 years ago

arahuja commented 8 years ago

Very consistently seeing this when submitting many epidisco pipelines at once. It tends to happen when they all simultaneously hit the parallelized variant calling stage.

Lwt-async-exn: Unix.Unix_error(Unix.ENOMEM, "fork", "")

The danger here is that I have to restart the server while tasks are in progress. Many tasks create their output file before they are completed, which means that if I resubmit a task, it might be viewed as done and move on to the next stage using incomplete output. (Perhaps this is only true if I remove the database, which I often do, since after submitting ~50 or 1000 nodes jobs the 'Building' stage of a job is pretty long. Usually, there is a long lag between submitting the task (and observing the submission in the log at the bottom of the UI) and seeing it displayed as 'Building' in the table.)
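One common way to keep an interrupted task from leaving a "done-looking" output file is to write to a temporary path and rename it into place only on success. The sketch below illustrates that pattern in Python; it is not part of Ketrew or epidisco, and `run_with_atomic_output` and the `produce` callback are hypothetical names used only for illustration.

```python
import os
import tempfile


def run_with_atomic_output(final_path, produce):
    """Call produce(tmp_path); publish the result at final_path only on success.

    Hypothetical helper: `produce` stands in for whatever step writes the output.
    """
    directory = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    os.close(fd)
    try:
        produce(tmp_path)
        # os.replace is atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Failed or interrupted: remove the partial file so nothing looks complete.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern, a check of the form "does the output file exist?" only succeeds for outputs that were fully written. If the process is killed outright, a `.partial` file may be left behind, but the final path is never created.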

smondet commented 8 years ago

What else was on the machine?

I had that one before, but Ketrew wasn't the actual problem; other jobs on the same host had exhausted the memory (picard-mark-dups / mutect).
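To check whether the host itself is short on memory, one option on Linux is to look at `MemAvailable` in `/proc/meminfo`. The following is a minimal sketch under that assumption; `parse_meminfo` and `has_headroom` are hypothetical helper names, not Ketrew functions, and whether `fork` actually fails also depends on the kernel's overcommit settings.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def has_headroom(meminfo_text, needed_kb):
    """Return True if MemAvailable (in kB) is at least needed_kb."""
    return parse_meminfo(meminfo_text).get("MemAvailable", 0) >= needed_kb
```

On a Linux host this could be used as `has_headroom(open("/proc/meminfo").read(), 4 * 1024 * 1024)` to ask whether roughly 4 GiB is still available before launching more memory-heavy jobs.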