justanhduc / task-spooler

A scheduler for GPU/CPU tasks
https://justanhduc.github.io/2021/02/03/Task-Spooler.html
GNU General Public License v2.0

ts -F stochastically crashes the server #37

Closed. orsharir closed this issue 1 year ago.

orsharir commented 1 year ago

Occasionally (roughly once every 50-100 commands), the task spooler server crashes after a ts -F <id> command. I'm not certain that this command is the only trigger, but since I use it often, it is usually right after calling ts -F that I notice the server has crashed. It crashes silently, so I don't have much more to add. Looking at /tmp/socket.ts.error, I see the following warnings:

-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:23 2023
pid 12532
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:27 2023
pid 12528
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:29 2023
pid 12534
type CLIENT
-------------------Warning
 Msg: Sending a message to 3, sent -1 bytes, should send 48.
 errno 32, "Broken pipe"
msgdump:
 ENDJOB
date Thu Feb  2 13:31:33 2023
pid 12530
type CLIENT
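
For context, each warning above reports a send that returned -1 with errno 32, which is EPIPE ("Broken pipe") on Linux. That usually means the peer had already closed its end of the socket, which is consistent with the server being gone by the time the client tried to send ENDJOB. The sketch below is not task-spooler code; it only reproduces the same errno on a Unix socket pair. The function name send_to_closed_peer and the 48-byte payload are hypothetical, chosen to mirror the log.

```python
import errno
import socket


def send_to_closed_peer(payload: bytes):
    """Send on a Unix socket after the peer has closed it.

    Returns (bytes_sent, errno). On failure, bytes_sent is -1 to mirror
    the "sent -1 bytes" wording in the warning log.
    """
    # A connected pair of AF_UNIX stream sockets (hypothetical stand-ins
    # for the client and server ends).
    client, server = socket.socketpair()
    server.close()  # simulate the server going away
    try:
        return client.send(payload), 0
    except OSError as exc:
        # Python ignores SIGPIPE by default, so the failure surfaces as
        # an exception carrying the errno instead of killing the process.
        return -1, exc.errno
    finally:
        client.close()


if __name__ == "__main__":
    sent, err = send_to_closed_peer(b"\x00" * 48)
    name = errno.errorcode.get(err, "no error")
    print(f"sent {sent} bytes, errno {err} ({name})")
```

If this reading is right, the warnings are a symptom of the crash rather than its cause: the clients are only reporting that the server was no longer listening.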

I was previously using a version built from source at a November commit, but the behavior is the same on the latest commit on the main branch. The only thing that changed in my environment is a recent system upgrade (on an NVIDIA DGX machine), so perhaps this is due to newer NVIDIA drivers or some other updated package this program relies on?

BTW, this is a wonderful program that is integral to my daily work as a researcher. It's just the right level of task management that I needed.

justanhduc commented 1 year ago

Hi @orsharir. Thanks for reporting this. I fixed this problem some time ago but forgot to push it. The problem was due to a string overflow. The fix is on my laptop at home, so I will push it tonight.

justanhduc commented 1 year ago

@orsharir I pushed the fix. Please check and let me know if it works!

justanhduc commented 1 year ago

Fixed via 3ff7e680.