justanhduc / task-spooler

A scheduler for GPU/CPU tasks
https://justanhduc.github.io/2021/02/03/Task-Spooler.html
GNU General Public License v2.0

Separate logging and queueing? #34

Closed: ShuyangCao closed this issue 1 year ago

ShuyangCao commented 1 year ago

Thanks again for your work. Your tool helps me push out a lot of great work. Feel free to check out my website.

Recently, our workstation has had an unstable connection with the GPUs (it might be a driver issue). Basically, nvidia-smi would return

Unable to determine the device handle for GPU 0000:4C:00.0: Unknown Error

When this issue occurs, the ts server breaks down and restarts. Access to the previous session is lost: the jobs launched by the previous ts session are still running, but we can no longer track their logging output with ts -t.

I guess it might be better to separate logging and queueing, so that the logging module does not depend on the GPU status and can still work when a GPU error occurs.

Thanks!

justanhduc commented 1 year ago

Hey @ShuyangCao. Nicely done! Keep up the good work using ts!

In fact, logging is handled by the client, not the server, so you can still see your progress via the offline log files in /tmp; -t/-c simply read from these files. You can sort the files (e.g., by modification time) to find the one you want. When the server crashes, all information about the jobs is gone, so it's impossible to use -t/-c [jobid] anymore.
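For what it's worth, here is a minimal sketch of that mechanism, showing why the logs survive a server crash. This is not the actual ts source; the /tmp/ts-out.XXXXXX name and the mkstemp/dup2 pattern are assumptions about the default setup:

```c
/* Minimal sketch (not the actual ts source): the client that launches the
 * job owns the output file, so the file keeps growing even if the server
 * process dies.  The /tmp/ts-out.XXXXXX name is an assumption here. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char outfname[] = "/tmp/ts-out.XXXXXX";  /* template filled in by mkstemp */
    int outfd = mkstemp(outfname);
    if (outfd == -1) {
        perror("mkstemp");
        return 1;
    }
    printf("job output goes to %s\n", outfname);

    pid_t pid = fork();
    if (pid == 0) {
        /* child: redirect stdout/stderr into the log file, then run the job;
         * from here on the file does not depend on any server process */
        dup2(outfd, STDOUT_FILENO);
        dup2(outfd, STDERR_FILENO);
        close(outfd);
        execlp("sh", "sh", "-c", "echo hello from the job", (char *)NULL);
        _exit(127);
    }
    close(outfd);
    waitpid(pid, NULL, 0);  /* `ts -t` is essentially a tail of outfname */
    return 0;
}
```

Because the file lives on disk and is written by the job process itself, it stays readable in /tmp even when -t/-c [jobid] no longer works.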

Above all, the crash should not happen in the first place. It is probably due to poor error handling in the GPU query. I pushed a simple fix in the gpu-err branch. Please check whether the error is handled now.

Btw, there's a log file in /tmp whose name has the format socket-ts.<uid>.error. Could you please let me know the error message that the server gave when the query was unsuccessful?

ShuyangCao commented 1 year ago

Thanks! Yes, I can still check the files in /tmp, but I am not sure which job each file corresponds to.

The error messages are:

-------------------Error
 Msg: Error calling recv_msg in c_check_version
 errno 104, "Connection reset by peer"
date Tue Dec 20 12:38:25 2022
pid 82645
type CLIENT
-------------------Error
 Msg: Failed to get GPU handle for GPU 3: GPU is lost
 errno 2, "No such file or directory"
date Wed Dec 21 07:04:18 2022
pid 4351
type SERVER

justanhduc commented 1 year ago

Thanks for the reply @ShuyangCao. The first error is probably caused by a message from an orphaned client being sent to a restarted server; this is harmless in most cases. The second error makes the query return a NULL pointer, which causes the server crash you experienced. The patch I pushed should be able to handle this error.
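For anyone hitting this before the fix lands, the idea is the defensive pattern below. This is a hedged sketch using plain NVML calls, not the actual code in the gpu-err branch; the function name count_usable_gpus and the 1 GiB free-memory threshold are made up for illustration:

```c
/* Sketch of defensive GPU querying: check every NVML return code and skip
 * a device whose handle cannot be obtained (e.g. NVML_ERROR_GPU_IS_LOST)
 * instead of using an invalid handle and crashing the server. */
#include <stdio.h>
#include <nvml.h>

/* Hypothetical helper: count GPUs with at least `min_free` bytes free,
 * tolerating lost/unreachable devices. */
static unsigned int count_usable_gpus(unsigned long long min_free)
{
    unsigned int n = 0, usable = 0;

    if (nvmlDeviceGetCount(&n) != NVML_SUCCESS)
        return 0;

    for (unsigned int i = 0; i < n; i++) {
        nvmlDevice_t dev;
        nvmlMemory_t mem;
        nvmlReturn_t ret = nvmlDeviceGetHandleByIndex(i, &dev);

        if (ret != NVML_SUCCESS) {
            /* e.g. "GPU is lost": log and skip rather than bring
             * the whole server down */
            fprintf(stderr, "skipping GPU %u: %s\n", i, nvmlErrorString(ret));
            continue;
        }
        if (nvmlDeviceGetMemoryInfo(dev, &mem) == NVML_SUCCESS &&
            mem.free >= min_free)
            usable++;
    }
    return usable;
}

int main(void)
{
    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    printf("usable GPUs: %u\n", count_usable_gpus(1ULL << 30));
    nvmlShutdown();
    return 0;
}
```

Compile with something like gcc -lnvidia-ml; the key point is simply that every NVML call is checked before its result is used.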

Please let me know if there's still any problem.