Closed nareshov closed 11 years ago
Confirming the bug. The deadlock is caused by an unhandled exception from softlayer_objectstorage in the upload threads, which makes them exit while the main loop is still waiting for the work to finish. Since the workers have died, the main loop waits forever.
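To illustrate the failure mode described above, here is a minimal sketch of a worker loop that cannot leave the main loop hanging. It is not the actual slbackup.py code: `flaky_upload`, `worker`, and `run` are hypothetical names, and `flaky_upload` only simulates a disconnect instead of calling softlayer_objectstorage. The idea is that every item taken from the queue is marked done in a `finally` block, so `jobs.join()` returns even when an upload raises.

```python
import queue
import threading


def flaky_upload(item):
    # Hypothetical stand-in for an object-storage upload call.
    # Every third item fails, to simulate a dropped connection.
    if item % 3 == 0:
        raise ConnectionError("simulated disconnect")
    return item


def worker(jobs, done, failed, upload):
    while True:
        item = jobs.get()
        try:
            if item is None:  # sentinel: no more work for this thread
                return
            done.append(upload(item))
        except Exception as exc:
            # Record the failure instead of letting the thread die.
            failed.append((item, exc))
        finally:
            # Always runs, so jobs.join() cannot wait on a dead worker.
            jobs.task_done()


def run(items, upload, num_threads=4):
    jobs = queue.Queue()
    done, failed = [], []
    threads = [
        threading.Thread(target=worker, args=(jobs, done, failed, upload))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    jobs.join()  # returns once every item has been marked done
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done, failed


if __name__ == "__main__":
    done, failed = run(range(1, 10), flaky_upload)
    print("uploaded:", sorted(done))
    print("failed:", sorted(i for i, _ in failed))
```

Failed items end up in `failed` rather than being lost, so the caller could requeue them or report them after the run.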
Do you get disconnected a lot?
I haven't had many runs of this script yet, but so far, it looks like half my runs are getting stuck.
Would you mind trying the issue-7 branch?
Sure: I tried it concurrently from three different hosts. During the first run, 2 out of 3 failed: https://gist.github.com/20a2a5889e45a5becf14 (I noticed that `ps -ef | grep slbackup` showed all of them in a defunct state, so I Ctrl-C'd them myself.)
I retried on the two hosts; one of them succeeded while the other failed: https://gist.github.com/5d40dc2c28e0123e4bdc (I didn't Ctrl-C this time; those messages showed up by themselves.)
On a related note, are we hitting any limitations such as the number of object-storage calls that can be made from a server or servers belonging to an account?
I was finally able to produce a disconnect using 200 threads uploading the Linux kernel tarball. I don't see the "requeuing" message in our output like this, though. The exception will still show up for now; I will squash it as soon as I know this issue is resolved.
I went ahead and merged it into master if you want to grab the latest version from master.
Did getting the master branch of this help you at all?
Hey,
I've just deployed the master branch. I'll keep an eye on it and let you know if I see any issues. On the plus side, with the issue-7 branch's slbackup.py, the process hasn't remained in a defunct/hung state for more than a day over the past five days (it is set up as a daily cron job).
I'll leave this open for a week then close it. Good to hear!
Here's a traceback captured by cron: https://gist.github.com/08e637f8550fa14e2a12
This happens sometimes -- any more information I could furnish you with?
Thanks