softlayer / softlayer-object-storage-backup


slbackup script hangs up sometimes in defunct state #7

Closed nareshov closed 11 years ago

nareshov commented 11 years ago

Here's a traceback captured by cron: https://gist.github.com/08e637f8550fa14e2a12

This happens sometimes -- any more information I could furnish you with?

Thanks

CrackerJackMack commented 11 years ago

Confirming the bug. The deadlock is caused by an unhandled exception from softlayer_objectstorage in the upload threads, which makes them exit while the main loop is still waiting for work to finish. Since the workers have died, the main loop waits forever.

Do you get disconnected a lot?
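For readers following along, the failure mode described above can be sketched roughly as follows. This is not the code from slbackup.py; `upload`, `worker`, and `run` are hypothetical names, and the simulated `ConnectionError` stands in for whatever softlayer_objectstorage actually raises. The idea is that a worker catches the exception and always calls `task_done()`, so the main thread's `join()` can return even when an upload fails.

```python
import queue
import threading


def upload(item):
    # Hypothetical stand-in for the real upload call; one item always fails.
    if item == "bad":
        raise ConnectionError("simulated disconnect")
    return item


def worker(jobs, done, failed):
    while True:
        item = jobs.get()
        try:
            if item is None:  # sentinel: stop this worker
                return
            done.append(upload(item))
        except Exception as exc:
            # Record the failure instead of letting the thread die.
            failed.append((item, exc))
        finally:
            # Always mark the job finished so jobs.join() can return.
            jobs.task_done()


def run(items, num_workers=4):
    jobs = queue.Queue()
    done, failed = [], []
    threads = [
        threading.Thread(target=worker, args=(jobs, done, failed))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    jobs.join()  # returns once every job has been marked done
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return done, failed
```

Without the `except`/`finally` pair, the first failed upload would kill its thread before `task_done()` ran, and `jobs.join()` would block forever, which matches the hang reported here.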

nareshov commented 11 years ago

> Do you get disconnected a lot?

I haven't had many runs of this script yet, but so far, it looks like half my runs are getting stuck.

CrackerJackMack commented 11 years ago

Would you mind trying the issue-7 branch?

nareshov commented 11 years ago

Sure: I tried it concurrently from three different hosts. During the first run, 2 out of 3 failed: https://gist.github.com/20a2a5889e45a5becf14 (I noticed that `ps -ef | grep slbackup` showed all of them in a defunct state, so I Ctrl-C'd them myself.)

I retried on the two hosts; one of them succeeded while the other failed: https://gist.github.com/5d40dc2c28e0123e4bdc (I didn't Ctrl-C this time; those messages showed up by themselves.)

nareshov commented 11 years ago

On a related note, are we hitting any limitations such as the number of object-storage calls that can be made from a server or servers belonging to an account?

CrackerJackMack commented 11 years ago

I was finally able to produce a disconnect using 200 threads uploading the Linux kernel tarball. I don't see the "requeuing" message in your output, though, as it appears here. The exception will still show up for now; I will squash it as soon as I know this issue is resolved.

https://gist.github.com/3892593
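As an illustration of what "requeuing" after a disconnect could look like, here is a minimal sketch. It is an assumption about the general technique, not the code in the issue-7 branch: `MAX_ATTEMPTS`, `make_flaky_upload`, `worker`, and `run` are all hypothetical, and the fake upload only simulates disconnects. A failed job is put back on the queue with an attempt counter, and is abandoned once the limit is reached so a permanently failing upload cannot loop forever.

```python
import queue
import threading

MAX_ATTEMPTS = 3  # hypothetical retry limit


def make_flaky_upload(failures_before_success):
    """Return a fake upload that fails a fixed number of times per item."""
    lock = threading.Lock()
    seen = {}

    def upload(item):
        with lock:
            seen[item] = seen.get(item, 0) + 1
            count = seen[item]
        if count <= failures_before_success:
            raise ConnectionError("simulated disconnect")
        return item

    return upload


def worker(jobs, upload, done, gave_up):
    while True:
        job = jobs.get()
        try:
            if job is None:  # sentinel: stop this worker
                return
            item, attempt = job
            try:
                done.append(upload(item))
            except ConnectionError:
                if attempt < MAX_ATTEMPTS:
                    # Requeue before task_done() so join() cannot return early.
                    jobs.put((item, attempt + 1))
                else:
                    gave_up.append(item)
        finally:
            jobs.task_done()


def run(items, upload, num_workers=4):
    jobs = queue.Queue()
    done, gave_up = [], []
    threads = [
        threading.Thread(target=worker, args=(jobs, upload, done, gave_up))
        for _ in range(num_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put((item, 1))  # every job starts at attempt 1
    jobs.join()
    for _ in threads:
        jobs.put(None)
    for t in threads:
        t.join()
    return done, gave_up
```

Putting the job back on the queue before calling `task_done()` keeps the queue's unfinished-task count above zero, so the main thread keeps waiting until the retry has actually been processed.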

CrackerJackMack commented 11 years ago

I went ahead and merged it into master if you want to grab the latest version from master.

CrackerJackMack commented 11 years ago

Did grabbing the master branch help at all?

nareshov commented 11 years ago

Hey,

I've just deployed the master branch. I'll keep an eye on it and let you know if I see any issues. On the plus side, with the issue-7 branch's slbackup.py, in the past five days the process hasn't remained in a defunct/hung state for more than a day (it's set up as a daily cron job).

CrackerJackMack commented 11 years ago

I'll leave this open for a week then close it. Good to hear!