nimbusproject / nimbus

Nimbus - Open Source Cloud Computing Software - 100% Apache2 licensed
http://www.nimbusproject.org/

lantorrent corruption #81

Closed bc-umigs closed 12 years ago

bc-umigs commented 12 years ago

We have noticed an issue with lantorrent that causes it to hang and become unresponsive. The situation can be reproduced by repeating the following steps:

- Start up an instance/cluster, either using the cloud-client or the EC2 API.
- Issue a terminate before the propagation has completed.
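For reference, the two steps can be scripted; a rough sketch, assuming the standard cloud-client options (the client path, image name, and handle below are placeholders, and the handle must be taken from the --run output):

import subprocess
import time

# Placeholders (not from this report): adjust the client path and image
# name to your install; the handle must come from the --run output.
CLOUD_CLIENT = "/opt/cloud-client/bin/cloud-client.sh"
IMAGE = "some-image.raw"
HANDLE = "vm-001"

# Step 1: start an instance (in the background so it can be interrupted).
run = subprocess.Popen([CLOUD_CLIENT, "--run", "--name", IMAGE, "--hours", "1"])

# Step 2: issue a terminate before propagation has completed.
time.sleep(5)
subprocess.check_call([CLOUD_CLIENT, "--terminate", "--handle", HANDLE])
run.wait()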

After this occurs, the database is left with stale records that never seem to get updated. If this happens after a large termination, or after repeated executions of the above process, lantorrent stops copying any new requests. The lantorrent daemon log shows this error:

2011-11-21 11:37:27,422 - WARNING - Stack trace
2011-11-21 11:37:27,423 - WARNING - ===========
2011-11-21 11:37:27,423 - WARNING - Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 114, in send
    self._write_to_socket(data)
  File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 99, in _write_to_socket
    self.socket.sendall(data)
  File "<string>", line 1, in sendall
error: [Errno 104] Connection reset by peer

2011-11-21 11:37:27,423 - WARNING - ===========
2011-11-21 11:37:27,423 - WARNING - <class 'socket.error'>
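The stale records suggest the failure during sendall never gets recorded against the request. Purely as an illustration (this is not the actual pylantorrent code; mark_request_failed is a hypothetical hook), the send path could catch the reset and flag the row instead of leaving it behind:

import errno
import socket

def write_to_socket(sock, data, request_id, mark_request_failed):
    # Sketch: push the bytes, and on a peer reset record the failure so
    # the request row does not linger in req.db as a stale entry.
    try:
        sock.sendall(data)
    except socket.error as e:
        if e.errno == errno.ECONNRESET:
            mark_request_failed(request_id, str(e))
        raise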

If we clear out the req.db file and restart lantorrent, all seems to be well again. This is occurring in our production environment running 2.7, as well as in devel, which is at 2.8.
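Before resorting to deleting req.db, the stale rows can at least be inspected directly; a minimal sketch, assuming req.db is a SQLite file (the path below is an example, and the schema should be confirmed from sqlite_master rather than guessed):

import sqlite3

DB_PATH = "/opt/nimbus/var/run/lantorrent/req.db"  # example path; adjust

conn = sqlite3.connect(DB_PATH)
try:
    # List the tables first so the actual schema can be confirmed
    # before querying for stale entries.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    print("tables: %s" % [str(r[0]) for r in rows])
finally:
    conn.close()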

Please let me know what other information you need in order to help troubleshoot this.

Thanks, Brian

labisso commented 12 years ago

Thanks for the report. We will check this out.

buzztroll commented 12 years ago

This issue sounds a lot like: https://github.com/nimbusproject/nimbus/issues/39, which should be solved in 2.8. So the behavior you are describing is expected on the 2.7 install but not the 2.8 one. Would it be possible to see more of your log? I was expecting to see a line with the string "send error " in it.

igs-jeff commented 12 years ago

patched both daemon.py & client.py.

cleared all lantorrent log files

launched several instances & terminated all of them immediately.

only ltrequest.log exists thus far, no ltdaemon.log yet; in ltrequest.log, all I get is:

2011-12-08 12:01:55,863 - INFO - checking for done on f2b710e6-21b9-11e1-b251-e41f13b8e0b8
2011-12-08 12:01:55,871 - INFO - checking for done on f251d744-21b9-11e1-acc8-e41f13b8ed90
2011-12-08 12:01:55,875 - INFO - checking for done on f29331c6-21b9-11e1-b356-e41f13b8968c
2011-12-08 12:01:55,933 - INFO - enter
2011-12-08 12:01:55,944 - INFO - checking for done on f24fabe0-21b9-11e1-a542-e41f13b8f7f0
2011-12-08 12:01:55,956 - INFO - enter
2011-12-08 12:01:55,967 - INFO - checking for done on f28d39ec-21b9-11e1-94c8-e41f13b8df98
2011-12-08 12:02:27,470 - INFO - enter
2011-12-08 12:02:27,473 - INFO - enter
2011-12-08 12:02:27,475 - INFO - enter
2011-12-08 12:02:27,481 - INFO - enter
2011-12-08 12:02:27,481 - INFO - enter
2011-12-08 12:02:27,482 - INFO - checking for done on f2aacfa2-21b9-11e1-b8ba-e41f13b877c4
2011-12-08 12:02:27,487 - INFO - checking for done on f29331c6-21b9-11e1-b356-e41f13b8968c
2011-12-08 12:02:27,488 - INFO - checking for done on f2419dca-21b9-11e1-818d-e41f13b8f2cc
2011-12-08 12:02:27,494 - INFO - checking for done on f251d744-21b9-11e1-acc8-e41f13b8ed90
2011-12-08 12:02:27,497 - INFO - enter
2011-12-08 12:02:27,512 - INFO - checking for done on f28d39ec-21b9-11e1-94c8-e41f13b8df98
2011-12-08 12:02:27,516 - INFO - checking for done on f24fabe0-21b9-11e1-a542-e41f13b8f7f0
2011-12-08 12:02:27,530 - INFO - enter
2011-12-08 12:02:27,542 - INFO - enter
2011-12-08 12:02:27,545 - INFO - checking for done on f2b710e6-21b9-11e1-b251-e41f13b8e0b8

ps & grep for lant on all VMMs showed 3 lantorrent python processes per instance.

stopped and restarted lantorrent ...

now ltdaemon.log shows the exact error from the 1st post; here it is again:

2011-12-08 12:08:20,959 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepoVMS7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer

2011-12-08 12:08:20,959 - WARNING - Stack trace
2011-12-08 12:08:20,959 - WARNING - ===========
2011-12-08 12:08:20,959 - WARNING - Traceback (most recent call last):
  File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 114, in send
    self._write_to_socket(data)
  File "build/bdist.linux-x86_64/egg/pylantorrent/ltConnection.py", line 99, in _write_to_socket
    self.socket.sendall(data)
  File "<string>", line 1, in sendall
error: [Errno 104] Connection reset by peer

2011-12-08 12:08:20,960 - WARNING - ===========
2011-12-08 12:08:20,960 - WARNING - <class 'socket.error'>
2011-12-08 12:08:20,960 - WARNING - bad data: {"code": 503, "md5sum": "", "id": "f251d744-21b9-11e1-acc8-e41f13b8ed90", "host": "172.20.101.11", "file": "/secureimages/wrksp-180/tmpSRa4FbRepoVMS7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw", "message": "The output file could not be opened [Errno 2] No such file or directory: u'/secureimages/wrksp-180/tmpSRa4FbRepoVMS7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw.lantorrent'", "port": 2893}
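The "bad data" message above suggests the per-workspace directory (/secureimages/wrksp-180) had already been removed by the terminate, so the daemon could not open its .lantorrent temp file. As a sketch only (open_output is a hypothetical helper, not the daemon's real code), the receiving side could treat ENOENT as a cancelled transfer rather than bad data:

import errno

def open_output(path):
    # Sketch: lantorrent appears to write to <path>.lantorrent and rename
    # on completion (note u'rename': True in the log above). If the
    # workspace directory was removed by the terminate, ENOENT means the
    # transfer target is gone and can be treated as cancelled.
    try:
        return open(path + ".lantorrent", "wb")
    except IOError as e:
        if e.errno == errno.ENOENT:
            return None
        raise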

grep for the "send error" string:

2011-12-08 12:08:20,959 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepoVMS7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer
2011-12-08 12:10:03,892 - WARNING - send error 506 172.20.101.11:2893[{u'rename': True, u'id': u'f251d744-21b9-11e1-acc8-e41f13b8ed90', u'filename': u'/secureimages/wrksp-180/tmpSRa4FbRepoVMS7ca46e32-eaad-11e0-89f3-0019bb33e0ee__clovr-standard-2011-07-01-03-00-31.raw'}] A connection error occured on send 172.20.101.11:2893 [Errno 104] Connection reset by peer

all the lantorrent processes still exist on the VMMs.

database is corrupted again.

igs-jeff commented 12 years ago

After patching daemon.py in ve, if I:

- launch 10 instances and terminate immediately, 3 times
- launch 10 more instances without terminating; the status stays in pending

If you monitor lantorrent processes across all VMMs, you will see 120 initially (3 processes x 10 instances x 4 launches), eventually dropping down to 0 (takes about 20 minutes or so).

on service node (/var/log/messages):

Dec 16 09:57:36 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:36 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047456-18401 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18419 to /var/spool/abrt/pyhook-1324047457-18419
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18419' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18419 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18429 to /var/spool/abrt/pyhook-1324047457-18429
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18429' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18429 (res:4), deleting
Dec 16 09:57:37 grinch python: abrt: detected unhandled Python exception in /opt/nimbus-2.8/ve/bin/ltrequest
Dec 16 09:57:37 grinch abrtd: dumpsocket: New client connected
Dec 16 09:57:37 grinch abrtd: dumpsocket: Saved Python crash dump of pid 18449 to /var/spool/abrt/pyhook-1324047457-18449
Dec 16 09:57:37 grinch abrtd: dumpsocket: Socket client disconnected
Dec 16 09:57:37 grinch abrtd: Directory 'pyhook-1324047457-18449' creation detected
Dec 16 09:57:37 grinch abrtd: Executable '/opt/nimbus-2.8/ve/bin/ltrequest' doesn't belong to any package
Dec 16 09:57:37 grinch abrtd: Corrupted or bad crash /var/spool/abrt/pyhook-1324047457-18449 (res:4), deleting

at this point, I cannot start any more instances. Restarting the service node fixes the above error.

After the service node restarted, it still showed the last 10 instances in the pending state.

I can then terminate these 10 instances and launch new ones.
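Given the abrt reports above of unhandled Python exceptions in ltrequest, a top-level guard would at least turn those crashes into logged errors; a minimal sketch (the lambda stands in for the script's real entry point):

import logging
import sys
import traceback

def guarded_main(main):
    # Log the full traceback and exit nonzero instead of letting the
    # exception escape to abrtd as an unhandled crash.
    try:
        return main()
    except SystemExit:
        raise
    except Exception:
        logging.error("unhandled exception:\n%s" % traceback.format_exc())
        return 1

if __name__ == "__main__":
    logging.basicConfig(level=logging.ERROR)
    sys.exit(guarded_main(lambda: 0))  # stand-in for the real entry point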

buzztroll commented 12 years ago

This specific issue appears to be fixed. We will wait for verification from testing on the next RC.