Closed by ChrisThibodeaux 2 months ago
Correction to the setup steps in my post. This error is also happening when using a single VMSS instance. All other info applies.
Found the issue. In the `stop(self)` method of az.py, this line was deleting the machine from the database before the AnalysisManager could release it here. I suppose this only happened when the last task was taken, because the Azure machinery was not attempting to scale down until there were no relevant tasks remaining.
I will have a fix for this in the PR bringing azure machinery up to date.
Hey Chris, what's up? We are facing the same issue but with aws.py. This is the error we got: "sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched"
"Traceback (most recent call last): File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run self.launch_analysis() File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis success = self.perform_analysis() File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary(): File "/usr/lib/python3.10/contextlib.py", line 142, in exit next(self.gen) File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running self.machinery_manager.machinery.release(self.machine) File "/opt/CAPEv2/modules/machinery/aws.py", line 339, in release self._start_or_create_machines() File "/opt/CAPEv2/modules/machinery/aws.py", line 231, in _start_or_create_machines current_available_machines = self.db.count_machines_available() File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1002, in count_machines_available return machines.count() "
Do you think it's the same issue you faced? If so, can you share the solution you came up with?
@dor15 I think it is exactly like the issue I faced. You are going to want to look at aws.py and review the two methods `stop` and `release`. In essence, if you are removing the machine in `stop`, you are going to have a bad time. That must occur in `release`. Should this be the actual issue, the fix is to move machine deletion into the `release` method and out of `stop`.
This is where stop and release are happening in the flow of a sandbox run: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/analysis_manager.py#L292-L341
Notice that the stop occurs first by calling `self.machinery_manager.stop_machine(self.machine)`, which is just this:

def stop_machine(self, machine: Machine) -> None:
    self.machinery.stop(machine.label)
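To make that ordering concrete, here is a rough sketch of the shape of `machine_running` (not the actual CAPEv2 code, names simplified); the point is only that `stop()` runs before `release()`, so anything `stop()` deletes is gone by the time `release()` needs it:

from contextlib import contextmanager

@contextmanager
def machine_running(machinery_manager, machine):
    """Simplified sketch of AnalysisManager.machine_running (not the real code)."""
    machinery_manager.start_machine(machine)
    try:
        yield
    finally:
        # 1. stop() runs first; if the machinery deletes the Machine row here...
        machinery_manager.stop_machine(machine)
        # 2. ...then release() is handed a Machine whose database row is already
        #    gone, and the ORM's UPDATE matches 0 rows -> StaleDataError.
        machinery_manager.machinery.release(machine)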
You can see the fix that I made for the azure machinery: before after
Honestly, I'm surprised you are running into that now and no one else has reported it before. If you do happen to find the problem and the above fixes it for you, please create a pull request with the changes. It goes a long way!
maybe it's only related to cloud providers? i don't have that issue with kvm
@doomedraven I think it is most likely how az/aws.py have chosen to deal with autoscaling. The az module was having issues because it was deleting the machine from the database in the `stop` call. Looking at aws.py, I see that only happens when `self._is_autoscaled` is true for the machine it's called on. When the AnalysisManager tells the MachineryManager to release the machine, it has already been removed from the database, so the call using that machine's ID will result in stale data errors.
I don't think any other machinery module explicitly removes machines from the database as part of the normal flow in `stop`. Thus, when the MachineryManager attempts to release it, the machine is still present in the database.
@dor15 I would be very interested in knowing a bit more about your setup. Have I correctly guessed that you are using autoscaling?
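For anyone following along, here is a minimal, self-contained sketch of that failure mode (not CAPEv2 code; the `Machine` model below is a stand-in): the row is deleted out from under an object the ORM still tracks, so the flush's UPDATE matches zero rows and SQLAlchemy raises StaleDataError.

from sqlalchemy import Column, Integer, String, create_engine, text
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Machine(Base):  # stand-in for CAPEv2's Machine model
    __tablename__ = "machines"
    id = Column(Integer, primary_key=True)
    label = Column(String)
    locked = Column(Integer, default=1)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Machine(id=1, label="i-0123456789abcdef0", locked=1))
    session.commit()

    machine = session.get(Machine, 1)
    # stop() deleting the row is, in effect, this happening mid-analysis...
    session.execute(text("DELETE FROM machines WHERE id = 1"))
    # ...so when release() later tries to update the same machine:
    machine.locked = 0
    session.commit()
    # sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines'
    # expected to update 1 row(s); 0 were matched.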
so maybe we need to backport that aws feature to az?
i'm gonna keep this open for some time till we see what we're gonna do, as it's summer and i don't have a lot of time to properly concentrate on big tasks here
More than that: in the original code, in `def _delete_machine_form_db(self, label):`, it tries to create a new session with `session = db.Session()`, and it got the following exception:
"/opt/CAPEv2/modules/machinery/aws.py" 435L, 16769B 107,18 21% with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary(): File "/usr/lib/python3.10/contextlib.py", line 142, in exit next(self.gen) File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 325, in machine_running self.machinery_manager.stop_machine(self.machine) File "/opt/CAPEv2/lib/cuckoo/core/machinery_manager.py", line 308, in stop_machine self.machinery.stop(machine.label) File "/opt/CAPEv2/modules/machinery/aws.py", line 300, in stop self._delete_machine_form_db(label) File "/opt/CAPEv2/modules/machinery/aws.py", line 107, in _delete_machine_form_db session = db.Session() NameError: name 'db' is not defined
So I used the existing session via `self.db.session` to proceed with running the code, and then I got the error I described above. I agree with you, it is a bit weird that no one has reported at least this issue before.
Yes, I'm using autoscaling. From aws.conf:

[autoscale]
# Enable auto-scale in cuckoo (by setting autoscale = yes). Recommended for better performance.
autoscale = yes
@dor15 I don't see `session = db.session` in the file at this line. I would make sure you are on the most current commit before going any further.
Anyway, try using this and see if it fixes the problem.
def stop(self, label):
    """
    Stops a virtual machine.
    If the machine has initialized from autoscaled component, then terminate it.
    @param label: virtual machine label.
    @raise CuckooMachineError: if unable to stop.
    """
    log.debug("Stopping vm %s" % label)
    status = self._status(label)
    if status == AWS.POWEROFF:
        raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)
    if self._is_autoscaled(self.ec2_machines[label]):
        self.ec2_machines[label].terminate()
    else:
        self.ec2_machines[label].stop(Force=True)
        self._wait_status(label, AWS.POWEROFF)
        self._restore(label)
"""override Machinery method"""
def release(self, machine: Machine) -> Machine:
"""
we override it to have the ability to run start_or_create_machines() after unlocking the last machine
Release a machine.
@param label: machine label.
"""
retval = super(AWS, self).release(machine)
if self._is_autoscaled(self.ec2_machines[machine.label]):
self._delete_machine_form_db(machine.label)
self.dynamic_machines_count -= 1
self._start_or_create_machines()
return retval
I don't know what `self.ec2_machines[label].terminate()` does, so I left it alone. I shifted down the machine deletion and count reduction to immediately after the machine's release.
Yes, I meant this line: https://github.com/kevoreilly/CAPEv2/blob/d0732ee9d4b01184f86136336ee4ed5af98c34ca/modules/machinery/aws.py#L108 (`session = self.db.Session()`).
Instead of that, I use `self.db.session` before any session action, for example `machine = session.query(Machine).filter_by(label=label).first()` becomes `machine = self.db.session.query(Machine).filter_by(label=label).first()`.
I'll try to use your fix and update. Thanks!
trying to apply your fix and now getting a new error:
2024-08-28 09:02:48,881 [lib.cuckoo.core.analysis_manager] ERROR: Task #100: failure in AnalysisManager.run
Traceback (most recent call last):
  File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run
    self.launch_analysis()
  File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis
    success = self.perform_analysis()
  File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis
    with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running
    self.machinery_manager.machinery.release(self.machine)
  File "/opt/CAPEv2/modules/machinery/aws.py", line 322, in release
    self._start_or_create_machines()
  File "/opt/CAPEv2/modules/machinery/aws.py", line 207, in _start_or_create_machines
    current_available_machines = self.db.count_machines_available()
  File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1001, in count_machines_available
    return machines.count()
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 3176, in count
    return self._from_self(col).enable_eagerloads(False).scalar()
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2893, in scalar
    ret = self.one()
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
    return self._iter().one()
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
    result = self.session.execute(
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
    conn = self._connection_for_bind(bind)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    TransactionalContext._trans_ctx_check(self)
  File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands.
it's probably related to the way I create/use the current session in "_delete_machine_form_db" -
def _delete_machine_form_db(self, label):
    """
    cuckoo's DB class does not implement machine deletion, so we made one here
    :param label: the machine label
    """
    try:
        self.db.session.begin_nested()
        from lib.cuckoo.core.database import Machine

        machine = self.db.session.query(Machine).filter_by(label=label).first()
        if machine:
            self.db.session.delete(machine)
            self.db.session.commit()
    except SQLAlchemyError as e:
        log.debug("Database error removing machine: {0}".format(e))
        self.db.session.rollback()
        return
    finally:
        self.db.session.close()
The issue was solved after I commented out the last line, `self.db.session.close()`; now it deletes the "old" machine and creates a new one, and the whole process ends successfully. But I'm not sure how it will affect the rest of the code (if it affects anything at all).
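My guess at the mechanism, as a minimal sketch (assuming SQLAlchemy 1.4+; the exact CAPEv2 call pattern differs, but the principle looks the same): if some caller up the stack opened the transaction with a `with session.begin():` context manager, then committing or closing that same shared session from inside a helper ends the transaction out from under the context manager, and the very next statement on that session raises exactly this InvalidRequestError.

from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session

engine = create_engine("sqlite://")

with Session(engine) as session:
    with session.begin():                  # outer code owns this transaction
        session.execute(text("SELECT 1"))
        session.commit()                   # a helper ending the transaction early
        session.execute(text("SELECT 1"))  # sqlalchemy.exc.InvalidRequestError:
                                           # Can't operate on closed transaction
                                           # inside context manager...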
These are the final changes I made in the code:
def stop(self, label):
    log.info("Stopping vm %s" % label)
    status = self._status(label)
    if status == AWS.POWEROFF:
        raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)
    if self._is_autoscaled(self.ec2_machines[label]):
        self.ec2_machines[label].terminate()
    else:
        self.ec2_machines[label].stop(Force=True)
        self._wait_status(label, AWS.POWEROFF)
        self._restore(label)

"""override Machinery method"""

def release(self, machine: Machine) -> Machine:
    """
    we override it to have the ability to run start_or_create_machines() after unlocking the last machine
    Release a machine.
    @param label: machine label.
    """
    retval = super(AWS, self).release(machine)
    if self._is_autoscaled(self.ec2_machines[machine.label]):
        log.info("release - start delete_machine_from_db")
        self._delete_machine_form_db(machine.label)
        self.dynamic_machines_count -= 1
    self._start_or_create_machines()
    return retval

def _delete_machine_form_db(self, label):
    """
    cuckoo's DB class does not implement machine deletion, so we made one here
    :param label: the machine label
    """
    try:
        self.db.session.begin_nested()
        from lib.cuckoo.core.database import Machine

        machine = self.db.session.query(Machine).filter_by(label=label).first()
        if machine:
            self.db.session.delete(machine)
            self.db.session.commit()
    except SQLAlchemyError as e:
        log.debug("Database error removing machine: {0}".format(e))
        self.db.session.rollback()
        return
    # finally:
    #     self.db.session.close()
@dor15 In `stop`, can you replace `self._delete_machine_form_db(label)` with `super(AWS, self).delete_machine(label)`.
Total changes would now look like this:
def stop(self, label):
    """
    Stops a virtual machine.
    If the machine has initialized from autoscaled component, then terminate it.
    @param label: virtual machine label.
    @raise CuckooMachineError: if unable to stop.
    """
    log.debug("Stopping vm %s" % label)
    status = self._status(label)
    if status == AWS.POWEROFF:
        raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)
    if self._is_autoscaled(self.ec2_machines[label]):
        self.ec2_machines[label].terminate()
    else:
        self.ec2_machines[label].stop(Force=True)
        self._wait_status(label, AWS.POWEROFF)
        self._restore(label)

"""override Machinery method"""

def release(self, machine: Machine) -> Machine:
    """
    we override it to have the ability to run start_or_create_machines() after unlocking the last machine
    Release a machine.
    @param label: machine label.
    """
    retval = super(AWS, self).release(machine)
    if self._is_autoscaled(self.ec2_machines[machine.label]):
        super(AWS, self).delete_machine(machine.label)
        self.dynamic_machines_count -= 1
    self._start_or_create_machines()
    return retval
I think the machinery file should do as little session spawning as possible, but that's just me. Let me know if this fixes the last part of the problem.
but aren't we talking about an azure issue here, not aws?
@doomedraven Azure issue was solved with #2243. This is a separate issue with AWS. I have a PR draft waiting for confirmation that these changes fix it.
ah ok thank you
Hey, I believe you meant "replace `self._delete_machine_form_db(label)` with `super(AWS, self).delete_machine(label)`" in the `release` func and not in `stop`.
I tried to apply the changes you suggested, but the process doesn't create a new machine in place of the one that finished the job -
2024-08-29 09:17:58,140 [modules.machinery.aws] INFO: Stopping vm i-0eef625a9c17d557b
2024-08-29 09:17:58,339 [modules.machinery.aws] INFO: instance state: running
2024-08-29 09:17:58,671 [lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.
@dor15 You're right, I said that wrong. The code was what I had intended, though. I don't understand where you are getting that log line from: `[lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.` This would be easier if I could see your code. Do you have a fork that I can view?
You are passing the label (instance id), not the Machine.name.
@ethhart Great catch, that is definitely the problem. @dor15 The line should be `super(AWS, self).delete_machine(machine.name)`. I'll update the PR draft.
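For context, a toy illustration of the mix-up (not CAPEv2 code; the name "vm_4" is made up, and I'm assuming `delete_machine` looks the row up by its `name` column, which is what the thread suggests): handing it the label, i.e. the EC2 instance id, matches nothing.

from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Machine(Base):  # toy stand-in, not the real model
    __tablename__ = "machines"
    id = Column(Integer, primary_key=True)
    name = Column(String)   # e.g. "vm_4" (hypothetical) -- what the name-based lookup matches on
    label = Column(String)  # the EC2 instance id

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Machine(id=1, name="vm_4", label="i-0eef625a9c17d557b"))
    session.commit()

    # Passing the label to a name-based lookup finds nothing -> the
    # "does not exist in the database" warning:
    print(session.query(Machine).filter_by(name="i-0eef625a9c17d557b").first())  # None
    # Passing the name finds the row that should be deleted:
    print(session.query(Machine).filter_by(name="vm_4").first())  # <Machine ...>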
No, I don't have a fork, but that warning line comes from the original database code (lib.cuckoo.core.database).
@ethhart, you are right; now it's working.
so the changes that @ChrisThibodeaux suggested work and solve the issue.
so, now that we merged the fix, is this issue still relevant? as it says Azure, not AWS
@doomedraven I should have asked dor15 to split this off into a separate ticket sooner. I think we can close now.
Thanks
Expected Behavior
When using Azure machinery with a VMSS, analysis should complete without database errors.
Current Behavior
When the last available task has been assigned to a machine, any other finishing analysis causes a sqlalchemy.orm.exc.StaleDataError. Afterwards, remaining tasks will always remain in the `running` state.

Failure Information (for bugs)
Steps to Reproduce
Context
SQLAlchemy version = 1.4.50
Using Azure machinery with auto-scaling. The ScalingBoundSemaphore usage in the master branch has a bug when used with Azure auto-scaling that consistently leads to deadlocking. I have implemented a workaround that updates the semaphore in analysis_manager immediately before the machine is locked. If that is relevant, I can share the fix I have for that as well.
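For what it's worth, here is a rough sketch of the kind of workaround described, under the assumption that the scheduler's machine lock is a counting semaphore whose upper bound is meant to track how many machines the machinery can currently provide (the real ScalingBoundSemaphore API may well differ):

import threading

class ResizableBoundedSemaphore:
    """Counting semaphore whose capacity can be raised or lowered at runtime."""

    def __init__(self, value: int):
        self._capacity = value
        self._in_use = 0
        self._cond = threading.Condition()

    def update_limit(self, value: int) -> None:
        # Called right before locking a machine, so machines added by
        # auto-scaling are counted and acquire() cannot block forever.
        with self._cond:
            self._capacity = value
            self._cond.notify_all()

    def acquire(self) -> None:
        with self._cond:
            while self._in_use >= self._capacity:
                self._cond.wait()
            self._in_use += 1

    def release(self) -> None:
        with self._cond:
            self._in_use = max(0, self._in_use - 1)
            self._cond.notify_all()

# Hypothetical call site in the scheduler / analysis manager:
# machine_lock.update_limit(current_machine_count)  # refresh the bound
# machine_lock.acquire()                            # then lock a machine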
Failure Logs
Task 8 was the last task. It began running while task 6 was running. When task 6 finished, I got the error above. This is after successfully reporting on tasks 1-5 and 7.
Thoughts on where to begin looking for the cause?