kevoreilly / CAPEv2

sqlalchemy StaleDataError with Azure machinery #2242

Closed ChrisThibodeaux closed 2 months ago

ChrisThibodeaux commented 3 months ago

Expected Behavior

When using Azure machinery with a VMSS, analysis should complete without database errors.

Current Behavior

When the last available task has been assigned to a machine, any other analysis that finishes raises a sqlalchemy.orm.exc.StaleDataError. Afterward, the remaining tasks stay stuck in the running state.

Failure Information (for bugs)

Steps to Reproduce

  1. Allow VMSS max size to be greater than one.
  2. Submit any number of tasks greater than one. We want at least two machines to be running analysis.
  3. When the last task is assigned, whichever running task finishes next produces a StaleDataError.

Context

SQLAlchemy version = 1.4.50

Using Azure machinery with auto-scaling. The ScalingBoundSemaphore usage in the master branch has a bug that consistently leads to deadlocking under Azure auto-scaling. I have implemented a workaround that updates the semaphore in analysis_manager immediately before the machine is locked; if that is relevant, I can share that fix as well.
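
To illustrate the idea, here is a hypothetical ResizableSemaphore sketch (not CAPE's actual ScalingBoundSemaphore API): resize the semaphore's bound to the current machine count right before a machine is locked, so waiters never block on a stale limit.

    import threading

    class ResizableSemaphore:
        """Hypothetical stand-in for ScalingBoundSemaphore: a counting
        semaphore whose upper bound can be resized as the pool scales."""

        def __init__(self, limit: int):
            self._cond = threading.Condition()
            self._limit = limit  # current machine count
            self._used = 0       # slots currently held

        def update_limit(self, limit: int) -> None:
            # Called immediately before locking a machine, so the bound
            # reflects the pool's real size instead of a stale value.
            with self._cond:
                self._limit = limit
                self._cond.notify_all()  # wake waiters if the bound grew

        def acquire(self) -> None:
            with self._cond:
                while self._used >= self._limit:
                    self._cond.wait()
                self._used += 1

        def release(self) -> None:
            with self._cond:
                self._used = max(0, self._used - 1)
                self._cond.notify()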

Question     Answer
Git commit   303bdcc2
OS version   Ubuntu 22.04

Failure Logs

Jul 20 17:33:16 hostname python3[251990]: 2024-07-20 17:33:16,107 [lib.cuckoo.core.guest] DEBUG: Task #8: Analysis is still running (id=cape-guest-vmss-win10x64_0, ip=10.2.2.5)
Jul 20 17:33:16 hostname python3[251990]: 2024-07-20 17:33:16,923 [lib.cuckoo.core.guest] DEBUG: Task #6: Analysis is still running (id=cape-guest-vmss-win10x64_2, ip=10.2.2.6)
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,099 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for b'LOG'>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,100 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for None>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,100 [lib.cuckoo.core.resultserver] DEBUG: Task #6 had connection reset by peer for <Context for b'BSON'>
Jul 20 17:33:17 hostname python3[251990]: 2024-07-20 17:33:17,939 [lib.cuckoo.core.guest] INFO: Task #6: Analysis completed successfully (id=cape-guest-vmss-win10x64_2, ip=10.2.2.6)
Jul 20 17:33:18 hostname sudo[252717]: pam_unix(sudo:session): session closed for user root
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,004 [lib.cuckoo.core.plugins] DEBUG: Stopped auxiliary module: Sniffer
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,125 [lib.cuckoo.core.resultserver] DEBUG: Task #6: Stopped tracking machine 10.2.2.6
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,125 [modules.machinery.az] DEBUG: Stopping machine 'cape-guest-vmss-win10x64_2'
Jul 20 17:33:18 hostname python3[251990]: 2024-07-20 17:33:18,138 [lib.cuckoo.core.analysis_manager] ERROR: Task #6: failure in AnalysisManager.run
Jul 20 17:33:18 hostname python3[251990]: Traceback (most recent call last):
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 519, in run
Jul 20 17:33:18 hostname python3[251990]:     self.launch_analysis()
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 481, in launch_analysis
Jul 20 17:33:18 hostname python3[251990]:     success = self.perform_analysis()
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 455, in perform_analysis
Jul 20 17:33:18 hostname python3[251990]:     with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
Jul 20 17:33:18 hostname python3[251990]:   File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     next(self.gen)
Jul 20 17:33:18 hostname python3[251990]:   File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 346, in machine_running
Jul 20 17:33:18 hostname python3[251990]:     with self.db.session.begin():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 235, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     with util.safe_reraise():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     compat.raise_(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Jul 20 17:33:18 hostname python3[251990]:     raise exception
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 233, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     self.commit()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 832, in commit
Jul 20 17:33:18 hostname python3[251990]:     self._prepare_impl()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 811, in _prepare_impl
Jul 20 17:33:18 hostname python3[251990]:     self.session.flush()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3449, in flush
Jul 20 17:33:18 hostname python3[251990]:     self._flush(objects)
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3588, in _flush
Jul 20 17:33:18 hostname python3[251990]:     with util.safe_reraise():
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
Jul 20 17:33:18 hostname python3[251990]:     compat.raise_(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
Jul 20 17:33:18 hostname python3[251990]:     raise exception
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 3549, in _flush
Jul 20 17:33:18 hostname python3[251990]:     flush_context.execute()
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 456, in execute
Jul 20 17:33:18 hostname python3[251990]:     rec.execute(self)
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/unitofwork.py", line 630, in execute
Jul 20 17:33:18 hostname python3[251990]:     util.preloaded.orm_persistence.save_obj(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 237, in save_obj
Jul 20 17:33:18 hostname python3[251990]:     _emit_update_statements(
Jul 20 17:33:18 hostname python3[251990]:   File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/persistence.py", line 1035, in _emit_update_statements
Jul 20 17:33:18 hostname python3[251990]:     raise orm_exc.StaleDataError(
Jul 20 17:33:18 hostname python3[251990]: sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched.

Task 8 was the last task. It began running while task 6 was running. When task 6 finished, I got the error above. This is after successfully reporting on tasks 1-5 and 7.

Thoughts on where to begin looking for the cause?

ChrisThibodeaux commented 3 months ago

Correction to the setup steps in my post. This error is also happening when using a single VMSS instance. All other info applies.

ChrisThibodeaux commented 3 months ago

Found the issue. In stop(self) in az.py, this line was deleting the machine from the database before AnalysisManager could release it here. I suppose this only happened when the last task was taken, because the Azure machinery does not attempt to scale down until there are no relevant tasks remaining.
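
For context, the failure mode can be reproduced in isolation. A minimal sketch, assuming SQLAlchemy 1.4 and a throwaway SQLite table (not the actual CAPE models): delete the row behind the ORM session's back, then let the next flush try to UPDATE it.

    from sqlalchemy import Column, Integer, String, create_engine
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Machine(Base):
        __tablename__ = "machines"
        id = Column(Integer, primary_key=True)
        status = Column(String)

    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(Machine(id=1, status="running"))
        session.commit()

        machine = session.get(Machine, 1)  # the session still tracks this row
        # What stop() effectively did: delete the row out from under the
        # session (a Core delete, so the identity map is not synchronized).
        session.execute(Machine.__table__.delete().where(Machine.__table__.c.id == 1))

        machine.status = "poweroff"  # what release() then tries to persist
        session.commit()  # StaleDataError: expected to update 1 row(s); 0 were matched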

I will have a fix for this in the PR bringing azure machinery up to date.

dor15 commented 2 months ago

Hey Chris, what's up? We are facing the same issue but with aws.py. This is the error we got: "sqlalchemy.orm.exc.StaleDataError: UPDATE statement on table 'machines' expected to update 1 row(s); 0 were matched"

"Traceback (most recent call last): File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run self.launch_analysis() File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis success = self.perform_analysis() File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary(): File "/usr/lib/python3.10/contextlib.py", line 142, in exit next(self.gen) File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running self.machinery_manager.machinery.release(self.machine) File "/opt/CAPEv2/modules/machinery/aws.py", line 339, in release self._start_or_create_machines() File "/opt/CAPEv2/modules/machinery/aws.py", line 231, in _start_or_create_machines current_available_machines = self.db.count_machines_available() File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1002, in count_machines_available return machines.count() "

Do you think it's the same issue you faced? If so, can you share the solution you made?

ChrisThibodeaux commented 2 months ago

@dor15 I think it is exactly like the issue I faced. You are going to want to look at aws.py and review the two methods stop and release. In essence, if you are removing the machine in stop, you are going to have a bad time. That must occur in release. Should this be the actual issue, the fix is to move machine deletion into the release method and out of stop.

This is where stop and release are happening in the flow of a sandbox run: https://github.com/kevoreilly/CAPEv2/blob/master/lib/cuckoo/core/analysis_manager.py#L292-L341

Notice that the stop occurs first by calling self.machinery_manager.stop_machine(self.machine) which is just this:

def stop_machine(self, machine: Machine) -> None:
    self.machinery.stop(machine.label)

You can see the fix that I made for the azure machinery: before after

ChrisThibodeaux commented 2 months ago

Honestly, I'm surprised you are running into that now and no one else has reported it before. If you do happen to find the problem and the above fixes it for you, please create a pull request with the changes. It goes a long way!

doomedraven commented 2 months ago

maybe it's only related to cloud providers? I don't have those issues with KVM

ChrisThibodeaux commented 2 months ago

@doomedraven I think it is most likely how az/aws.py have chosen to deal with autoscaling. The az module was having issues because it was deleting the machine from the database in the stop call. Looking at aws.py, I see that only happens when self._is_autoscaled is true for the machine it's called on. By the time the AnalysisManager tells the MachineryManager to release the machine, it has already been removed from the database, so the call using that machine's ID results in stale data errors.

I don't think any other machinery module explicitly removes machines from the database as the normal flow in stop. Thus, when the MachineryManager attempts to release it, the machine is still present in the database.

@dor15 I would be very interested in knowing a bit more about your setup. Have I correctly guessed that you are using autoscaling?

doomedraven commented 2 months ago

so maybe we need to backport that aws feature to az?

doomedraven commented 2 months ago

I'm gonna keep this open for some time till we see what we're gonna do, as it's summer and I don't have a lot of time to properly concentrate on big tasks here

dor15 commented 2 months ago

> Honestly, I'm surprised you are running into that now and no one else has reported it before. If you do happen to find the problem and the above fixes it for you, please create a pull request with the changes. It goes a long way!

More than that, in the original code, def _delete_machine_form_db(self, label) tries to create a new session: session = db.Session()

and it got the following exception:

"/opt/CAPEv2/modules/machinery/aws.py" 435L, 16769B 107,18 21% with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary(): File "/usr/lib/python3.10/contextlib.py", line 142, in exit next(self.gen) File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 325, in machine_running self.machinery_manager.stop_machine(self.machine) File "/opt/CAPEv2/lib/cuckoo/core/machinery_manager.py", line 308, in stop_machine self.machinery.stop(machine.label) File "/opt/CAPEv2/modules/machinery/aws.py", line 300, in stop self._delete_machine_form_db(label) File "/opt/CAPEv2/modules/machinery/aws.py", line 107, in _delete_machine_form_db session = db.Session() NameError: name 'db' is not defined

So I used the existing session via self.db.session to proceed with running the code, and then I got the error I described above. I agree with you; it is a bit weird that no one has reported at least this issue before.

dor15 commented 2 months ago

> @dor15 I would be very interested in knowing a bit more about your setup. Have I correctly guessed that you are using autoscaling?

Yes, from aws.conf:

    [autoscale]
    # Enable auto-scale in cuckoo (by setting autoscale = yes). Recommended for better performance.
    autoscale = yes

ChrisThibodeaux commented 2 months ago

@dor15 I don't see session = db.session in the file at this line. I would make sure you are on the most current commit before going any further.

Anyway, try using this and see if it fixes the problem.

    def stop(self, label):
        """
        Stops a virtual machine.
        If the machine has initialized from autoscaled component, then terminate it.
        @param label: virtual machine label.
        @raise CuckooMachineError: if unable to stop.
        """
        log.debug("Stopping vm %s" % label)

        status = self._status(label)

        if status == AWS.POWEROFF:
            raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

        if self._is_autoscaled(self.ec2_machines[label]):
            self.ec2_machines[label].terminate()
        else:
            self.ec2_machines[label].stop(Force=True)
            self._wait_status(label, AWS.POWEROFF)
            self._restore(label)

    """override Machinery method"""

    def release(self, machine: Machine) -> Machine:
        """
        we override it to have the ability to run start_or_create_machines() after unlocking the last machine
        Release a machine.
        @param label: machine label.
        """
        retval = super(AWS, self).release(machine)

        if self._is_autoscaled(self.ec2_machines[machine.label]):
            self._delete_machine_form_db(machine.label)
            self.dynamic_machines_count -= 1

        self._start_or_create_machines()
        return retval

I don't know what self.ec2_machines[label].terminate() does, so I left it alone. I moved the machine deletion and count decrement down to immediately after the machine's release.

dor15 commented 2 months ago

Yes, I meant this line: https://github.com/kevoreilly/CAPEv2/blob/d0732ee9d4b01184f86136336ee4ed5af98c34ca/modules/machinery/aws.py#L108 - session = self.db.Session().

Instead of that, I use self.db.session before any session action, for example: machine = session.query(Machine).filter_by(label=label).first() --> machine = self.db.session.query(Machine).filter_by(label=label).first()

I'll try to apply your fix and update. Thanks!

dor15 commented 2 months ago

Trying to apply your fix and now getting a new error:

    2024-08-28 09:02:48,881 [lib.cuckoo.core.analysis_manager] ERROR: Task #100: failure in AnalysisManager.run
    Traceback (most recent call last):
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 498, in run
        self.launch_analysis()
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 460, in launch_analysis
        success = self.perform_analysis()
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 444, in perform_analysis
        with self.machine_running(), self.result_server(), self.network_routing(), self.run_auxiliary():
      File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
        next(self.gen)
      File "/opt/CAPEv2/lib/cuckoo/core/analysis_manager.py", line 334, in machine_running
        self.machinery_manager.machinery.release(self.machine)
      File "/opt/CAPEv2/modules/machinery/aws.py", line 322, in release
        self._start_or_create_machines()
      File "/opt/CAPEv2/modules/machinery/aws.py", line 207, in _start_or_create_machines
        current_available_machines = self.db.count_machines_available()
      File "/opt/CAPEv2/lib/cuckoo/core/database.py", line 1001, in count_machines_available
        return machines.count()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 3176, in count
        return self._from_self(col).enable_eagerloads(False).scalar()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2893, in scalar
        ret = self.one()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2870, in one
        return self._iter().one()
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/query.py", line 2916, in _iter
        result = self.session.execute(
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1716, in execute
        conn = self._connection_for_bind(bind)
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
        TransactionalContext._trans_ctx_check(self)
      File "/home/cape/.cache/pypoetry/virtualenvs/capev2-t2x27zRb-py3.10/lib/python3.10/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
        raise exc.InvalidRequestError(
    sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands.

It's probably related to the way I create/use the current session in _delete_machine_form_db:

    def _delete_machine_form_db(self, label):
        """
        cuckoo's DB class does not implement machine deletion, so we made one here
        :param label: the machine label
        """
        try:
            self.db.session.begin_nested()
            from lib.cuckoo.core.database import Machine

            machine = self.db.session.query(Machine).filter_by(label=label).first()
            if machine:
                self.db.session.delete(machine)
                self.db.session.commit()
        except SQLAlchemyError as e:
            log.debug("Database error removing machine: {0}".format(e))
            self.db.session.rollback()
            return
        finally:
            self.db.session.close()

dor15 commented 2 months ago

> Trying to apply your fix and now getting a new error: sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. […]
>
> It's probably related to the way I create/use the current session in _delete_machine_form_db […]

The issue was solved after I commented out the last line, self.db.session.close(); now it deletes the "old" machine, creates a new one, and the whole process ends successfully.

But I'm not sure how it will affect the rest of the code (if it will affect anything).
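
That fits the traceback above: the finally: self.db.session.close() tears down the transaction that the caller's own with self.db.session.begin(): block still holds. A stripped-down illustration (plain SQLAlchemy 1.4, not CAPE code):

    from sqlalchemy import create_engine, text
    from sqlalchemy.orm import Session

    engine = create_engine("sqlite://")
    session = Session(engine)

    with session.begin():                  # the caller's transaction (analysis_manager)
        session.execute(text("SELECT 1"))  # machinery work inside the transaction
        session.close()                    # what the helper's finally: clause did
        session.execute(text("SELECT 1"))  # raises InvalidRequestError: closed transaction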

These are the final changes I made in the code:

def stop(self, label):
    log.info("Stopping vm %s" % label)

    status = self._status(label)

    if status == AWS.POWEROFF:
        raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

    if self._is_autoscaled(self.ec2_machines[label]):
        self.ec2_machines[label].terminate()
    else:
        self.ec2_machines[label].stop(Force=True)
        self._wait_status(label, AWS.POWEROFF)
        self._restore(label)

"""override Machinery method"""

def release(self, machine: Machine) -> Machine:
    """
    we override it to have the ability to run start_or_create_machines() after unlocking the last machine
    Release a machine.
    @param label: machine label.
    """

    retval = super(AWS, self).release(machine)
    if self._is_autoscaled(self.ec2_machines[machine.label]):
        log.info("release - start delete_machine_from_db")
        self._delete_machine_form_db(machine.label)
        self.dynamic_machines_count -= 1

    self._start_or_create_machines()
    return retval

def _delete_machine_form_db(self, label):
    """
    cuckoo's DB class does not implement machine deletion, so we made one here
    :param label: the machine label
    """
    try:
        self.db.session.begin_nested()
        from lib.cuckoo.core.database import Machine

        machine = self.db.session.query(Machine).filter_by(label=label).first()
        if machine:
            self.db.session.delete(machine)
            self.db.session.commit()
    except SQLAlchemyError as e:
        log.debug("Database error removing machine: {0}".format(e))
        self.db.session.rollback()
        return
    #finally:
    #    self.db.session.close()

ChrisThibodeaux commented 2 months ago

@dor15 In stop, can you replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label)?

Total changes would now look like this:

    def stop(self, label):
        """
        Stops a virtual machine.
        If the machine has initialized from autoscaled component, then terminate it.
        @param label: virtual machine label.
        @raise CuckooMachineError: if unable to stop.
        """
        log.debug("Stopping vm %s" % label)

        status = self._status(label)

        if status == AWS.POWEROFF:
            raise CuckooMachineError("Trying to stop an already stopped VM: %s" % label)

        if self._is_autoscaled(self.ec2_machines[label]):
            self.ec2_machines[label].terminate()
        else:
            self.ec2_machines[label].stop(Force=True)
            self._wait_status(label, AWS.POWEROFF)
            self._restore(label)

    """override Machinery method"""

    def release(self, machine: Machine) -> Machine:
        """
        we override it to have the ability to run start_or_create_machines() after unlocking the last machine
        Release a machine.
        @param label: machine label.
        """
        retval = super(AWS, self).release(machine)

        if self._is_autoscaled(self.ec2_machines[machine.label]):
            super(AWS, self).delete_machine(machine.label)
            self.dynamic_machines_count -= 1

        self._start_or_create_machines()
        return retval

I think the machinery file should do as little session spawning as possible, but that's just me. Let me know if this fixes the last part of the problem.

doomedraven commented 2 months ago

but we're speaking about Azure, not the AWS issue?

ChrisThibodeaux commented 2 months ago

@doomedraven The Azure issue was solved with #2243. This is a separate issue with AWS. I have a PR draft waiting for confirmation that these changes fix it.

doomedraven commented 2 months ago

ah ok thank you

dor15 commented 2 months ago

> @dor15 In stop, can you replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label)? Total changes would now look like this: […]

Hey, I believe you meant "replace self._delete_machine_form_db(label) with super(AWS, self).delete_machine(label)" in the release func, not in stop. I tried to apply the changes you suggested, but the process doesn't create a new machine in place of the one that finished the job -

    2024-08-29 09:17:58,140 [modules.machinery.aws] INFO: Stopping vm i-0eef625a9c17d557b
    2024-08-29 09:17:58,339 [modules.machinery.aws] INFO: instance state: running
    2024-08-29 09:17:58,671 [lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.

ChrisThibodeaux commented 2 months ago

@dor15 You're right, I said that wrong; the code was what I had intended, though. I don't understand where that log line ([lib.cuckoo.core.database] WARNING: i-0eef625a9c17d557b does not exist in the database.) is coming from. This would be easier if I could see your code. Do you have a fork that I can view?

ethhart commented 2 months ago

You are passing the label (instance id), not the Machine.name.

ChrisThibodeaux commented 2 months ago

> You are passing the label (instance id), not the Machine.name.

@ethhart Great catch, that is definitely the problem. @dor15 The line should be super(AWS, self).delete_machine(machine.name)

I'll update the PR draft.
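
For reference, with that correction the release override suggested above would presumably read (only the delete_machine argument changes):

    def release(self, machine: Machine) -> Machine:
        """
        we override it to have the ability to run start_or_create_machines() after unlocking the last machine
        Release a machine.
        """
        retval = super(AWS, self).release(machine)

        if self._is_autoscaled(self.ec2_machines[machine.label]):
            # delete_machine() looks the machine up by name, so pass
            # machine.name rather than the label (the EC2 instance id)
            super(AWS, self).delete_machine(machine.name)
            self.dynamic_machines_count -= 1

        self._start_or_create_machines()
        return retval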

dor15 commented 2 months ago

> @dor15 You're right, I said that wrong. […] This would be easier if I could see your code. Do you have a fork that I can view?

No, I don't have a fork, but the line is from the original database code.

> @ethhart Great catch, that is definitely the problem. @dor15 The line should be super(AWS, self).delete_machine(machine.name)

@ethhart, you are right; now it's working.

so the changes that @ChrisThibodeaux suggested work and solve the issue.

doomedraven commented 2 months ago

so, now that we merged the fix, is this issue still relevant? As it says Azure, not AWS

ChrisThibodeaux commented 2 months ago

@doomedraven I should have asked dor15 to split this off into a separate ticket sooner. I think we can close now.

doomedraven commented 2 months ago

Thanks