google / gcp_scanner

A comprehensive scanner for Google Cloud
Apache License 2.0

feat ✨: make the scan loop asynchronous #237

Closed peb-peb closed 1 year ago

peb-peb commented 1 year ago

Description

This issue proposes to make the main scan loop asynchronous. This will allow the scanner to continue scanning other resources while it is waiting for the results of a long-running operation. This will improve the performance of the scanner and make it more responsive to user input.

[image]

The scan loop would call the crawlers asynchronously.

[image]

Additional Notes

I believe that the changes made to the main scan loop will improve the performance of the scanner.

After #228, we can use asyncio.gather() to achieve a similar result.
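A minimal sketch of what the asyncio.gather() approach could look like. The crawler functions below are illustrative stand-ins, not the real gcp_scanner crawler API:

```python
import asyncio

# Hypothetical async crawler stubs standing in for the real crawlers;
# the actual names and signatures in gcp_scanner differ.
async def crawl_compute_instances(project_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulate a network call
    return {"project": project_id, "resource": "compute_instances"}

async def crawl_storage_buckets(project_id: str) -> dict:
    await asyncio.sleep(0.1)
    return {"project": project_id, "resource": "storage_buckets"}

async def scan_project(project_id: str) -> list:
    # gather() schedules both crawlers concurrently and waits for all results,
    # returning them in the order the coroutines were passed in
    return await asyncio.gather(
        crawl_compute_instances(project_id),
        crawl_storage_buckets(project_id),
    )

results = asyncio.run(scan_project("test-project"))
print([r["resource"] for r in results])  # ['compute_instances', 'storage_buckets']
```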

peb-peb commented 1 year ago

Issues

I've been running into the following issue for quite some time when trying to run and test this code; I'm getting the error below :point_down: :(

[screenshot of the error]

mshudrak commented 1 year ago

Hmm, I was unable to reproduce this problem. I'm running it on Python 3.9.2. However, let's take a step back and think about what we actually need to make asynchronous. I don't think we can do it for the project_list and project_info queries, since their results are required by the functions later in the scanner loop to fetch specific resources. I made the change in my code, and so far I see the following issue:

2023-07-05 21:24:26 - INFO - Retrieving Compute Snapshots
2023-07-05 21:24:26 - INFO - Failed to get compute snapshots in the test-gcp-scanner-2
2023-07-05 21:24:26 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302bcf0c0>)
2023-07-05 21:24:26 - INFO - Retrieving Subnets
2023-07-05 21:24:27 - INFO - Failed to get subnets in the test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302bb0180>)
2023-07-05 21:24:27 - INFO - Retrieving Firewall Rules
2023-07-05 21:24:27 - INFO - Failed to get firewall rules in the test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f13031fbb00>)
2023-07-05 21:24:27 - INFO - Retrieving app services
2023-07-05 21:24:27 - INFO - Failed to retrieve App services for project test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302614a40>)
2023-07-05 21:24:27 - INFO - Retrieving GCS Buckets
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/__main__.py", line 22, in <module>
    scanner.main()
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/scanner.py", line 590, in main
    asyncio.run(crawl_loop(sa_tuples, args.output, scan_config, args.light_scan,
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/scanner.py", line 293, in crawl_loop
    project_result['storage_buckets'] = await CrawlerFactory.create_crawler(
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/crawler/storage_buckets_crawler.py", line 49, in crawl
    response = await request.execute()
TypeError: object dict can't be used in 'await' expression

This is basically happening for every call to a specific resource.
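The ordering constraint described above (fetch the project list first, then crawl resources concurrently) can be sketched like this. The function names are illustrative only, not the real gcp_scanner API:

```python
import asyncio

# Illustrative stubs; the real gcp_scanner functions differ.
def get_project_list() -> list:
    # Must run first: everything later depends on its output.
    return ["project-a", "project-b"]

async def crawl_resources(project_id: str) -> dict:
    await asyncio.sleep(0.1)  # stands in for the per-resource crawls
    return {"project": project_id}

async def crawl_loop() -> list:
    projects = get_project_list()  # sequential prerequisite
    # Only the per-project resource crawls run concurrently.
    return await asyncio.gather(*(crawl_resources(p) for p in projects))

print(asyncio.run(crawl_loop()))
```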

peb-peb commented 1 year ago

> I don't think that we can do it for project_list and project_info queries since it is required for the functions later on in the scanner loop to fetch specific resources.

I'll try it and update on the progress.

mshudrak commented 1 year ago

I am far from an expert in asyncio, but it seems like request.execute() is not a coroutine, and it complains if I start a second loop there. One solution could be to use nest_asyncio, but so far I'm having trouble making it work.
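This matches the TypeError in the log: a synchronous call returns a plain dict, and awaiting a dict fails. One standard workaround is to run the blocking call in a worker thread via asyncio.to_thread (available in Python 3.9+). A sketch with a stand-in for the blocking client call:

```python
import asyncio
import time

# Stand-in for a synchronous client call like googleapiclient's
# request.execute(), which returns a plain dict (hence the error
# "object dict can't be used in 'await' expression" when awaited directly).
def execute_request() -> dict:
    time.sleep(0.1)  # simulate blocking network I/O
    return {"items": []}

async def crawl() -> dict:
    # `await execute_request()` would raise TypeError, because the call
    # runs to completion synchronously and hands `await` a dict.
    # asyncio.to_thread runs it in a worker thread and awaits the result.
    return await asyncio.to_thread(execute_request)

print(asyncio.run(crawl()))  # {'items': []}
```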

mshudrak commented 1 year ago

Are you sure you are familiar enough with asyncio to implement this? We are totally fine if you decide to go with classic Python multithreading or multiprocessing...

peb-peb commented 1 year ago

I'll give asyncio one last shot and then switch over to multithreading/multiprocessing...

peb-peb commented 1 year ago

I have implemented asyncio. Things that had to be done:

For example: according to this, the code below should return its output in 2 seconds, which it does: [screenshot]
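The screenshot is gone, but the standard demo of this effect looks roughly like the following: two 2-second waits, awaited concurrently with asyncio.gather(), complete in about 2 seconds total rather than 4:

```python
import asyncio
import time

async def wait_two_seconds(name: str) -> str:
    await asyncio.sleep(2)  # non-blocking sleep yields to the event loop
    return name

async def main() -> float:
    start = time.monotonic()
    # Both sleeps overlap, so total wall time is ~2s, not ~4s.
    await asyncio.gather(wait_two_seconds("a"), wait_two_seconds("b"))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.1f}s")  # ~2.0s
```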

But when I try to apply the same approach to our tool, it still makes the requests in the same way and consumes the same amount of time. So, the solutions would be:

What should be the next steps? @mshudrak @ZetaTwo

Some related discussion on similar issues in google-api-python-client:

mshudrak commented 1 year ago

I'd go for a multiprocessing ThreadPool for GCP resource requests, and I'd do actual multiprocessing for project-based parallelism.
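A minimal sketch of the ThreadPool half of this suggestion; the crawler call is a hypothetical stand-in, and the project-level multiprocessing.Pool layer is only noted in a comment to keep the example self-contained:

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for a blocking crawler call; in the real tool
# a synchronous googleapiclient request.execute() would go here.
def fetch_resource(args):
    project_id, resource = args
    return (project_id, resource)

def scan_project(project_id):
    resources = ["compute_instances", "storage_buckets", "firewall_rules"]
    # Threads are sufficient for I/O-bound API requests within one project;
    # the GIL is released while waiting on the network.
    with ThreadPool(processes=len(resources)) as pool:
        return pool.map(fetch_resource, [(project_id, r) for r in resources])

if __name__ == "__main__":
    # For project-level parallelism, a multiprocessing.Pool could map
    # scan_project() over the project list; a plain loop is shown here.
    for project in ["project-a", "project-b"]:
        print(scan_project(project))
```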

peb-peb commented 1 year ago

> I'd go for multiprocessing ThreadPool for GCP resource requests and I'd do actual multiprocessing for project-based parallelism.

ok :+1:

peb-peb commented 1 year ago

I'll close this and send a new draft PR with the required changes.