google / gcp_scanner

A comprehensive scanner for Google Cloud
Apache License 2.0

feat ✨: make the scan loop asynchronous #237

Closed peb-peb closed 1 year ago

peb-peb commented 1 year ago

Description

This issue proposes to make the main scan loop asynchronous. This will allow the scanner to continue scanning other resources while it is waiting for the results of a long-running operation. This will improve the performance of the scanner and make it more responsive to user input.

[image]

The scan loop would call the crawlers asynchronously.

[image]

Additional Notes

I believe that the changes made to the main scan loop will improve the performance of the scanner.

After #228, we can use asyncio.gather() to achieve a similar result.
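A minimal sketch of what the asyncio.gather() approach could look like. The crawler functions below are illustrative stand-ins, not the real gcp_scanner crawler API:

```python
import asyncio

# Hypothetical async crawler stubs standing in for the real crawlers;
# the actual names and signatures in gcp_scanner differ.
async def crawl_compute_instances(project_id: str) -> dict:
    await asyncio.sleep(0.1)  # simulate a network call
    return {"project": project_id, "resource": "compute_instances"}

async def crawl_storage_buckets(project_id: str) -> dict:
    await asyncio.sleep(0.1)
    return {"project": project_id, "resource": "storage_buckets"}

async def scan_project(project_id: str) -> list:
    # gather() schedules both crawlers concurrently and waits for all results,
    # returning them in the order the coroutines were passed in
    return await asyncio.gather(
        crawl_compute_instances(project_id),
        crawl_storage_buckets(project_id),
    )

results = asyncio.run(scan_project("test-project"))
print([r["resource"] for r in results])  # ['compute_instances', 'storage_buckets']
```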

peb-peb commented 1 year ago

Issues

I've been running into the following issue for quite some time when trying to run and test this code; I'm getting the error below :point_down: :(

[screenshot of the error]

mshudrak commented 1 year ago

Hmm, I was unable to reproduce this problem. I'm running it on Python 3.9.2. However, let's take a step back and think about what we actually need to make asynchronous. I don't think we can do it for the project_list and project_info queries, since their results are required by the functions later in the scanner loop to fetch specific resources. I made the change in my code, and so far I see the following issue:

2023-07-05 21:24:26 - INFO - Retrieving Compute Snapshots
2023-07-05 21:24:26 - INFO - Failed to get compute snapshots in the test-gcp-scanner-2
2023-07-05 21:24:26 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302bcf0c0>)
2023-07-05 21:24:26 - INFO - Retrieving Subnets
2023-07-05 21:24:27 - INFO - Failed to get subnets in the test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302bb0180>)
2023-07-05 21:24:27 - INFO - Retrieving Firewall Rules
2023-07-05 21:24:27 - INFO - Failed to get firewall rules in the test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f13031fbb00>)
2023-07-05 21:24:27 - INFO - Retrieving app services
2023-07-05 21:24:27 - INFO - Failed to retrieve App services for project test-gcp-scanner-2
2023-07-05 21:24:27 - INFO - (<class 'TypeError'>, TypeError("object dict can't be used in 'await' expression"), <traceback object at 0x7f1302614a40>)
2023-07-05 21:24:27 - INFO - Retrieving GCS Buckets
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/__main__.py", line 22, in <module>
    scanner.main()
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/scanner.py", line 590, in main
    asyncio.run(crawl_loop(sa_tuples, args.output, scan_config, args.light_scan,
  File "/usr/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.9/asyncio/base_events.py", line 642, in run_until_complete
    return future.result()
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/scanner.py", line 293, in crawl_loop
    project_result['storage_buckets'] = await CrawlerFactory.create_crawler(
  File "/home/mshudrak/.local/lib/python3.9/site-packages/gcp_scanner/crawler/storage_buckets_crawler.py", line 49, in crawl
    response = await request.execute()
TypeError: object dict can't be used in 'await' expression

This is basically happening for every call to a specific resource.
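The ordering constraint described above (fetch the project list first, then crawl resources concurrently) can be sketched like this. The function names are illustrative only, not the real gcp_scanner API:

```python
import asyncio

# Illustrative stubs; the real gcp_scanner functions differ.
def get_project_list() -> list:
    # Must run first: everything later depends on its output.
    return ["project-a", "project-b"]

async def crawl_resources(project_id: str) -> dict:
    await asyncio.sleep(0.1)  # stands in for the per-resource crawls
    return {"project": project_id}

async def crawl_loop() -> list:
    projects = get_project_list()  # sequential prerequisite
    # Only the per-project resource crawls run concurrently.
    return await asyncio.gather(*(crawl_resources(p) for p in projects))

print(asyncio.run(crawl_loop()))
```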

peb-peb commented 1 year ago

> I don't think that we can do it for project_list and project_info queries since it is required for the functions later on in the scanner loop to fetch specific resources.

I'll try it and update on the progress.

mshudrak commented 1 year ago

I am far from an expert in asyncio, but it seems like request.execute() is not a coroutine, and it complains if I start a second loop there. One solution could be to use nest_asyncio, but so far I'm having trouble making it work.
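This matches the TypeError in the log: a synchronous call returns a plain dict, and awaiting a dict fails. One standard workaround is to run the blocking call in a worker thread via asyncio.to_thread (available in Python 3.9+). A sketch with a stand-in for the blocking client call:

```python
import asyncio
import time

# Stand-in for a synchronous client call like googleapiclient's
# request.execute(), which returns a plain dict (hence the error
# "object dict can't be used in 'await' expression" when awaited directly).
def execute_request() -> dict:
    time.sleep(0.1)  # simulate blocking network I/O
    return {"items": []}

async def crawl() -> dict:
    # `await execute_request()` would raise TypeError, because the call
    # runs to completion synchronously and hands `await` a dict.
    # asyncio.to_thread runs it in a worker thread and awaits the result.
    return await asyncio.to_thread(execute_request)

print(asyncio.run(crawl()))  # {'items': []}
```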

mshudrak commented 1 year ago

Are you sure you are familiar enough with asyncio to implement this? We are totally fine if you decide to go with classic Python multithreading or multiprocessing...

peb-peb commented 1 year ago

I'll give asyncio one last shot and then switch over to multithreading/multiprocessing...

peb-peb commented 1 year ago

I have implemented asyncio. Things that had to be done:

For example: according to this, the code below should return its output in 2 seconds, which it does: [screenshot]
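The screenshot is gone, but the standard demo of this effect looks roughly like the following: two 2-second waits, awaited concurrently with asyncio.gather(), complete in about 2 seconds total rather than 4:

```python
import asyncio
import time

async def wait_two_seconds(name: str) -> str:
    await asyncio.sleep(2)  # non-blocking sleep yields to the event loop
    return name

async def main() -> float:
    start = time.monotonic()
    # Both sleeps overlap, so total wall time is ~2s, not ~4s.
    await asyncio.gather(wait_two_seconds("a"), wait_two_seconds("b"))
    return time.monotonic() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.1f}s")  # ~2.0s
```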

But when I try to apply the same approach to our tool, it still makes the requests in the same way and consumes the same amount of time. So, the solutions would be:

What should be the next steps? @mshudrak @ZetaTwo

Some related discussion on similar issues in google-api-python-client:

mshudrak commented 1 year ago

I'd go for a multiprocessing ThreadPool for GCP resource requests, and I'd do actual multiprocessing for project-based parallelism.
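A minimal sketch of the ThreadPool half of this suggestion; the crawler call is a hypothetical stand-in, and the project-level multiprocessing.Pool layer is only noted in a comment to keep the example self-contained:

```python
from multiprocessing.pool import ThreadPool

# Hypothetical stand-in for a blocking crawler call; in the real tool
# a synchronous googleapiclient request.execute() would go here.
def fetch_resource(args):
    project_id, resource = args
    return (project_id, resource)

def scan_project(project_id):
    resources = ["compute_instances", "storage_buckets", "firewall_rules"]
    # Threads are sufficient for I/O-bound API requests within one project;
    # the GIL is released while waiting on the network.
    with ThreadPool(processes=len(resources)) as pool:
        return pool.map(fetch_resource, [(project_id, r) for r in resources])

if __name__ == "__main__":
    # For project-level parallelism, a multiprocessing.Pool could map
    # scan_project() over the project list; a plain loop is shown here.
    for project in ["project-a", "project-b"]:
        print(scan_project(project))
```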

peb-peb commented 1 year ago

> I'd go for multiprocessing ThreadPool for GCP resource requests and I'd do actual multiprocessing for project-based parallelism.

ok :+1:

peb-peb commented 1 year ago

I'll close this and send a new draft PR with the required changes.