google / gcp_scanner

A comprehensive scanner for Google Cloud
Apache License 2.0
305 stars · 95 forks

:recycle: refactor scanner.py to reduce repetitive if statements. #228

Closed: sudiptob2 closed this 1 year ago

sudiptob2 commented 1 year ago

closes #227 related to #153

I've created this PR to improve the way we express the idea. Most of the crawlers have already been implemented, but we still have a few more to go, specifically #229, #230, and #231.

However, we won't be able to completely remove the old crawl.py just yet. We need to have a discussion about the following points:

mentions: @mshudrak @0xDeva @peb-peb and others.

under-hill commented 1 year ago

What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.
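A minimal sketch of the proposal above, with the per-resource config passed as a third argument via `scan_config.get(config_setting, {})`. The interface shape, crawler class, and config keys here are illustrative assumptions, not the repo's actual code (the real interface lives in interface_crawler.py):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List


class ICrawler(ABC):
  """Interface sketch; parameter names are assumptions for illustration."""

  @abstractmethod
  def crawl(self, project_name: str, service: Any,
            config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Crawl one resource type, with resource-specific settings."""


class StorageBucketsCrawler(ICrawler):
  """Hypothetical crawler showing how resource-specific config is read."""

  def crawl(self, project_name, service, config):
    # e.g. whether to also fetch bucket IAM policies (made-up setting)
    fetch_iam = config.get("fetch_iam_policies", False)
    return [{"project": project_name, "fetch_iam": fetch_iam}]


scan_config = {"storage_buckets": {"fetch_iam_policies": True}}
crawler = StorageBucketsCrawler()
# The caller passes scan_config.get(config_setting, {}) as the third argument:
results = crawler.crawl("my-project", service=None,
                        config=scan_config.get("storage_buckets", {}))
```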

under-hill commented 1 year ago

For GKE we can move the specific credential logic into a gke_client.py's get_service(), although this is playing a bit fast & loose with the IClient interface's stated contract for get_service
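One way the GKE suggestion above could look: a gke_client.py whose get_service() hides the GKE-specific credential handling behind the common client interface. This is a hedged sketch with assumed names; the real code would construct container_v1.services.cluster_manager.client.ClusterManagerClient, which is replaced with a placeholder here to keep the sketch self-contained:

```python
from abc import ABC, abstractmethod
from typing import Any


class IClient(ABC):
  """Client interface sketch; names are assumptions, not the repo's code."""

  @abstractmethod
  def get_service(self, credentials: Any) -> Any:
    """Return a service/client object for one GCP API."""


class GkeClient(IClient):
  """Hypothetical gke_client.py: keeps GKE credential logic in one place."""

  def get_service(self, credentials):
    # In the real implementation this would return a
    # ClusterManagerClient built from `credentials`; a plain dict
    # stands in for it so the sketch runs without the GKE library.
    return {"client": "ClusterManagerClient", "credentials": credentials}


service = GkeClient().get_service(credentials="fake-credentials")
```

As noted in the comment above, this bends the IClient contract slightly, since get_service would return a typed GKE client rather than a discovery.Resource.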

sudiptob2 commented 1 year ago

> What if we add scan_config.get(config_setting, {}) as a third parameter to ICrawler.crawl, allowing each crawler implementation to have access to resource-specific information from the scan_config? This should help with "storage_buckets" and others that may want more specific config settings in the future.

Makes sense :+1: Should we make the third config parameter optional? There might be places that just want to call the crawl method.

under-hill commented 1 year ago

> Makes sense 👍 Should we make the third config parameter optional? There might be places that just want to call the crawl method.

Yep, optional makes sense.
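The optional third parameter agreed on above could take this shape: a default of None, normalized to an empty dict, so existing two-argument call sites keep working. Class and config names are hypothetical:

```python
from typing import Any, Dict, List, Optional


class ComputeInstancesCrawler:
  """Hypothetical crawler; shows only the optional-parameter shape."""

  def crawl(self, project_name: str, service: Any,
            config: Optional[Dict[str, Any]] = None) -> List[Dict[str, Any]]:
    config = config or {}  # callers that omit the config still work
    return [{"project": project_name, "config_keys": sorted(config)}]


crawler = ComputeInstancesCrawler()
# Existing call sites can keep calling with two arguments:
no_cfg = crawler.crawl("my-project", service=None)
# Crawlers that need resource-specific settings receive them explicitly:
with_cfg = crawler.crawl("my-project", service=None,
                         config={"fetch_iam_policies": True})
```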

mshudrak commented 1 year ago

Ok, let's discuss the questions one by one, starting with the last one.

> the signature of the project_list crawler does not match with the interface, how to refactor this?

Could you please elaborate on that? Do you mean that def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match the others, or something else?

mshudrak commented 1 year ago

I have fallen a bit behind on the refactoring; apologies if this was already discussed.

> We couldn't include the storage_bucket and gke related crawler in the double loop method. Any ideas on how to handle that?

The optional third parameter seems to make sense here. As for GKE, yeah, one option is to move this logic into get_service() (obtaining the gke_client: container_v1.services.cluster_manager.client.ClusterManagerClient for get_gke_clusters).

mshudrak commented 1 year ago

> Where should we put the remaining pieces of code from crawl.py?

So, if we address storage and GKE, we have the following functions left:

- def infinite_defaultdict(): I'd just move that into scanner.py for now, or create another misc.py file and move models.py into misc.py too. There are some functions in scanner.py that could be moved there as well.
- def get_sas_for_impersonation(): is #231 addressing it?
- def get_service_accounts: is #231 addressing it?
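For context, infinite_defaultdict is the standard recursive-defaultdict pattern (a small, self-contained helper, which is why it can live in scanner.py or a future misc.py equally well):

```python
import collections


def infinite_defaultdict():
  """Nested defaultdict of arbitrary depth: missing keys create new levels."""
  return collections.defaultdict(infinite_defaultdict)


d = infinite_defaultdict()
# Intermediate levels are created on demand, no KeyError at any depth:
d["a"]["b"]["c"] = 1
```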

sudiptob2 commented 1 year ago

> Ok, let's discuss questions one by one. Starting with the last one. the signature of the project_list crawler does not match the interface, how to refactor this? Could you please elaborate on that? Do you mean def crawl(self, service: discovery.Resource) -> List[Dict[str, Any]]: here does not match others or something else?

Hi @mshudrak, thanks for your response :pray: You might already be up to speed with the conversation, but I'll attempt to provide further clarification in this comment. Hopefully it will be helpful :muscle:

The Crawler interface is defined in interface_crawler.py. Now, if we look at get_bucket_names, get_gke_clusters, get_gke_images, and get_sas_for_impersonation in crawl.py, these methods have slightly different signatures than the interface. So that is what we are discussing :slightly_smiling_face:
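To illustrate the mismatch being discussed (signatures paraphrased from this thread, not copied from the repo): the interface expects a crawl method taking a service object, while the leftover crawl.py functions are free functions with different inputs, such as a project name and raw access token. discovery.Resource is replaced by Any and the bodies are placeholders, purely to keep the sketch runnable:

```python
from typing import Any, Dict, List


class ProjectListCrawler:
  """Matches the interface shape quoted earlier in the thread."""

  def crawl(self, service: Any) -> List[Dict[str, Any]]:
    return []  # placeholder body


# By contrast, a crawl.py-style free function takes project/token inputs
# rather than a ready-made service object (hypothetical body):
def get_gke_images(project_name: str, access_token: str) -> Dict[str, Any]:
  return {"project": project_name, "token": access_token}
```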

If we can address the above-mentioned functions, the only leftover will be def infinite_defaultdict(), and moving it into misc.py makes sense. But if we want to make it quick for now, we can move it to scanner.py.

#231 addresses get_service_accounts. My plan is to discuss and refactor the ones with different signatures (including get_sas_for_impersonation) in this PR. We can create subtasks if required.

mshudrak commented 1 year ago

Thanks for the explanation. All right, I think I understood it correctly. It makes sense to have an optional flag for the storage-related calls and a specific implementation in get_service for get_gke_clusters. As for get_gke_images, I think we can easily refactor it to receive credentials instead of access_token; the access token can easily be fetched from the credentials. We actually do that here https://github.com/google/gcp_scanner/blob/ba8dd9a4533381bf0e5f16889ea7d27946d044d2/src/gcp_scanner/scanner.py#L327 before passing it into this function. get_sas_for_impersonation is a bit different: it might not actually be a good fit for the crawler interface, since it is just parsing iam_policy. The best place would be the future misc.py, but for now we can move it into scanner.py.
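The get_gke_images change suggested above could look like this: the function receives the credentials object and derives the token itself (google.auth credentials expose the current access token as the `.token` attribute once refreshed). The body is a placeholder and the SimpleNamespace stand-in is just for the sketch:

```python
import types
from typing import Any, Dict


def get_gke_images(project_name: str, credentials: Any) -> Dict[str, Any]:
  # Derive the access token from the credentials instead of taking it
  # as a separate argument; callers no longer need to extract it.
  access_token = credentials.token
  return {"project": project_name, "token_used": access_token}


# Stand-in credentials object for the sketch:
fake_credentials = types.SimpleNamespace(token="ya29.fake-token")
result = get_gke_images("my-project", fake_credentials)
```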

sudiptob2 commented 1 year ago

@under-hill, as I was refactoring get_bucket_names(), I discovered that the return type of this method is different, violating the interface contract. I'm unsure whether to keep it within the double loop or relocate it to misc.py and call it outside of the double loop.

Edit: I have created this PR https://github.com/google/gcp_scanner/pull/234 with a possible refactoring attempt. Please take a look @under-hill @0xDeva @mshudrak

mshudrak commented 1 year ago

@sudiptob2 I'd relocate it to misc.py.

Edit: Ok, I see you created a separate PR with refactoring. No strong preference on that. Let's keep it as you implemented, then.

sudiptob2 commented 1 year ago

Hello everyone, this PR is getting quite large. As part of my work, I have moved the GKE crawlers to misc_crawler.py and relocated other helper methods to scanner.py.

At this point, I believe it's appropriate to mark this PR as ready for review. I would love to hear your thoughts on this approach. If there are no bugs and it gets merged, we can address any refinement suggestions through separate tickets. What do you think?

mentions: @mshudrak @under-hill @0xDeva

mshudrak commented 1 year ago

Makes sense, I will review this one :)