graph --create command hanging

beacomni commented 4 years ago

graph --create churns along for about an hour, printing progress output, and then hangs. The last line of output before the hang is:

Found new edge: role/aws-service-role/robomaker.amazonaws.com/AWSServiceRoleForRoboMaker can use Lambda to create a new function with arbitrary code, then pass and access role/service-role/[one of our roles]

Also, when trying to run pmapper repl, I get: Did not find file at: /Users/beacomni/Library/Application Support/com.nccgroup.principalmapper/947682355454

There are many lines of "Found new edge". Is there some upper bound on edges that pmapper can handle? Is there a way to get log output that might suggest why it is hanging?

Thanks.

ncc-erik-steringer commented 4 years ago

Hi @beacomni ,

A couple questions to start:

Is this the master branch, v1.1.0-dev branch, or PyPI installation code?
What scale of Lambda functions and Lambda Execution Roles (IAM Roles that trust lambda.amazonaws.com) are in the account? A power of 10 (1, 10, 100, 1000, etc.) is fine.

I suspect, based on the last line of output, that it's taking a while to chew through the Lambda functions in the account. I have a few suggestions in the meantime:

Install and try the code from v1.1.0-dev if you aren't already.
Run with the --debug parameter to get debug output (you'll likely see a flurry of output during this stage).
Create a script that uses the principalmapper library to gather the graph together, but exclude the lambda service.

ncc-erik-steringer commented 3 years ago

Hi @beacomni : wanted to check in. Do you have any follow-up questions? Were the previous suggestions helpful?

Clete2 commented 3 years ago

I'm having the same issue, that it hangs on generating edges for Lambda:

2021-08-24 11:37:36+0000 | Found new edge: role/<snip> can update the trust document to access role/<snip>

2021-08-24 11:37:36+0000 | Pulling data on Lambda functions
2021-08-24 11:37:51+0000 | Generating Edges based on Lambda data.

It's been on that line since 8/24. The date is now 8/30. We have a big account with 1780 Lambdas currently. On a smaller account this step took a while but we extrapolated that analyzing this many Lambdas should have only taken a few hours. Yet, a single CPU has been pegged for 6 days.

We are on version 1.1.3 from pip. Adding debug doesn't give us any additional information.

I am running pmapper without Lambda now, but that removes a large benefit of pmapper. @ncc-erik-steringer - Do you have any advice?

ncc-erik-steringer commented 3 years ago

Reopening this issue in light of @Clete2 's comment.

Current code in master branch: https://github.com/nccgroup/PMapper/blob/722efec4f7c89a5a440facd8a5bd055c88db7194/principalmapper/graphing/lambda_edges.py#L72

Current algorithm is:

Pull and collate a list of Lambda functions from each region
For each possible "destination" Node:
- Verify that it's an IAM Role that can be assumed by lambda.amazonaws.com (otherwise continue)
- For each possible "source" Node:
  - Check and store info on if the source node can pass the destination node to Lambda (simulation)
  - Check if the source node can create a Lambda function: if so, and can pass the role, create an Edge (simulation)
  - Create a copy of the list of Lambda functions and append data on if the source node can edit the function's code or function's configuration (two simulations)
  - For each item in the list-copy:
    - Check if the function's execution role is the destination role (otherwise continue)
    - If the source node can edit the function's code (per earlier check) then create an Edge and break
  - For each item in the list-copy:
    - If the source node can edit the function's code, edit the function's configuration, and pass the destination role, then create an Edge and break

The time-consuming stuff is probably the simulations. The number of those we currently do (worst-case) is:

LR = number of roles that Lambda can assume
LF = number of Lambda functions
N = number of IAM Users and Roles in the account

LR * (2N + (2N * LF))

Meaning an account with 100 users and roles, 8 of which are assumable roles for Lambda, and 1000 Lambda functions will result in ~1.6M simulation calls. If our simulator can process 2 simulations/sec (need to benchmark to see what this actually looks like, I could see the ReadOnlyAccess policy being slow to process for PMapper) then that'd be 9.26 days to finish.
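The estimate above can be reproduced with a short calculation. This is a minimal sketch of the worst-case formula only; the function names are made up for illustration, and the 2 simulations/sec rate is the unbenchmarked guess from this comment, not a measured figure.

```python
SECONDS_PER_DAY = 86_400


def worst_case_simulations(lambda_roles: int, principals: int, functions: int) -> int:
    """Worst-case simulation count: LR * (2N + (2N * LF))."""
    per_role = 2 * principals + 2 * principals * functions
    return lambda_roles * per_role


def estimated_days(simulations: int, rate_per_second: float = 2.0) -> float:
    """Convert a simulation count to days at an assumed (unbenchmarked) rate."""
    return simulations / rate_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # The example from the comment: LR = 8, N = 100, LF = 1000.
    total = worst_case_simulations(lambda_roles=8, principals=100, functions=1000)
    print(f"{total:,} simulations, about {estimated_days(total):.1f} days")
```

Because LF multiplies both N and LR, doubling the number of functions roughly doubles the total, and accounts that create one role per function grow close to quadratically.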

A couple things I'll look at:

Adding a check to see if "source" can call the edit/reconfigure Lambda operations for any function (*) to short-circuit the 2N * LF checks
Skipping edit/reconfigure edges if the caller can do the "create a function and pass the role" approach
Benchmarking simulation rates across different actions/resources/policies
Adding an LRU cache at the policy-layer (i.e. effect/action/resource/conditions -> policy has matching statement) to reduce the impact of larger policies during simulation
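To illustrate the last idea in the list above, here is a rough sketch of memoizing a "does this policy have a matching statement" check with functools.lru_cache. It is not PMapper code: the Statement and Policy shapes and the function name are hypothetical, conditions are left out entirely, and fnmatch is only an approximation of IAM wildcard matching (it also treats [...] as a character class, which IAM does not).

```python
from fnmatch import fnmatchcase
from functools import lru_cache
from typing import Tuple

# Hypothetical shapes: (effect, action patterns, resource patterns).
# Tuples are used so that a policy is hashable and can serve as a cache key.
Statement = Tuple[str, Tuple[str, ...], Tuple[str, ...]]
Policy = Tuple[Statement, ...]


@lru_cache(maxsize=4096)
def policy_has_matching_statement(policy: Policy, effect: str, action: str, resource: str) -> bool:
    """Return True if any statement with the given effect matches the action and resource."""
    for stmt_effect, action_patterns, resource_patterns in policy:
        if stmt_effect != effect:
            continue
        # IAM action names are case-insensitive; resources are compared as-is here.
        action_ok = any(fnmatchcase(action.lower(), p.lower()) for p in action_patterns)
        resource_ok = any(fnmatchcase(resource, p) for p in resource_patterns)
        if action_ok and resource_ok:
            return True
    return False


if __name__ == "__main__":
    read_only_like: Policy = (("Allow", ("lambda:Get*", "lambda:List*"), ("*",)),)
    arn = "arn:aws:lambda:us-east-1:000000000000:function:example"
    for _ in range(3):
        policy_has_matching_statement(read_only_like, "Allow", "lambda:GetFunction", arn)
    # Repeated identical questions are answered from the cache after the first call.
    print(policy_has_matching_statement.cache_info())
```

The cache only pays off when the same (policy, effect, action, resource) question recurs, which is plausible when one large managed policy is attached to many principals.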

Clete2 commented 3 years ago

@ncc-erik-steringer thanks for the comment. I'll adapt your math to my account.

LR * (2N + (2N * LF))
1780 * (2 * 2872 + (2 * 2872 * 1780)) <-- I assume the # of assumable Lambda roles ~= the number of Lambdas. We generally develop one role per Lambda
1780 * (5744 + (5744 * 1780))
1780 * (5744 + 10224320)
1780 * 10230064
18,209,513,920

18,209,513,920 / 2 simulations per second = 9,104,756,960 seconds
9,104,756,960 / 60 = 151,745,949.33333334 minutes
151,745,949.33333334 / 60 = 2,529,099.1555555556 hours
2,529,099.1555555556 / 24 = 105,379.1314814815 days
105,379.1314814815 / 365.2425 = 288.5182624735 years

Yes, I know our account is way too big.

One thing I am worried about is that my access token expires every hour: once it's done calculating Lambda edges, and we're all 6 feet under, it'll move on to the next step, try to call AWS APIs, and fail due to an expired token.

ncc-erik-steringer commented 3 years ago

Just pushed 725c05d6aa331c880338b910ad5ed64df29092f2 to the v1.1.4-dev branch with some of the work to help cut down on Lambda authorization simulation calls. I suspect it isn't going to help your case out very much but it's worth a try in the meantime.

Looking at the caching options, it turns out it'll take some breaking changes to add some LRU caches like I wanted. Maybe for v2.0.0 some day.

For IAM Roles, you may be able to configure a profile with the AWS CLI using credential_process (https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sourcing-external.html) that might refresh? Otherwise you may need to write some code and use the botocore library's RefreshableCredentials class https://github.com/boto/botocore/blob/4927d3e3baa6a00226a1f017b638807beeb613a0/botocore/credentials.py#L368 .
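For the credential_process route, the helper program has to print a small JSON document to stdout. The sketch below shows that output shape using only the standard library. How the short-lived credentials are actually obtained is site-specific, so fetch_temporary_credentials is a hypothetical placeholder, and whether a long-running pmapper session picks up the refreshed credentials is an assumption that would need testing.

```python
import json
from datetime import datetime, timezone


def credential_process_payload(access_key: str, secret_key: str,
                               session_token: str, expires_at: datetime) -> str:
    """Render credentials in the JSON shape that credential_process expects on stdout.

    expires_at must be timezone-aware; it is emitted as an ISO 8601 UTC timestamp.
    """
    return json.dumps({
        "Version": 1,
        "AccessKeyId": access_key,
        "SecretAccessKey": secret_key,
        "SessionToken": session_token,
        "Expiration": expires_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    })


def fetch_temporary_credentials():
    """Hypothetical placeholder: return (access_key, secret_key, token, expires_at)
    from whatever system issues your hourly tokens."""
    raise NotImplementedError


def main() -> None:
    access_key, secret_key, token, expires_at = fetch_temporary_credentials()
    print(credential_process_payload(access_key, secret_key, token, expires_at))
```

In a real helper script, main() would be called under an `if __name__ == "__main__":` guard, and the profile in the AWS config file would point at it with a line such as `credential_process = /path/to/helper` (path is illustrative). The Expiration field is what tells the SDK the credentials are temporary and when to run the helper again.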

Clete2 commented 3 years ago

Thanks for the new code. I have pulled it down and started running it on a c5.large instance. I'll leave this running for a week or so and let you know if it finishes or if I kill it.

For credentials, I broke down and created a read-only user with static creds.

Clete2 commented 3 years ago

@ncc-erik-steringer it finished! On Sept 17th.

Since 9/17 I have been running the visualize command and it's still chugging.

ncc-erik-steringer commented 3 years ago

So roughly 16 days to complete? That's actually faster than I anticipated even with the changes. Thank you for being able to put PMapper through the stress test.

For visualization, PMapper basically sends the data to graphviz and waits for it to do the job. However, you may be able to output to a different filetype (.graphml) and use another renderer that works faster (Gephi or Cytoscape). That's controlled with the --filetype param.

Clete2 commented 3 years ago

Thanks for the tip. I just started the following command:

pmapper --account <myacct> visualize --filetype graphml --only-privesc

The png command has been running since the 17th and I did not kill it. Will update once I make some progress.

Clete2 commented 3 years ago

All done!

The png command failed because dot was not on the path.

The graphml one worked and I'm playing around with it in Gephi. Thanks very much for your support.
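Since the png run only failed on the missing dot binary after a long wait, a quick check before starting a render can save time. This is a small standalone sketch, not part of PMapper; the function name is made up for illustration.

```python
import shutil


def graphviz_dot_available() -> bool:
    """Return True if a `dot` executable can be found on PATH."""
    return shutil.which("dot") is not None


if __name__ == "__main__":
    if graphviz_dot_available():
        print("dot found on PATH")
    else:
        print("dot not found on PATH; install Graphviz or use a non-Graphviz filetype")
```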

Yashvendra commented 1 year ago

Hey @ncc-erik-steringer,

Skipping edit/reconfigure edges if the caller can do the "create a function and pass the role" approach

I'm having trouble understanding why this edge creation is skipped, since it can indicate another possible privilege escalation path. I get that we are trying to reduce time complexity, but skipping checks will reduce detection of the attack surface, don't you think? Or am I missing something here? Would you mind explaining?

nccgroup / PMapper

graph --create command hanging #55