jfrog / jfrog-client-go

All go clients for JFrog products
Apache License 2.0

SNAT exhaustion on Azure / Github Actions #195

Open Lykathia opened 4 years ago

Lykathia commented 4 years ago

Hey all,

It appears that due to the leaveBodyOpen style functions, ports aren't being reused, causing SNAT port exhaustion when a large number of requests are run on AzDO or GHA.

Specifically, I came across this while trying to run an npm install from the jfrog cli, with build metadata enabled. Since an individual AQL call is made for every single dependency to retrieve its checksum, it quickly saturates the ports and causes a TCP timeout.

I had looked at fixing this by writing 'better' AQL in the jfrog CLI, but that appears to cause a segfault on the Artifactory server side due to the payload being too large. So instead I started tracking down the network calls, and was led here.

A brief search through the git history didn't appear to turn up anything -- I'm wondering what the use case for the 'leave body open' connections is? An easy fix would be to not use them; this is certainly a hindrance for any npm-style build with build-info enabled.
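
For context on the mechanism: Go's http.Transport can only return a keep-alive connection to its pool once the response body has been read to EOF and closed; otherwise each request opens a fresh source port, which is exactly what the SNAT allocation on hosted agents runs out of. A minimal sketch of the pattern, using plain net/http rather than the actual jfrog-client-go calls (the URL below is a placeholder):

package main

import (
    "io/ioutil"
    "log"
    "net/http"
)

func fetch(client *http.Client, url string) ([]byte, error) {
    resp, err := client.Get(url)
    if err != nil {
        return nil, err
    }
    // Reading the body to EOF and closing it is what allows the Transport
    // to return the connection to its keep-alive pool instead of burning a
    // new source port for the next request.
    defer resp.Body.Close()
    return ioutil.ReadAll(resp.Body)
}

func main() {
    client := &http.Client{}
    // Placeholder URL; the real traffic is the per-dependency AQL requests.
    if _, err := fetch(client, "https://example.jfrog.io/artifactory/api/search/aql"); err != nil {
        log.Println(err)
    }
}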

eyalbe4 commented 4 years ago

@Lykathia, thanks for the deep drill-down into the code! I think you are correct and may have found the root cause of the issue. It seems to be related to a massive refactor which both the jfrog-client-go and jfrog-cli projects went through recently. The fix should be in the jfrog-cli code base only though, because that is where the response body should be closed. The reason the body is left open is to allow the API which sent the request to read the body and then close it. I suspect the code with the missing Close() is below. Notice the defer stream.Close() - it is missing from the latest released code. I added it below to show what should be fixed. I think this should resolve the issue.

// Creates a function that fetches dependency data from Artifactory. Can be applied from a producer-consumer mechanism
func (nca *NpmCommandArgs) createGetDependencyInfoFunc(servicesManager *artifactory.ArtifactoryServicesManager) getDependencyInfoFunc {
    return func(dependencyIndex string) parallel.TaskFunc {
        return func(threadId int) error {
            name := nca.dependencies[dependencyIndex].name
            ver := nca.dependencies[dependencyIndex].version
            log.Debug(clientutils.GetLogMsgPrefix(threadId, false), "Fetching checksums for", name, "-", ver)
            stream, err := servicesManager.Aql(serviceutils.CreateAqlQueryForNpm(name, ver))
            if err != nil {
                return err
            }
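            // This defer was missing in the released code; closing the stream
            // releases the underlying connection so it can be reused.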
            defer stream.Close()
            result, err := ioutil.ReadAll(stream)
            if err != nil {
                return err
            }
            parsedResult := new(aqlResult)
            if err = json.Unmarshal(result, parsedResult); err != nil {
                return errorutils.CheckError(err)
            }
            if len(parsedResult.Results) == 0 {
                log.Debug(clientutils.GetLogMsgPrefix(threadId, false), name, "-", ver, "could not be found in Artifactory.")
                return nil
            }
            nca.dependencies[dependencyIndex].artifactName = parsedResult.Results[0].Name
            nca.dependencies[dependencyIndex].checksum =
                &buildinfo.Checksum{Sha1: parsedResult.Results[0].Actual_sha1, Md5: parsedResult.Results[0].Actual_md5}
            log.Debug(clientutils.GetLogMsgPrefix(threadId, false), "Found", parsedResult.Results[0].Name,
                "sha1:", parsedResult.Results[0].Actual_sha1,
                "md5", parsedResult.Results[0].Actual_md5)
            return nil
        }
    }
}

It appears that this issue requires an immediate fix. We will issue a patch soon!

Or-Geva commented 4 years ago

@Lykathia, JFrog CLI 1.39.3 has been released and includes a fix for this issue. Feel free to update your JFrog CLI and share your feedback.

Lykathia commented 4 years ago

@Or-Geva this does not fix the issue; ports are still held open, exhausting all available ports and resulting in a TCP timeout.

@eyalbe4 I think I would have looked at fixing the code here, using a functional closure to do the work inside the scope of the function being executed, instead of pushing responsibility for closing the body back up the stack. Or alternatively, doing an IO copy, closing the connection, and passing the data back up the stack. Dealing with all the pointers/references/threads across multiple Go libraries seems like a more difficult problem than I would want to tackle! :D
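
Roughly what I mean, as a hypothetical sketch rather than the real services-manager API (aqlWith, aqlClient and fakeClient are made-up names; only the io.ReadCloser shape of Aql is taken from the snippet above):

package main

import (
    "fmt"
    "io"
    "io/ioutil"
    "strings"
)

// aqlClient is a stand-in for the real services manager; only the shape of
// Aql returning an io.ReadCloser is assumed from the snippet above.
type aqlClient interface {
    Aql(query string) (io.ReadCloser, error)
}

// aqlWith keeps the stream inside the client: it runs the query, hands the
// open stream to the caller's closure, and always closes the body itself so
// the underlying connection can be reused.
func aqlWith(c aqlClient, query string, consume func(io.Reader) error) error {
    stream, err := c.Aql(query)
    if err != nil {
        return err
    }
    defer stream.Close()
    return consume(stream)
}

// fakeClient lets the sketch run without a real Artifactory instance.
type fakeClient struct{}

func (fakeClient) Aql(query string) (io.ReadCloser, error) {
    return ioutil.NopCloser(strings.NewReader(`{"results":[]}`)), nil
}

func main() {
    err := aqlWith(fakeClient{}, `items.find({"name":"example.tgz"})`, func(r io.Reader) error {
        body, readErr := ioutil.ReadAll(r)
        if readErr != nil {
            return readErr
        }
        fmt.Println(string(body))
        return nil
    })
    if err != nil {
        fmt.Println("error:", err)
    }
}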

Lykathia commented 4 years ago

Some different thoughts that floated to mind: batching queries to make fewer roundtrips, or adding retries on failure.

It doesn't solve the TCP issue directly, but it does increase the overall robustness of the functionality in other ways (and would mitigate the possibility of SNAT exhaustion).

eyalbe4 commented 4 years ago

@Lykathia, Can you please confirm that you are using version 1.39.3 by running jfrog -v? Also, does this issue disappear with version 1.38.4?

Lykathia commented 4 years ago

@eyalbe4 yep, I confirmed it was 1.39.3 before posting. But to complete the circle:

(screenshot of jfrog -v output)

Versions I have observed this issue with: 1.35.5, 1.37.1, 1.38.2, 1.38.4, 1.39.3

eyalbe4 commented 4 years ago

Thanks for this information @Lykathia. This means the issue is not related to the recent changes. We'll try reproducing this with GitHub Actions. Are you using a hosted agent? Which type?

Lykathia commented 4 years ago

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2

      - name: Setup Jfrog CLI
        uses: jfrog/setup-jfrog-cli@v1
        with:
          VERSION: 1.39.3
        env:
          JF_ARTIFACTORY_1: ${{ secrets.ARTIFACTORY_CONFIG }}

      - name: Install Dependencies
        run: |
          jfrog rt npmc --server-id-resolve ${{ env.ARTIFACTORY_SERVER }} --server-id-deploy ${{ env.ARTIFACTORY_SERVER }} --repo-resolve npm --repo-deploy ${{ env.ARTIFACTORY_BUCKET }}
          jfrog rt npmci
        env:
          JFROG_CLI_BUILD_NAME: ${{ github.repository }}

Or-Geva commented 4 years ago

@Lykathia, I couldn't reproduce your issue on this project with your workflow. It would help if you could share your project with me, so we can get to the root cause of this issue.

Lykathia commented 4 years ago

Can do. It's a long weekend here at the moment, and I'll have to sanitize a bunch of stuff out of the repo, so it will be mid-week sometime. I do have a tcpdump of the run as well, if that helps.

Lykathia commented 4 years ago

Sorry for the delay; the reproduction repo is here.

EDIT: If it matters, the secret is a regular user/API key, entered as follows:

bash-4.4# jfrog rt c --interactive=true
Artifactory server ID: bt
JFrog Artifactory URL: https://artifactory.dev.beyondtrust.com/artifactory/
JFrog Distribution URL (Optional): 
Access token (Leave blank for username and password/API key): 
User: <redacted>
Password/API key: 
Replace username and password/API key with automatically created access token that’s refreshed hourly? (y/n): n
Is the Artifactory reverse proxy configured to accept a client certificate? (y/n): n

[Info] Encrypting password...
bash-4.4# jfrog rt c export bt
Or-Geva commented 3 years ago

Thanks for sharing your repo with me. I have been trying to reproduce this on GitHub Actions and Azure DevOps using my SaaS Artifactory, but all my tests were successful. Moreover, while reviewing your GitHub Actions output I noticed a timeout error that may be related to our retry mechanism. When a request fails, the retry mechanism kicks in and sends the same request again and again (up to 3 times), or, in your case, potentially hundreds of retry requests across all the dependencies. If that is what happened, the network was flooded and the whole process failed because of the massive number of retries. The best way to check this is to add --threads=1 to the npmi / npmci commands, which will slow the process down and limit the number of concurrent requests (and retries). If this solution works, we will try to lower the total number of requests being sent. Looking forward to hearing from you!
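
For example, in the workflow above that would mean running something like the following (only the --threads flag added, everything else unchanged):

jfrog rt npmci --threads=1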

Lykathia commented 3 years ago

Didn't seem to resolve it: https://github.com/Lykathia/snat-reproduction/runs/1132103019?check_suite_focus=true

Perhaps there is a weird interplay when a custom domain is being used? Although that seems pretty crazy to me. The tcpdump logs do point towards port exhaustion, and I'm getting consistent reproduction across repositories... perhaps it is a server issue instead :/

Should I bundle all this stuff up and toss it towards support?

eyalbe4 commented 3 years ago

Here's what I suggest, @Lykathia: we will improve the design and reduce the total number of requests sent to Artifactory by batching multiple npm packages into a single request. This is probably a good thing to do anyway. Once that's done, we'll see how the change affects your environment. If it doesn't resolve the issue, we may ask for the Support Team's assistance. Having fewer requests sent should make troubleshooting easier anyway.
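
To make the idea concrete, here is a rough sketch of what a batched lookup could look like, using AQL's $or operator. The @npm.name/@npm.version property names and the helper itself are illustrative only, not the final implementation; the included fields match the ones the current code already reads (name, actual_sha1, actual_md5).

package main

import (
    "fmt"
    "strings"
)

type npmPackage struct {
    name    string
    version string
}

// buildBatchedNpmAql combines many per-package lookups into a single AQL
// query by OR-ing the criteria, so one request replaces N requests.
func buildBatchedNpmAql(pkgs []npmPackage) string {
    clauses := make([]string, 0, len(pkgs))
    for _, p := range pkgs {
        clauses = append(clauses,
            fmt.Sprintf(`{"@npm.name":%q,"@npm.version":%q}`, p.name, p.version))
    }
    return fmt.Sprintf(
        `items.find({"$or":[%s]}).include("name","actual_sha1","actual_md5")`,
        strings.Join(clauses, ","))
}

func main() {
    query := buildBatchedNpmAql([]npmPackage{
        {name: "lodash", version: "4.17.20"},
        {name: "react", version: "16.13.1"},
    })
    fmt.Println(query)
}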

Lykathia commented 3 years ago

@eyalbe4 any updates on this, by chance?

Lykathia commented 3 years ago

Just tested against 1.44.0 -- still an issue.