Lykathia opened this issue 4 years ago
@Lykathia,
Thanks for the deep drill down into the code!
I think you are correct and may have found the root cause of the issue. The issue seems to be related to a massive refactor which both the jfrog-client-go and jfrog-cli projects went through recently. The fix for this issue should be in the jfrog-cli code base only though, because the body of the request should probably be closed there. The reason the body is left open is to allow the API which sent the request to read the body and then close it.
I suspect the code with the missing close() is below. Notice the defer stream.Close() - it is missing from the latest released code. I added it below to show what should be fixed. I think this should fix the issue.
// Creates a function that fetches dependency data from Artifactory. Can be applied from a producer-consumer mechanism
func (nca *NpmCommandArgs) createGetDependencyInfoFunc(servicesManager *artifactory.ArtifactoryServicesManager) getDependencyInfoFunc {
	return func(dependencyIndex string) parallel.TaskFunc {
		return func(threadId int) error {
			name := nca.dependencies[dependencyIndex].name
			ver := nca.dependencies[dependencyIndex].version
			log.Debug(clientutils.GetLogMsgPrefix(threadId, false), "Fetching checksums for", name, "-", ver)
			stream, err := servicesManager.Aql(serviceutils.CreateAqlQueryForNpm(name, ver))
			if err != nil {
				return err
			}
			defer stream.Close()
			result, err := ioutil.ReadAll(stream)
			if err != nil {
				return err
			}
			parsedResult := new(aqlResult)
			if err = json.Unmarshal(result, parsedResult); err != nil {
				return errorutils.CheckError(err)
			}
			if len(parsedResult.Results) == 0 {
				log.Debug(clientutils.GetLogMsgPrefix(threadId, false), name, "-", ver, "could not be found in Artifactory.")
				return nil
			}
			nca.dependencies[dependencyIndex].artifactName = parsedResult.Results[0].Name
			nca.dependencies[dependencyIndex].checksum =
				&buildinfo.Checksum{Sha1: parsedResult.Results[0].Actual_sha1, Md5: parsedResult.Results[0].Actual_md5}
			log.Debug(clientutils.GetLogMsgPrefix(threadId, false), "Found", parsedResult.Results[0].Name,
				"sha1:", parsedResult.Results[0].Actual_sha1,
				"md5", parsedResult.Results[0].Actual_md5)
			return nil
		}
	}
}
It appears that this issue requires an immediate fix. We will issue a patch soon!
@Lykathia, JFrog-CLI 1.39.3 is released and includes a fix for this issue. Feel free to update your JFrog-CLI and share your feedback.
@Or-Gabay this does not fix the issue. Ports are still held open, with exhaustion of all the ports resulting in a TCP timeout.
@eyalbe4 I think I would have looked at fixing the code here, using a functional closure to do the work inside the scope of the function being executed, instead of pushing responsibility for closing the stream back up the stack. Or alternatively, doing an IO copy, closing the connection, and passing the data back up the stack (see the sketch below). Dealing with all the pointers/references/threads across multiple Go libraries seems like a more difficult problem than I would want to tackle! :D
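To make the two alternatives concrete, here is a minimal sketch. The names (withAqlStream, readAndClose) and signatures are hypothetical and are not part of the real jfrog-client-go API; this only illustrates the shape of the idea.

	package example

	import (
		"io"
		"io/ioutil"
	)

	// Option 1: a functional closure - the caller never sees the stream, so it
	// cannot forget to close it. The stream is opened, handed to the callback,
	// and always closed before this function returns.
	func withAqlStream(open func() (io.ReadCloser, error), handle func(io.Reader) error) error {
		stream, err := open()
		if err != nil {
			return err
		}
		defer stream.Close() // closing happens inside the scope of this call
		return handle(stream)
	}

	// Option 2: copy the body into memory, close the connection immediately,
	// and pass the bytes back up the stack.
	func readAndClose(open func() (io.ReadCloser, error)) ([]byte, error) {
		stream, err := open()
		if err != nil {
			return nil, err
		}
		defer stream.Close()
		return ioutil.ReadAll(stream)
	}

Either way the stream can never escape unclosed, so each response's connection can be returned to the pool.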
Some different thoughts that floated to mind: batching queries to make fewer roundtrips, or putting in retries on failure.
Doesn't solve the TCP stuff directly, but does increase the overall robustness of the functionality in other ways (that would mitigate SNAT exhaustion possibilities).
@Lykathia,
Can you please confirm that you are using version 1.39.3 by running jfrog -v?
Also, does this issue disappear with version 1.38.4?
@eyalbe4 yep, I confirmed it was 1.39.3 before posting. But to close the loop, the versions I have observed this issue with: 1.35.5, 1.37.1, 1.38.2, 1.38.4, 1.39.3
Thanks for this information @Lykathia. This means the issue is not related to the recent changes. We'll try reproducing this with GitHub Actions. Are you using a hosted agent? Which type?
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Setup Jfrog CLI
        uses: jfrog/setup-jfrog-cli@v1
        with:
          VERSION: 1.39.3
        env:
          JF_ARTIFACTORY_1: ${{ secrets.ARTIFACTORY_CONFIG }}
      - name: Install Dependencies
        run: |
          jfrog rt npmc --server-id-resolve ${{ env.ARTIFACTORY_SERVER }} --server-id-deploy ${{ env.ARTIFACTORY_SERVER }} --repo-resolve npm --repo-deploy ${{ env.ARTIFACTORY_BUCKET }}
          jfrog rt npmci
        env:
          JFROG_CLI_BUILD_NAME: ${{ github.repository }}
Can do. Long weekend here atm, and I'll have to sanitize a bunch of stuff out of the repo. So will be mid-week sometime. I do have a tcpdump of the run as well, if that helps.
Sorry for the delay, reproduction repo here
EDIT: If it matters, the secret is a regular user/apikey input as such
bash-4.4# jfrog rt c --interactive=true
Artifactory server ID: bt
JFrog Artifactory URL: https://artifactory.dev.beyondtrust.com/artifactory/
JFrog Distribution URL (Optional):
Access token (Leave blank for username and password/API key):
User: <redacted>
Password/API key:
Replace username and password/API key with automatically created access token that’s refreshed hourly? (y/n): n
Is the Artifactory reverse proxy configured to accept a client certificate? (y/n): n
[Info] Encrypting password...
bash-4.4# jfrog rt c export bt
Thanks for sharing your repo with me. I have been trying to reproduce this on my GitHub Actions / Azure DevOps setup using my SaaS Artifactory, but all my tests were successful.
Moreover, by reviewing your GitHub Actions output I noticed a timeout error, which can be related to our retry mechanism. When a request fails, the retry mechanism kicks in and tries to send the same request again and again (a maximum of 3 times), or in your case, hundreds of retry requests across all the dependencies.
If that is the case, the network was flooded and the whole process failed because of these massive retry requests.
For now, the best way to go about this is to add --threads=1 to the npmi / npmci commands, which will limit the retry mechanism and slow down the process. If this solution works, we will try to lower the total number of requests being sent.
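In the workflow shared above, that would mean changing the last line of the Install Dependencies step to something like:

	jfrog rt npmci --threads=1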
Looking forward to hearing from you!
Didn't seem to resolve it: https://github.com/Lykathia/snat-reproduction/runs/1132103019?check_suite_focus=true
Perhaps there is a weird interplay when a custom domain is being used? Altho that seems pretty crazy to me. The tcpdump logs do point towards port exhaustion, and I'm getting consistent replication across repositories... perhaps it is a server issue instead :/
Should I bundle all this stuff up and toss it towards support?
Here's what I suggest @Lykathia. We will improve the design and reduce the total number of requests sent to Artifactory. We'll do this by batching multiple npm packages into a single request (sketched below). This is probably a good thing to do anyway. Once done, we'll see how this change affects your environment. In case it doesn't resolve the issue, we may use the Support Team's assistance. Sending fewer requests should make troubleshooting easier anyway.
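A rough idea of what such batching could look like, as a hypothetical sketch only: the AQL field names and the eventual jfrog-cli implementation may differ.

	package example

	import (
		"fmt"
		"strings"
	)

	type npmPackage struct {
		Name    string
		Version string
	}

	// createBatchedAqlQuery builds a single items.find query that matches any of
	// the given name/version pairs, so one HTTP request can cover many packages
	// instead of one request per dependency.
	func createBatchedAqlQuery(packages []npmPackage) string {
		clauses := make([]string, 0, len(packages))
		for _, p := range packages {
			clauses = append(clauses,
				fmt.Sprintf(`{"$and":[{"@npm.name":"%s"},{"@npm.version":"%s"}]}`, p.Name, p.Version))
		}
		return fmt.Sprintf(
			`items.find({"$or":[%s]}).include("name","actual_sha1","actual_md5")`,
			strings.Join(clauses, ","))
	}

The batch size would presumably need to be capped (e.g. by chunking dependencies into groups), since very large AQL payloads have reportedly caused server-side failures, as noted in the original report below.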
@eyalbe4 any updates to this by chance?
Just tested against 1.44.0 -- still an issue.
Hey all,
It appears that due to the leaveBodyOpen style functions, ports aren't getting reused, causing SNAT port exhaustion when run with a large number of requests on AzDO or GHA.
Specifically, I came across this while trying to run an npm install from the jfrog cli, with build metadata enabled. Since an individual AQL call is made for every single dependency to retrieve its checksum, it quickly saturates the ports and causes a TCP timeout.
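For context, this is the standard net/http behavior being described here, shown as a minimal illustration (not the jfrog-client-go code): the Transport can only return a keep-alive connection to its idle pool once the response body is fully read and closed.

	package example

	import (
		"io"
		"io/ioutil"
		"net/http"
	)

	// fetchAndDiscard shows the pattern that allows connection reuse: drain the
	// body and close it, so the underlying TCP connection goes back into the
	// Transport's idle pool instead of being left open (and, under heavy request
	// volume, eventually exhausting the available SNAT ports).
	func fetchAndDiscard(url string) error {
		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		_, err = io.Copy(ioutil.Discard, resp.Body) // drain so the connection can be reused
		return err
	}

	// If resp.Body is never closed, each request keeps its connection open and
	// the next request has to open a new connection on a new source port.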
I had looked at fixing this by writing 'better' AQL in the jfrog CLI, but it appears to cause a segfault on the Artifactory server side due to the payload being too large. So instead I started tracking down the network calls, and was led here.
A brief search through the git history didn't appear to turn up anything -- I'm wondering what the use case for the 'leave body open' connections is? An easy fix would be to not use them - this is certainly a hindrance for any npm-style builds with build-info enabled.