Open ferrarimarco opened 4 years ago
If your command is talking to the .org API endpoint, you should not need auth. The "auth" error message could be a red herring.
Here, I remove any existing travis.rb configuration and invoke the lint command:
$ rm -rf ~/.travis && travis lint --debug --debug-http --skip-completion-check -x travis_production_test/.travis.yml
** reading travis_production_test/.travis.yml
** POST "lint" {"content"=>"language: ruby\ndist: bionic\n\nrvm: 2.5.4\n\ninstall: skip\nscript: skip\n\njobs:\n include:\n - stage: test\n - stage: deVeloP\n - stage: Test\n - stage: develop\n\n# matrix:\n# include:\n# - language: generic\n# - language: java\n # - sudo: false\n # dist: precise\n # env: DIST=precise STACK=EC2\n # - sudo: required\n # dist: precise\n # env: DIST=precise STACK=GCE\n # - sudo: false\n # dist: trusty\n # env: DIST=trusty STACK=EC2\n # - sudo: required\n # dist: trusty\n # env: DIST=trusty STACK=GCE\n # - sudo: false\n # dist: trusty\n # group: edge\n # env: DIST=trusty STACK=EC2 GROUP=edge\n # - sudo: required\n # dist: trusty\n # group: edge\n # env: DIST=trusty STACK=GCE GROUP=edge\n# - os: osx\n\nnotifications:\n email: false\n\n # slack:\n # rooms:\n # - secure: \"KZTfjxhkOdDDZYD2ThBILt2ShYtdEee3PudFZlQBfpxG9N470GfxnKVfOl7Odu4lN6cIupwC4S4BnehubEywSR99l60VmfhePUCTqDLYy+iR9+guTvaECdXdaau3jTKhFdu5u60ITsiw5Yb+hwAaGFO0HmQhqR9XVr/U8IGQlok=\"\n # template:\n # - \"Build {build_url}|#%{build_number}> (<%{compare_url}|%{commit}>) of %{repository}@%{branch} by %{author} %{result} in %{duration} (elapsed time: %{elapsed_time}) subject: %{commit_subject} message: %{commit_message} foobar\"\n # on_success: always\n"}
** took 0.11 seconds
Hooray, travis_production_test/.travis.yml looks valid :)
** Storing "/home/travis/.travis/config.yml"
The .com API endpoint requires auth:
$ rm -rf ~/.travis && travis lint --debug --debug-http --skip-completion-check -x travis_production_test/.travis.yml --pro
not logged in, please run travis login --pro
The intermittent lint failures are a problem, and this is not the only report I've seen. Unfortunately, I have never seen one firsthand, and at the moment it is not clear to me why it might fail.
Wouldn't jobs on .com already run travis login --pro implicitly, then?
Hi!
The travis validate command shouldn't be talking to any API to do its job, right? Or is that not the case?
@ljharb It does not do that at the moment. Whether it should is a valid (but separate) UX question. To be honest, I'd never thought it should, until you raised it.
@ferrarimarco I assume you mean travis lint, not travis validate. travis lint works by talking to the API endpoint POST /lint with the required payload. As we have seen, this behavior seems to be a little unreliable. I'll dig a little deeper.
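To make the mechanism concrete, here is a minimal sketch of how a client could build that request, based on the --debug-http output earlier in the thread (the gem POSTs a "content" parameter to /lint). The helper name buildLintRequest and the options object shape are hypothetical, not part of travis.rb or travis-lint:

```javascript
// Sketch only: build the request a lint client would send. The endpoint URLs
// and the "content" parameter come from the --debug-http output in this
// thread; everything else (names, shape) is illustrative.
function buildLintRequest(yamlContent, { pro = false } = {}) {
  // --pro targets the .com API, which the thread shows requires auth;
  // the default .org endpoint accepts unauthenticated requests.
  const host = pro ? 'api.travis-ci.com' : 'api.travis-ci.org';
  return {
    url: `https://${host}/lint`,
    method: 'POST',
    body: { content: yamlContent },
  };
}

const req = buildLintRequest('language: ruby\n');
console.log(req.url, req.method); // https://api.travis-ci.org/lint POST
```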
Yes! I meant lint, not validate. Thanks!
I was expecting the linting to work in an air-gapped environment, like many other linting tools that have the logic built in. But that is not the point, I guess :)
I was pointed to this issue late yesterday from a support issue (#16611, for those with internal access) and am glad to hear that I'm not the only one who has been frustrated by this. I opened the support issue on April 4th after seeing the failures for a while before that.
However, my case is slightly different, so hopefully the difference helps eliminate a few things, if they haven't been already. Rather than using the travis gem, I'm using the Node module travis-lint in most of my JS projects, since it calls the same API as the gem and doesn't require devs on my team to have the same gem version installed across all of their machines.
When I reached out to support, I initially assumed that, with the migration of projects from .org to .com, I would be told there was a deprecation involved to encourage moving these calls from the .org API to a .com API, or something along those lines. It clearly wasn't an issue of the old API being turned off, since the failures are so inconsistent, but I assumed brown-outs were probably being used to get attention and nudge the desired migration a bit. This seemed even more likely to me since I don't remember running into this error in my local builds (which I run often). However, after two months, I don't get the impression that is really the case, since the support team at least doesn't seem to be aware of anything along those lines.
Since travis-lint doesn't handle errors, I didn't initially have more information, hoping that the support request would give me some quick insight. The travis-lint module hasn't changed in over 4 years, so I wasn't keen to dig too deep into it before I had some insight into whether the module had any impact on the problems (which seemed unlikely, since it hadn't changed in so long). I did end up forking the module in a few of my projects to at least print the error message, which has been "Forbidden" in all of the cases that I have caught since. I've most recently included NODE_DEBUG=request in a few of my projects to get the additional detail that was requested. Here are a few cases that I have seen since:
Also, this seemed to start happening at the same time as another issue I was having around then (linked in the ticket that I mentioned above) related to encrypted secrets in shareable configs. From an outsider's perspective, it seems like there was likely a large feature release around that time that changed enough things to allow a few regressions like this to slip through.
Hopefully some of this detail is helpful in narrowing down the issue.
I still cannot reason why the /lint endpoint should enforce auth (and respond with 403):
Can you examine the response body of these 403 responses? It's supposed to be only 10 bytes, which seems too short for most (if not all) of the error messages the API should be generating.
Thanks.
Unfortunately, the debug mode of request isn't configurable, so it will take a bit of digging to get that. Without getting that far yet, it seems likely that the content is simply "Forbidden", maybe with a newline.
Is it possible that this is being introduced by a load balancer or something else in the network stack? Like I mentioned above, I don't remember ever seeing this error locally. Even though it is random, I run things locally often enough that it seems very unlikely that I wouldn't have hit this yet, as often as I see it when running on travis-ci.com. Also, when it does happen, there seems to be a cluster of other projects failing for the same reason around the same time.
I obviously don't know any details of your infrastructure layout, but it seems a bit like traffic from .com to the .org API is behaving differently than truly external traffic to the same API.
it seems likely that the content is simply "Forbidden", maybe with a newline

I think this is confirmed by https://github.com/pwmckenna/node-travis-ci/blob/75204638777ae5f0fba6ba889f4b085277ae21b9/lib/travis-http.js#L45-L46, where the err is passed to the callback from https://github.com/pwmckenna/node-travis-lint/blob/05423d6a447c8a2cb9fabadd3386e966e6269b76/lib/travis-lint.js#L17 as res.body.
Maybe there is something else at play.
In the successful build, the response looks completely different: a different Content-Type, a different set of headers, etc.
https://travis-ci.com/github/form8ion/ruby-scaffolder/builds/168660399#L291
I am unsure if this is related, but thought it was worth at least capturing. I finally saw a failure locally, but the output is different from what I've seen in the build logs when run on CI, so I think this failure is at least somewhat different. Running again immediately after this failure passed as expected.
[lint:travis ] undefined:1
[lint:travis ] Invalid JSON in request body
[lint:travis ] ^
[lint:travis ]
[lint:travis ] SyntaxError: Unexpected token I in JSON at position 0
[lint:travis ] at JSON.parse (<anonymous>)
[lint:travis ] at Request._callback (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/travis-ci/lib/travis-http.js:42:29)
[lint:travis ] at Request.self.callback (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:185:22)
[lint:travis ] at Request.emit (events.js:315:20)
[lint:travis ] at Request.<anonymous> (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:1154:10)
[lint:travis ] at Request.emit (events.js:315:20)
[lint:travis ] at IncomingMessage.<anonymous> (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:1076:12)
[lint:travis ] at Object.onceWrapper (events.js:421:28)
[lint:travis ] at IncomingMessage.emit (events.js:327:22)
[lint:travis ] at endReadableNT (_stream_readable.js:1221:12)
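The crash above is JSON.parse being handed a plain-text body ("Invalid JSON in request body" is not JSON), so the error surfaces as an unhelpful SyntaxError. A minimal defensive wrapper would at least expose the raw body; parseLintResponse here is a hypothetical helper, not part of the travis-ci or travis-lint modules:

```javascript
// Sketch: parse a lint response defensively instead of letting JSON.parse
// throw on plain-text bodies like "Forbidden" or server error pages.
// parseLintResponse is hypothetical, for illustration only.
function parseLintResponse(body) {
  try {
    return { ok: true, data: JSON.parse(body) };
  } catch (e) {
    // Surface the raw body so intermittent failures are at least diagnosable.
    return { ok: false, error: `non-JSON response: ${body.trim()}` };
  }
}

console.log(parseLintResponse('{"lint":{"warnings":[]}}'));
// → { ok: true, data: { lint: { warnings: [] } } }
console.log(parseLintResponse('Forbidden\n'));
// → { ok: false, error: 'non-JSON response: Forbidden' }
```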
Is there any chance the endpoint is getting rate limited, based on how many times everyone is trying to run travis lint at one time?
Is this still being investigated? I seem to be having this happen significantly more today than previously. I normally could at least restart a build and, while still very annoying and unnecessary, usually have it succeed. Today, I'm not having success, even after several restart attempts.
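Since the failures appear transient, one client-side mitigation (purely a sketch, not something either module implements) is to retry the lint call with a short backoff instead of restarting the whole build. The names withRetries and attemptLint are hypothetical; attemptLint stands in for whatever Promise-returning lint client is in use:

```javascript
// Sketch: retry a flaky async call a few times with linear backoff.
// All names are hypothetical; attemptLint is any function returning a Promise.
async function withRetries(attemptLint, { retries = 3, delayMs = 1000 } = {}) {
  let lastError;
  for (let i = 0; i <= retries; i++) {
    try {
      return await attemptLint();
    } catch (e) {
      lastError = e;
      if (i < retries) {
        // Wait a bit longer after each failed attempt.
        await new Promise((resolve) => setTimeout(resolve, delayMs * (i + 1)));
      }
    }
  }
  throw lastError;
}

// Example with a mock that fails twice, then succeeds on the third attempt.
let calls = 0;
withRetries(async () => {
  calls += 1;
  if (calls < 3) throw new Error('Forbidden');
  return 'Hooray, STDIN looks valid :)';
}, { delayMs: 10 }).then((result) => console.log(result));
```

This would mask the symptom, not fix it, but it avoids burning an entire build restart on a single flaky HTTP call.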
@BanzaiMan Can you give any update on this issue at all? This is still a big problem and is not limited to the Ruby gem.
@travi Could you point to what you are observing? Certain kind of input to the command reliably leads to errors (on API, which is an issue in its own right), but I can't reproduce the "intermittent" errors.
travis@0ce1faa8192d:~$ while true; do echo "n" | travis lint -x; done
server error (500: "Sorry, we experienced an error.\n\nrequest_id:2df1e207e2c0d6640f96e0f43c9c7442\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:0a1bf3320255c364855a3db71cba376e\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:2aa6a23e05566020b00b62da05ac3c8d\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:1562cc59d6a993a7214d3a03d6e6af5e\n")
⋮
travis@0ce1faa8192d:~$ while true; do echo "{}" | travis lint -x; done
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
⋮
travis@0ce1faa8192d:~$ while true; do echo "a: b" | travis lint -x; done
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
⋮
travis@0ce1faa8192d:~$ while true; do echo "language: ruby" | travis lint -x; done
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
⋮
None of these are authenticated.
travis@0ce1faa8192d:~$ cat ~/.travis/config.yml
---
last_check:
  etag: '"b2d547ae1df5207852df42a8c407008a"'
  version: 1.9.1
  at: 1596658398
checked_completion: true
completion_version: 1.9.1
endpoints:
  https://api.travis-ci.org/:
    access_token:
@BanzaiMan I'm not sure how to provide the information you are suggesting beyond what I've already provided above and in the support issue that I mentioned. If there is specific information that you think I can send that I haven't already, I need help understanding what that is.
If reproducing/observing the intermittent nature of this is the problem, is it possible to just follow my account or my form8ion org internally for a few days somehow? I have at least a handful of these failures almost every day. Often, GitHub PRs result in one of the two builds passing and the other failing. These PRs make no changes to the .travis.yml file, so there should be no reason for a different result between the two builds.
As mentioned above, it seems most likely that the failure response comes from somewhere in front of the actual lint service rather than the service itself, possibly the result of something like rate limiting.
Here is a sample of failures from just today. It is worth noting that I had several other PRs for the same dependency updates that had both builds pass. I will leave the failures alone for now, but would like to correct them sometime soon. Unfortunately, restarting the builds does not leave the old result behind in a linkable way, which complicates reporting this behavior in a way that is accessible over time. I would appreciate it if you could investigate these builds relatively quickly so that I can fix them by restarting them as a follow-up:
Please let me know if this information is helpful beyond what has already been provided.
@BanzaiMan more from today:
@BanzaiMan
@travi Can you record the payload and the actual return value from POST https://api.travis-ci.com/lint? I am quite puzzled how res can be undefined at https://travis-ci.com/github/form8ion/remark-update-legacy-badge-markers/builds/179642627#L287 when it's handed down from
https://github.com/pwmckenna/node-travis-lint/blob/master/lib/travis-lint.js#L22
In my testing, when the input is usable (even when it's empty), we should be returning a data structure that includes lint.warnings:
$ curl -s -X POST https://api.travis-ci.com/lint
{"lint":{"warnings":[]}}
$ curl -s -X POST -d "language: ruby" https://api.travis-ci.com/lint
{"lint":{"warnings":[]}}
Only when the input is egregiously wrong do we return something catastrophic (which should be fixed):
$ curl -s -X POST -d "foo" https://api.travis-ci.com/lint
Sorry, we experienced an error.
request_id:6e314394aa334c75e3bf98a1eaae5ed5
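The three body shapes seen in this thread (a JSON lint result, a bare "Forbidden", and the plain-text 500 page carrying a request_id) are distinct enough to classify, which would make reports to support more actionable. A hedged sketch; classifyLintBody is a hypothetical helper, not part of any Travis tooling:

```javascript
// Sketch: classify the lint endpoint's observed response bodies and pull the
// request_id out of the plain-text 500 page so it can be reported to support.
// classifyLintBody is hypothetical, for illustration only.
function classifyLintBody(body) {
  try {
    const data = JSON.parse(body);
    if (data && data.lint) {
      return { kind: 'lint-result', warnings: data.lint.warnings };
    }
  } catch (_) {
    // Not JSON; fall through to the plain-text cases below.
  }
  const match = body.match(/request_id:([0-9a-f]+)/);
  if (match) return { kind: 'server-error', requestId: match[1] };
  if (/forbidden/i.test(body)) return { kind: 'forbidden' };
  return { kind: 'unknown' };
}

console.log(classifyLintBody('{"lint":{"warnings":[]}}').kind); // lint-result
console.log(classifyLintBody('Forbidden\n').kind);              // forbidden
console.log(classifyLintBody(
  'Sorry, we experienced an error.\n\nrequest_id:6e314394aa334c75e3bf98a1eaae5ed5\n'
)); // → { kind: 'server-error', requestId: '6e314394aa334c75e3bf98a1eaae5ed5' }
```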
Can you record the payload and the actual return value from POST https://api.travis-ci.com/lint?
Quite honestly, no. As you can see, I'm observing these errors distributed across many projects in ways that are very unpredictable. It would be a significant investment for me to find a way to record additional information like this across enough of my projects to gather useful data. I've already traced through most of the code stack for the processing of this request and provided additional debugging output, both linked above.
Only when it's egregiously wrong, do we return something catastrophic (which should be fixed)
Again, this appears to be behavior unrelated to the lint service itself, along the lines of rate limiting in the network stack before the request reaches the actual service. That you are this puzzled by how the service could return this type of response suggests this even further in my mind. Has there been any investigation into this issue outside of the actual lint service?
is it possible to just follow my account or my form8ion org internally for a few days somehow?
I could provide lists like I have been recently, almost every day. These accounts have enough builds running each day that observing them without me in the loop should provide plenty of examples to identify issues. The tooling you must have internally should be much better at tracing network calls than anything I could possibly add on my end.
Hi! I have an issue with the travis lint command. I run it as a build step on Travis CI, and it sometimes fails with the following error:
Most of the time, if I restart the build step/job, it succeeds with no errors.
So, I think we actually have two issues:
1. travis lint shouldn't need any token.
2. travis lint intermittently fails with this error.
Can you assist? Thanks!
Example build: https://travis-ci.com/github/ferrarimarco/kubernetes-playground/jobs/316452382
Gem installation output: