travis-ci / travis.rb

Travis CI Client (CLI and Ruby library)
MIT License
1.59k stars 409 forks source link

travis lint fails with intermittent "invalid access token" errors #732

Open ferrarimarco opened 4 years ago

ferrarimarco commented 4 years ago

Hi! I've an issue with the travis lint command.

I run it as a build step on Travis CI, and it sometimes fail with the following error:

$ echo "n" | travis lint -x
Shell completion not installed. Would you like to install it now? |y| invalid access token - try running travis login
The command "echo "n" | travis lint -x" exited with 1.

Most of the times, if I restart the build step/job, it succeeds, with no errors.

So, I think we actually have two issues:

  1. travis lint shouldn't need any token.
  2. travis lint intermittently fails with this error.

Can you assist? Thanks!

Example build: https://travis-ci.com/github/ferrarimarco/kubernetes-playground/jobs/316452382

Gem installation output:

7.91s$ gem install travis --no-document
Fetching: multipart-post-2.1.1.gem (100%)
Successfully installed multipart-post-2.1.1
Fetching: faraday-0.17.3.gem (100%)
Successfully installed faraday-0.17.3
Fetching: faraday_middleware-0.14.0.gem (100%)
Successfully installed faraday_middleware-0.14.0
Fetching: highline-1.7.10.gem (100%)
Successfully installed highline-1.7.10
Fetching: backports-3.17.0.gem (100%)
Successfully installed backports-3.17.0
Fetching: concurrent-ruby-1.1.6.gem (100%)
Successfully installed concurrent-ruby-1.1.6
Fetching: i18n-1.8.2.gem (100%)
HEADS UP! i18n 1.1 changed fallbacks to exclude default locale.
But that may break your application.
If you are upgrading your Rails application from an older version of Rails:
Please check your Rails app for 'config.i18n.fallbacks = true'.
If you're using I18n (>= 1.1.0) and Rails (< 5.2.2), this should be
'config.i18n.fallbacks = [I18n.default_locale]'.
If not, fallbacks will be broken in your app by I18n 1.1.x.
If you are starting a NEW Rails application, you can ignore this notice.
For more info see:
https://github.com/svenfuchs/i18n/releases/tag/v1.1.0
Successfully installed i18n-1.8.2
Fetching: thread_safe-0.3.6.gem (100%)
Successfully installed thread_safe-0.3.6
Fetching: tzinfo-1.2.7.gem (100%)
Successfully installed tzinfo-1.2.7
Fetching: activesupport-5.2.4.2.gem (100%)
Successfully installed activesupport-5.2.4.2
Fetching: multi_json-1.14.1.gem (100%)
Successfully installed multi_json-1.14.1
Fetching: public_suffix-4.0.4.gem (100%)
Successfully installed public_suffix-4.0.4
Fetching: addressable-2.7.0.gem (100%)
Successfully installed addressable-2.7.0
Fetching: net-http-persistent-2.9.4.gem (100%)
Successfully installed net-http-persistent-2.9.4
Fetching: net-http-pipeline-1.0.1.gem (100%)
Successfully installed net-http-pipeline-1.0.1
Fetching: gh-0.16.0.gem (100%)
Successfully installed gh-0.16.0
Fetching: launchy-2.5.0.gem (100%)
Successfully installed launchy-2.5.0
Fetching: ffi-1.12.2.gem (100%)
Building native extensions. This could take a while...
Successfully installed ffi-1.12.2
Fetching: ethon-0.12.0.gem (100%)
Successfully installed ethon-0.12.0
Fetching: typhoeus-0.8.0.gem (100%)
Successfully installed typhoeus-0.8.0
Fetching: websocket-1.2.8.gem (100%)
Successfully installed websocket-1.2.8
Fetching: pusher-client-0.6.2.gem (100%)
Successfully installed pusher-client-0.6.2
Fetching: travis-1.8.13.gem (100%)
Successfully installed travis-1.8.13
23 gems installed
BanzaiMan commented 4 years ago

If your command is talking to .org API endpoint, you should not need auth. The "auth" error message could be a red herring.

Here, I remove any existing travis.rb configuration and invoke the lint command:

$ rm -rf ~/.travis && travis lint --debug --debug-http --skip-completion-check -x travis_production_test/.travis.yml
** reading travis_production_test/.travis.yml
** POST "lint" {"content"=>"language: ruby\ndist: bionic\n\nrvm: 2.5.4\n\ninstall: skip\nscript: skip\n\njobs:\n  include:\n    - stage: test\n    - stage: deVeloP\n    - stage: Test\n    - stage: develop\n\n# matrix:\n#   include:\n#     - language: generic\n#     - language: java\n    # - sudo: false\n    #   dist: precise\n    #   env: DIST=precise STACK=EC2\n    # - sudo: required\n    #   dist: precise\n    #   env: DIST=precise STACK=GCE\n    # - sudo: false\n    #   dist: trusty\n    #   env: DIST=trusty STACK=EC2\n    # - sudo: required\n    #   dist: trusty\n    #   env: DIST=trusty STACK=GCE\n    # - sudo: false\n    #   dist: trusty\n    #   group: edge\n    #   env: DIST=trusty STACK=EC2 GROUP=edge\n    # - sudo: required\n    #   dist: trusty\n    #   group: edge\n    #   env: DIST=trusty STACK=GCE GROUP=edge\n#     - os: osx\n\nnotifications:\n  email: false\n\n  #  slack:\n  #  rooms:\n  #    - secure: \"KZTfjxhkOdDDZYD2ThBILt2ShYtdEee3PudFZlQBfpxG9N470GfxnKVfOl7Odu4lN6cIupwC4S4BnehubEywSR99l60VmfhePUCTqDLYy+iR9+guTvaECdXdaau3jTKhFdu5u60ITsiw5Yb+hwAaGFO0HmQhqR9XVr/U8IGQlok=\"\n  #  template:\n  #    - \"Build {build_url}|#%{build_number}> (<%{compare_url}|%{commit}>) of %{repository}@%{branch} by %{author} %{result} in %{duration} (elapsed time: %{elapsed_time}) subject: %{commit_subject} message: %{commit_message} foobar\"\n  #  on_success: always\n"}
**   took 0.11 seconds
Hooray, travis_production_test/.travis.yml looks valid :)
** Storing "/home/travis/.travis/config.yml"

.com API endpoint requires auth:

$ rm -rf ~/.travis && travis lint --debug --debug-http --skip-completion-check -x travis_production_test/.travis.yml --pro
not logged in, please run travis login --pro

The intermittent lint failures are a problem, and this is not the only report I've seen. Unfortunately, I have never seen one firsthand, and at the moment it is not clear to me why it might fail.

ljharb commented 4 years ago

Wouldn't jobs in .com already run travis login --pro implicitly then?

ferrarimarco commented 4 years ago

Hi!

The travis validate command shouldn't be talking to any API to do its job, right? Or it's not the case?

BanzaiMan commented 4 years ago

@ljharb It does not do that at the moment. Whether it should is a valid (but separate) UX question. To be honest, I'd never thought it should, until you raised it.

@ferrarimarco I assume you mean travis lint, not travis validate. travis lint works by talking to the API end point POST /lint with required payload. As we have seen, this behavior seems to be a little unreliable. I'll dig a little deeper.

ferrarimarco commented 4 years ago

Yes! I meant lint, not validate. Thanks!

I was expecting the linting to work in an airgapped environment, like many other linting tools, that have the logic builtin.

ferrarimarco commented 4 years ago

but this is not the point I guess :)

travi commented 4 years ago

i was pointed to this issue late yesterday from a support issue (#16611 for those that have internal access) and am glad to hear that i'm not the only one that has been frustrated by this. i opened the support issue on April 4th after seeing the failures for a bit before that.

however, my case is slightly different, so hopefully the difference helps eliminate a few things, if they haven't been already. rather than using the travis gem, i'm using the node module travis-lint in most of my js projects since it calls the same api as the gem and doesn't require devs on my team to have the same gem version installed across all of their machines.

when i reached out to support, i initially assumed that, with the migration of projects from .org to .com, i would be told there was a deprecation involved to encourage migration of these calls from the .org api to a .com api, or something along those lines. it clearly wasnt an issue of the old api being turned off since the failures are so inconsistent, but i assumed brown-outs were probably being used to get attention and nudge the desired migration a bit. this seemed even more likely to me since I don't remember running into this error in my local builds (which i run often). however, after two months, i don't get the impression that is really the case since the support team at least doesn't seem to be aware of something along those lines.

since travis-lint doesn't handle errors, i didnt initially have more information, hoping that the support request would give me some quick insight. the travis-lint module hasnt changed in over 4 years, so i wasn't wanting to dig too deep into the module before i had some insight into whether the module had any impact on the problems (seemed unlikely since it hadnt changed in so long). i did end up forking the module in a few of my projects to at least print the error message, which has been "Forbidden" in all of the cases that I have caught since. i've most recently included NODE_DEBUG=request in a few of my projects to get more detail that was requested. here are a few cases that I have seen since:

also, this seemed to start happening at the same time as another issue I was having around that time (linked to in the ticket that i mentioned above) related to encrypted secrets in shareable configs. from an outsider perspective, it seems like there was likely a large feature release around that time that changed enough things that allowed a few regressions like this to slip through.

hopefully some of this detail is helpful in narrowing down the issue.

BanzaiMan commented 4 years ago

I still cannot reason why /lint endpoint should enforce auth (and respond with 403):

https://github.com/travis-ci/travis-api/blob/3fb913cdcdcc940466112c21e11a546a643b56cc/lib/travis/api/app/endpoint/lint.rb#L8-L9

Can you examine the response body of these 403 requests? It's supposed to be only 10 bytes, which seems too short for most (if not all) of the error message api should be generating.

Thanks.

travi commented 4 years ago

unfortunately the debug mode of request isn't configurable, so it will take a bit of digging to get that. without getting that far yet, it seems likely that the content is simply "Forbidden", maybe with a newline.

is it possible that this is being introduced by a load balancer or something else in the network stack? like i mentioned above, i dont remember ever seeing this error locally. even though it is random, i run things locally often enough that it seems very unlikely that i wouldn't have hit this yet, as often as i see it when running on travis-ci.com. also, when it does happen, there seems to be a cluster of other projects failing for the same reason around the same time.

i obviously dont know any details of your infrastructure layout, but it seems a bit like traffic from .com to the .org api is behaving differently than truly external traffic to the same api.

travi commented 4 years ago

it seems likely that the content is simply "Forbidden", maybe with a newline

i think this is confirmed by https://github.com/pwmckenna/node-travis-ci/blob/75204638777ae5f0fba6ba889f4b085277ae21b9/lib/travis-http.js#L45-L46 where the err is passed to the callback from https://github.com/pwmckenna/node-travis-lint/blob/05423d6a447c8a2cb9fabadd3386e966e6269b76/lib/travis-lint.js#L17 as res.body

BanzaiMan commented 4 years ago

Maybe there is something else at play.

In the successful build, the response looks completely different. Different Content-Type, different set of headers, etc.

https://travis-ci.com/github/form8ion/ruby-scaffolder/builds/168660399#L291

travi commented 4 years ago

i am unsure if this is related, but thought it was worth at least capturing. i finally saw a failure locally, but the output is different than i've seen in the build logs when run on ci, so i think this failure is at least somewhat different. running again immediately after this failure, passed as expected.

[lint:travis   ] undefined:1
[lint:travis   ] Invalid JSON in request body
[lint:travis   ] ^
[lint:travis   ] 
[lint:travis   ] SyntaxError: Unexpected token I in JSON at position 0
[lint:travis   ]     at JSON.parse (<anonymous>)
[lint:travis   ]     at Request._callback (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/travis-ci/lib/travis-http.js:42:29)
[lint:travis   ]     at Request.self.callback (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:185:22)
[lint:travis   ]     at Request.emit (events.js:315:20)
[lint:travis   ]     at Request.<anonymous> (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:1154:10)
[lint:travis   ]     at Request.emit (events.js:315:20)
[lint:travis   ]     at IncomingMessage.<anonymous> (/Users/travi/development/form8ion/mocha-scaffolder/node_modules/request/request.js:1076:12)
[lint:travis   ]     at Object.onceWrapper (events.js:421:28)
[lint:travis   ]     at IncomingMessage.emit (events.js:327:22)
[lint:travis   ]     at endReadableNT (_stream_readable.js:1221:12)
ljharb commented 4 years ago

Is there any chance the endpoint is getting rate limited, based on how many times everyone is trying to run travis lint at one time?

travi commented 4 years ago

is this still being investigated? i seem to be having this happen significantly more today than previously. i normally could at least restart a build and, while still very annoying and should be unnecessary, usually have it succeed. today, i'm not having success, even after several restart attempts.

travi commented 4 years ago

@BanzaiMan can you give any update on this issue at all? This is still a big problem and is not limited to the Ruby gem

BanzaiMan commented 4 years ago

@travi Could you point to what you are observing? Certain kind of input to the command reliably leads to errors (on API, which is an issue in its own right), but I can't reproduce the "intermittent" errors.

travis@0ce1faa8192d:~$ while true; do echo "n" | travis lint -x; done
server error (500: "Sorry, we experienced an error.\n\nrequest_id:2df1e207e2c0d6640f96e0f43c9c7442\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:0a1bf3320255c364855a3db71cba376e\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:2aa6a23e05566020b00b62da05ac3c8d\n")
server error (500: "Sorry, we experienced an error.\n\nrequest_id:1562cc59d6a993a7214d3a03d6e6af5e\n")
⋮
travis@0ce1faa8192d:~$ while true; do echo "{}" | travis lint -x; done
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
⋮
travis@0ce1faa8192d:~$ while true; do echo "a: b" | travis lint -x; done
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
Warnings for STDIN:
[x] [warn] on root: unknown key a (b)
⋮
travis@0ce1faa8192d:~$ while true; do echo "language: ruby" | travis lint -x; done
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
Hooray, STDIN looks valid :)
⋮

None of these are authenticated.

travis@0ce1faa8192d:~$ cat ~/.travis/config.yml
---
last_check:
  etag: '"b2d547ae1df5207852df42a8c407008a"'
  version: 1.9.1
  at: 1596658398
checked_completion: true
completion_version: 1.9.1
endpoints:
  https://api.travis-ci.org/:
    access_token:
travi commented 4 years ago

@BanzaiMan i'm not sure how to provide the information you are suggesting beyond what i've already provided above and in the support issue that i mentioned. if there is specific information that you think i can send that i haven't already, i think i need help understanding what that is.

if reproducing/observing the intermittent nature of this is the problem, is it possible to just follow my account or my form8ion org internally for a few days somehow? i have at least a handful of these failures almost every day often github PRs result in one of the two builds passing and the other failing. these PRs make no changes to the .travis.yml file, so there should be no reason for a different result between the two builds.

as mentioned above, it seems most likely that the failure response is somewhere in front of the actual lint service rather than the service itself, possibly the result of something like rate limiting.

here is a sample of failures from just today. it is worth noting that i had several other PRs for the same dependency updates that had both builds pass. i will leave the failures alone for now, but would like to correct them sometime soon. unfortunately, restarting the builds does not leave the old result behind in a linkable way, so that complicates reporting this behavior in a way that is accessible over time. i would appreciate it if you could investigate these builds relatively quickly so that i can fix them by restarting them as a follow up:

please let me know if this information is helpful beyond what has already been provided.

ljharb commented 4 years ago

https://travis-ci.com/github/ljharb/travis-ci/builds/177692185

travi commented 4 years ago

@BanzaiMan more from today:

travi commented 4 years ago

@BanzaiMan

BanzaiMan commented 4 years ago

@travi Can you record the payload and the actual return value from POST https://api.travis-ci.com/lint? I am quite puzzled how res can be undefined at https://travis-ci.com/github/form8ion/remark-update-legacy-badge-markers/builds/179642627#L287 when it's handed down from

https://github.com/pwmckenna/node-travis-lint/blob/master/lib/travis-lint.js#L22

In my testing, when the input is usable (even when it's empty), we should be returning a data structure that includes lint.warnings:

$ curl -s -X POST  https://api.travis-ci.com/lint
{"lint":{"warnings":[]}}
$ curl -s -X POST -d "language: ruby" https://api.travis-ci.com/lint
{"lint":{"warnings":[]}}

Only when it's egregiously wrong, do we return something catastrophic (which should be fixed)

$ curl -s -X POST -d "foo" https://api.travis-ci.com/lint
Sorry, we experienced an error.

request_id:6e314394aa334c75e3bf98a1eaae5ed5
travi commented 4 years ago

Can you record the payload and the actual return value from POST https://api.travis-ci.com/lint?

quite honestly, no. as you can see, i'm observing these errors distributed across many projects in ways that are very unpredictable. it would be a significant investment for me to find a way to record additional information like this across enough of my projects to gather useful data. I've already traced through most of the code stack for the processing of this request above and provided additional debugging output that i've also linked above.

Only when it's egregiously wrong, do we return something catastrophic (which should be fixed)

again, this appears to be behavior unrelated to the lint service itself, along the lines of rate limiting in the network stack before the request reaches the actual service. you being this puzzled by how the service could return this type of response suggests this even further in my mind. has there been investigation into this issue outside of the actual lint service?

is it possible to just follow my account or my form8ion org internally for a few days somehow?

i could provide lists like i have been recently almost every day. these accounts have plenty of builds running each day that observing them without me in the loop should provide enough examples to identify issues. the tooling you must have internally should be much better at tracing network calls than anything i could possibly add on my end.