scitokens / scitokens-cpp

A C++ implementation of the SciTokens library with a C library interface
Apache License 2.0
5 stars 22 forks source link

"Failed to deserialize a token: Timeout was reached" - how to debug? #128

Open steve8x8 opened 1 year ago

steve8x8 commented 1 year ago
# scitokens-verify `cat /run/user/1902/bt_u1902`
Failed to deserialize a token: Timeout was reached

How to debug this? Debian Bullseye, libscitokens0 version 1.0.2

There's another machine with the same SciTokens setup that returns the correct response (same file content)

$ scitokens-verify `cat /run/user/1902/bt_u1902`
Token deserialization successful.

but I can't find the difference :/

jbasney commented 1 year ago

In #114 the "Timeout was reached" error was caused when trying to retrieve the issuer public key, so the first thing I'd check is that curl can get your issuer's well-known endpoint from the machine in question. For example:

$ curl https://demo.scitokens.org/.well-known/openid-configuration
{"claims_supported":["aud","exp","iat","iss","sub"],"device_authorization_endpoint":"https://demo.scitokens.org/oauth2/device_authorization","grant_types_supported":["authorization_code","refresh_token","urn:ietf:params:oauth:grant-type:token-exchange","urn:ietf:params:oauth:grant-type:device_code"],"id_token_signing_alg_values_supported":["RS256","RS384","RS512"],"issuer":"https://demo.scitokens.org","jwks_uri":"https://demo.scitokens.org/oauth2/certs","registration_endpoint":"https://demo.scitokens.org/oauth2/oidc-cm","response_modes_supported":["query","fragment","form_post"],"response_types_supported":["code","id_token"],"scopes_supported":["read:/","write:/"],"subject_types_supported":["public"],"token_endpoint":"https://demo.scitokens.org/oauth2/token"}
steve8x8 commented 1 year ago

Using "iss": "https://cilogon.org/igwn" from htdecodetoken output, I can connect, both curl https://cilogon.org/.well-known/openid-configuration and curl https://cilogon.org/igwn/.well-known/openid-configuration return some JSON code.

What's the second thing to check? (The token file is 1048 bytes in size, if that matters.) I'm about to run Wireshark on that machine to get a grip on what's being sent where (and meant to be received).

jbasney commented 1 year ago

I got a (large) token from https://cilogon.org/igwn on my Ubuntu 22.04.3 LTS VM to see if I could reproduce the problem:

$ apt-show-versions libscitokens0
libscitokens0:amd64/unknown 1.0.2-1+ubuntu22.04.0 uptodate
$ htdecodetoken | grep iss
  "iss": "https://cilogon.org/igwn",
$ wc ${XDG_RUNTIME_DIR:-/tmp}/bt_u`id -u`
   1    1 1220 /run/user/1000//bt_u1000
$ rm -rf .cache/scitokens
$ strace -o scitokens-trace.txt scitokens-verify `cat /run/user/1000/bt_u1000`
Token deserialization successful.

The strace output shows the keycache getting populated:

$ grep cache scitokens-trace.txt | head -5
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
mkdir("/home/jbasney/.cache", 0700)     = -1 EEXIST (File exists)
mkdir("/home/jbasney/.cache/scitokens", 0700) = 0
newfstatat(AT_FDCWD, "/home/jbasney/.cache/scitokens/scitokens_cpp.sqllite", 0x7ffeb24c35d0, AT_SYMLINK_NOFOLLOW) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/home/jbasney/.cache/scitokens/scitokens_cpp.sqllite", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 3

I see the issuer keys in the keycache:

$ sqlite3 ~/.cache/scitokens/scitokens_cpp.sqllite
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> select * from keycache;
https://cilogon.org/igwn|{"expires":1694442612,"jwks":{"keys":[{"alg":"RS256","e":"AQAB","kid":"2FEC35C60764D27594EBEC4D98824CD8","kty":"RSA","n":"AKtyEMIR2_ymESOtrak2QJfdQvVTip-QoC_fIgorqO47p_iBkkrvwRdyQoi_MyNi-5nLQ1pPScaL3niLRvgzx11jY7GoIil2gMN8NhY6dX94dCQC3Qsvo5C3tvla1SkfEGO8y52z8biReVQZy8_Hdy6l__IGveR6aQmC1tKecf1e-kJROS46yA9qpsNEtkREdwZAD9tmrDgddbfXAcCUtfdc62dEZTryuDcJGoXEaWBAobVIGIE5e7IYjsiZz_KtIOgaDqoguXeyttZB1-g2ok8PT9S3Xdt3gP4Wy3PE_qO58UCAv5YejG6Uau8XyxKfxGQrPnXERVJl3PQF1NCJj-8","use":"sig"},{"alg":"RS384","e":"AQAB","kid":"680182D6A0BF3BFF98E7DC211991EA44","kty":"RSA","n":"ALVOmsYrvRb8jzjyS99p2J22bP7Tvv9FGNzW62xDLKy77rVud0zMAUh7RJwUROmAoRoJ1soDIDXJH3-F4V_pZpI80_nGalXjX2SSbeNFtQVSilXdlS-CPwBE59A4lueDTvIg4925pI5PR50ShWQHMXnXU4xM5iFs9V0wtZsbKvs8Flsu2va_pnbhOb_3otEaz68iI2SI8srP-cz5e18l5SI-xtGie_IOWYkVN9Bs2O0ELcpI2hMT86m_dXJGUy5rMRpk6gTLGh3kis4wZN4OLrLvO_qTo3BU9I54nvCwiaQVWJwpZAq4Vf5jwXdrSCNnHLYOiPB1jix-Yor4JUJP8lk","use":"sig"},{"alg":"RS512","e":"AQAB","kid":"90387F85055ECED268E374A28414A056","kty":"RSA","n":"AItqzHE1vr_intX7uYEz5wf7Ppn2FSJUdhDn-GNu4Pg6MWUOB9JV89M90a2v9xHD-mHLhVgiqiR5VBz7o6Oo75x5zREB_UauYO3P5TnJEREXAMGc_ZWQiXhHr9LrShUYBcEP2DB1xpibXnq2m6pCvK9bSCVxyMZFtTHKGes7cQV6KTUd5S4ujAa7EQgccf6WXTUu43h1_VRN9NxJcFgkD8C4JKJPz1jYmNP7neL3qQmVIBRurA_gMkXP2ipT--7pN3PsYMKdE6XhNhbbCUJMrNoX7cT-_urvLXYAjHfplch3VrRANgr2auCQ2DtOJb9bztaTdB0A14Imdvujbi1ISSc","use":"sig"}]},"next_update":1694097612}

I'm not sure what else to check. I think scitokens-verify needs a verbose mode.

maarten-litmaath commented 1 year ago

Hi all, could there be a timeout in populating the key cache? Does it depend on acquiring a lock? If so, does the directory on the failing host happen to be mounted through NFS?

bbockelm commented 1 year ago

Hi,

Don't forget there are two different downloads: one for the metadata, the second for the public key. Right now, those seem to point to the same IP address (but maybe there's an implementation detail I can't see?). @steve8x8 - can you try both of these commands:

curl -v https://cilogon.org/igwn/.well-known/openid-configuration
curl -v https://cilogon.org:443/oauth2/certs/igwn

(if there's nothing private in the output, attaching them as a file would be useful)

Now, if the networking setup looks clear, I know that @djw8605 was merging some PRs around bugs in the latest version... maybe there's some undefined behavior triggering on one box but not the other?

Brian

DrDaveD commented 1 year ago

I have looked a little into another timeout that @mambelli was experiencing. In his case he was using a strange test network setup that ended up not being able to get a response from the DNS the first couple of tries, each of which timed out after 2 seconds. We found out that the scitokens timeout was only 4 seconds, which @djw8605 confirmed. That issue was worked around by updating /etc/resolv.conf to prioritize a different DNS which could respond right away.

steve8x8 commented 12 months ago

Wow, what a lot of responses ;-) Let me work my way through them.

@jbasney I've kept straces from a successful and a failed attempt (different hosts), and the failed one diverts right after closing the keycache database file, where I run into a long SIGPIPE loop:

socketpair(AF_UNIX, SOCK_STREAM, 0, [3, 4]) = 0
fcntl(3, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, NULL, 8) = 0
socketpair(AF_UNIX, SOCK_STREAM, 0, [5, 6]) = 0
mmap(NULL, 2101248, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x14cf91d58000
mprotect(0x14cf91d59000, 2097152, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x14cf91f57d30, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1781962], tls=0x14cf91f58700, child_tidptr=0x14cf91f589d0) = 1781962
rt_sigaction(SIGPIPE, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, NULL, 8) = 0
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x14cf9312cd60}, 8) = 0

I was wondering which pipe would be involved here, but perhaps that's a red herring.

@maarten-litmaath This seems to happen after accessing the keycache - which (of course, like all cluster users' homes) is on NFS.

@bbockelm Both commands seem to succeed:

$ curl -v https://cilogon.org/igwn/.well-known/openid-configuration
*   Trying 3.14.54.175:443...
* Connected to cilogon.org (3.14.54.175) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Illinois; O=University of Illinois; CN=cilogon.org
*  start date: Apr  3 00:00:00 2023 GMT
*  expire date: May  3 23:59:59 2024 GMT
*  subjectAltName: host "cilogon.org" matched cert's "cilogon.org"
*  issuer: C=US; ST=MI; L=Ann Arbor; O=Internet2; OU=InCommon; CN=InCommon RSA Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55f02f0f81e0)
> GET /igwn/.well-known/openid-configuration HTTP/2
> Host: cilogon.org
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200 
< content-type: application/json;charset=UTF-8
< date: Fri, 08 Sep 2023 10:48:06 GMT
< set-cookie: _cilogon_org_oauth2=4eb32aa3bdcdaa28; Path=/; HttpOnly; Secure
< strict-transport-security: max-age=63072000
< x-content-type-options: nosniff
< x-frame-options: DENY
< content-length: 1510
< 
{
 "authorization_endpoint": "https://cilogon.org/authorize",
 "registration_endpoint": "https://cilogon.org:443/oauth2/oidc-cm",
 "jwks_uri": "https://cilogon.org:443/oauth2/certs/igwn",
 "issuer": "https://cilogon.org/igwn",
 "token_endpoint": "https://cilogon.org:443/oauth2/token",
 "userinfo_endpoint": "https://cilogon.org:443/oauth2/userinfo",
 "introspection_endpoint": "https://cilogon.org:443/oauth2/introspect",
 "revocation_endpoint": "https://cilogon.org:443/oauth2/revoke",
 "revocation_endpoint_auth_methods_supported": ["client_secret_basic"],
 "device_authorization_endpoint": "https://cilogon.org/oauth2/device_authorization",
 "token_endpoint_auth_methods_supported":  [
  "client_secret_post",
  "client_secret_basic",
  "private_key_jwt"
 ],
 "subject_types_supported": ["public"],
 "code_challenge_method_supported":  [
  "plain",
  "S256"
 ],
 "scopes_supported":  [
  "org.cilogon.userinfo",
  "openid",
  "profile",
  "email",
  "offline_access"
 ],
 "response_types_supported":  [
  "code",
  "id_token"
 ],
 "grant_types_supported":  [
  "web",
  "urn:oa4mp:params:oauth:grant-type:token-info",
  "urn:ietf:params:oauth:grant-type:token-exchange",
  "refresh_token",
  "authorization_code",
  "urn:ietf:params:oauth:grant-type:device_code"
 ],
 "response_modes_supported":  [
  "query",
  "fragment",
  "form_post"
 ],
 "claims_supported":  [
  "sub",
  "aud",
  "iss",
  "exp",
  "iat",
  "email"
 ],
 "id_token_signing_alg_values_supported":  [
  "RS256",
  "RS384",
  "RS512"
 ]
* Connection #0 to host cilogon.org left intact
}

and

$ curl -v https://cilogon.org:443/oauth2/certs/igwn
*   Trying 3.16.221.198:443...
* Connected to cilogon.org (3.16.221.198) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=Illinois; O=University of Illinois; CN=cilogon.org
*  start date: Apr  3 00:00:00 2023 GMT
*  expire date: May  3 23:59:59 2024 GMT
*  subjectAltName: host "cilogon.org" matched cert's "cilogon.org"
*  issuer: C=US; ST=MI; L=Ann Arbor; O=Internet2; OU=InCommon; CN=InCommon RSA Server CA
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55e2786b01e0)
> GET /oauth2/certs/igwn HTTP/2
> Host: cilogon.org
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200 
< content-type: application/json;charset=UTF-8
< date: Fri, 08 Sep 2023 10:48:44 GMT
< set-cookie: _cilogon_org_oauth2=4eb32aa3bdcdaa28; Path=/; HttpOnly; Secure
< strict-transport-security: max-age=63072000
< x-content-type-options: nosniff
< x-frame-options: DENY
< 
{"keys": [
  {
  "n": "AKtyEMIR2_ymESOtrak2QJfdQvVTip-QoC_fIgorqO47p_iBkkrvwRdyQoi_MyNi-5nLQ1pPScaL3niLRvgzx11jY7GoIil2gMN8NhY6dX94dCQC3Qsvo5C3tvla1SkfEGO8y52z8biReVQZy8_Hdy6l__IGveR6aQmC1tKecf1e-kJROS46yA9qpsNEtkREdwZAD9tmrDgddbfXAcCUtfdc62dEZTryuDcJGoXEaWBAobVIGIE5e7IYjsiZz_KtIOgaDqoguXeyttZB1-g2ok8PT9S3Xdt3gP4Wy3PE_qO58UCAv5YejG6Uau8XyxKfxGQrPnXERVJl3PQF1NCJj-8",
  "e": "AQAB",
  "alg": "RS256",
  "kid": "2FEC35C60764D27594EBEC4D98824CD8",
  "use": "sig",
  "kty": "RSA"
 },
  {
  "n": "ALVOmsYrvRb8jzjyS99p2J22bP7Tvv9FGNzW62xDLKy77rVud0zMAUh7RJwUROmAoRoJ1soDIDXJH3-F4V_pZpI80_nGalXjX2SSbeNFtQVSilXdlS-CPwBE59A4lueDTvIg4925pI5PR50ShWQHMXnXU4xM5iFs9V0wtZsbKvs8Flsu2va_pnbhOb_3otEaz68iI2SI8srP-cz5e18l5SI-xtGie_IOWYkVN9Bs2O0ELcpI2hMT86m_dXJGUy5rMRpk6gTLGh3kis4wZN4OLrLvO_qTo3BU9I54nvCwiaQVWJwpZAq4Vf5jwXdrSCNnHLYOiPB1jix-Yor4JUJP8lk",
  "e": "AQAB",
  "alg": "RS384",
  "kid": "680182D6A0BF3BFF98E7DC211991EA44",
  "use": "sig",
  "kty": "RSA"
 },
  {
  "n": "AItqzHE1vr_intX7uYEz5wf7Ppn2FSJUdhDn-GNu4Pg6MWUOB9JV89M90a2v9xHD-mHLhVgiqiR5VBz7o6Oo75x5zREB_UauYO3P5TnJEREXAMGc_ZWQiXhHr9LrShUYBcEP2DB1xpibXnq2m6pCvK9bSCVxyMZFtTHKGes7cQV6KTUd5S4ujAa7EQgccf6WXTUu43h1_VRN9NxJcFgkD8C4JKJPz1jYmNP7neL3qQmVIBRurA_gMkXP2ipT--7pN3PsYMKdE6XhNhbbCUJMrNoX7cT-_urvLXYAjHfplch3VrRANgr2auCQ2DtOJb9bztaTdB0A14Imdvujbi1ISSc",
  "e": "AQAB",
  "alg": "RS512",
  "kid": "90387F85055ECED268E374A28414A056",
  "use": "sig",
  "kty": "RSA"
 }
* Connection #0 to host cilogon.org left intact
]}

@DrDaveD I'll look into that - indeed the node in question is behind yet another gateway, which may have an effect on response times. Would there be a way to increase the timeout?

Some of you have already guessed that I'm trying to use htgettoken with the cilogin.org/igwn issuer, and I'm apparently lacking some basic understanding of what's going on: htgettoken will tell me that there are files below /tmp and /run but the keycache isn't mentioned, and I probably did things the wrong way for some time. What I have found now, after leaving a token over night (they use to expire after 3 hours, right?):

Somehow I'm not getting the knack of local files (the ones in /tmp and /run/user/$id) and the global keycache which seems to be able to hold only one token at a time (per scope / issuer / ... ?), and what gets written and read where...

Pardon me for being incredibly stupid!

DrDaveD commented 12 months ago

@DrDaveD I'll look into that - indeed the node in question is behind yet another gateway, which may have an effect on response times. Would there be a way to increase the timeout?

There is not currently a way to increase the timeout, it is a compiled-in limit only. That would probably be a good feature request, however.

Some of you have already guessed that I'm trying to use htgettoken with the cilogin.org/igwn issuer, and I'm apparently lacking some basic understanding of what's going on: htgettoken will tell me that there are files below /tmp and /run but the keycache isn't mentioned, and I probably did things the wrong way for some time. What I have found now, after leaving a token over night (they use to expire after 3 hours, right?):

htgettoken does not mention the scitokens keycache because it does not use any scitokens code (it just passes through those tokens, it does nothing to validate them). The only thing in the htgettoken package that uses scitokens is htdecodetoken, and only via the scitokens-verify command if that happens to be installed. So if that's the only place you're using scitokens code you could ignore the error. On the other hand, the scitokens library is used by other things, for example cvmfs_scitoken_helper from the cvmfs-x509-helper package, so if you try to access protected ligo/igwn CVMFS repositories on the same machine, it may also experience the timeout.

Yes the cilogon.org/igwn token issuer is configured to generate tokens that expire in 3 hours.

* Running `htgettoken` on the "failing" node would kind of refresh the token, but `htdecodetoken` (let's settle on that one since it wraps around `scitokens-verify` anyway) would present a timeout. Running the same `htgettoken` on the headnode would again kind of refresh the token, **leaving the decoded content unchanged** but make `htdecodetoken` succeed (not timeout) on the execute node, at least for a while.

The vault server that htgettoken reads from caches those bearer tokens until --minsecs seconds (default 60) before they expire. That's why you see no change in the decoded content. The fact that there was no timeout the next time I'm sure was unrelated to the fact that you ran htgettoken again, but it could very well have been related to the fact that you ran htdecodetoken again because the scitokens keycache would have caused it do fewer network accesses the second time.

* Probably tokens are meant to be created only on submit nodes (as the dependencies on the `igwn-iam` package suggest), and I should completely forget about running `htgettoken` on an exec node?

I'm not familiar with the igwn-iam package you are referring to but yes, htgettoken is expected to be used only on submit nodes and not on execute nodes.

* The final frontier would be getting CVMFS work with scitokens but I'm getting some strange failures on the exec node (as root, everything looks reasonable - as mere user, I get "Permission denied" despite having a matching token in the keycache). Is this whole stuff documented somewhere? Right now I'm blaming myself for doing stupid things out of ignorance...

I believe you are confused about what the scitokens keycache is. It is only used for validating the signatures on JSON Web Tokens (JWTs -- scitokens are JWTs), so it caches the public keys of the token issuers. It does not cache individual JWTs.

There is not good documentation on this stuff that I can point you to, unfortunately. However I can tell you the best way to debug CVMFS access with scitokens is to set CVMFS_DEBUGLOG=/tmp/somefile.log in an /etc/cvmfs/config.d/<repo>.local file where <repo> is the name of the repository that you want more info on. Then after attempting to access it look toward the end of /tmp/somefile.log.authz to see what detailed messages are there.

Somehow I'm not getting the knack of local files (the ones in /tmp and /run/user/$id) and the global keycache which seems to be able to hold only one token at a time (per scope / issuer / ... ?), and what gets written and read where...

The file in /tmp is a vault token. That provides authenticated access to the vault server. The file in /run/user is the bearer token, stored under the WLCG Bearer Token Discovery standard. The scitokens keycache only caches the public key of the token issuer.

steve8x8 commented 12 months ago

@DrDaveD thanks for the explanation (and the confirmation of lack of docs)! I found that the failing node was using a rather long list of name servers in /etc/resolv.conf, most of them pointing to DNS proxies in the internal network of the pool. Since the setup was changed after installation, to add a gateway, this is no longer necessary and I can use "our" official DNS server - response times are much shorter now and I haven't seen another timeout yet. Will continue testing next week. Thanks again!

steve8x8 commented 12 months ago

Now that "everything seems to work" (iow, the small issue has been fixed to make room for the next, possibly bigger, one), I'm wondering -

DrDaveD commented 12 months ago

You are correct that there is a network connection needed to verify the first token, but after that the verifications should be able to be done without the network using the local key cache. I don't think that is a significant issue, however. Network is required for so many different things, this is a very small network access, and there are so many different ways to DoS grid workflows that I don't think that adding one more really changes anything.

jbasney commented 12 months ago

related to #97

bbockelm commented 11 months ago

In addition to what @DrDaveD mentions, I'd point out the public keys don't necessarily need to be hosted at the same location as the issuer itself. It's common to see issuers use a hosting or CDN service (e.g., CloudFlare) that are going to be fairly impervious to a DDoS.

The tradeoff here is the system is much closer to a "fail safe" than a "fail open". In case of a trust root compromise, trust in the keys can be pulled centrally; after a fixed amount of time, you've automatically revoked all signed credentials.

steve8x8 commented 11 months ago

To summarize, the timeout was caused by a DNS server not responding - it takes ten seconds to move on, but a second server, although listed in /etc/resolv.conf, wasn't considered anymore.

I think a 4-second timeout is too short; 15 seconds would at least allow a second DNS to jump in. But I've got to believe the current timeout had been chosen intentionally, so I've got to keep this in mind. (It would have been helpful if the error message would have told me what exactly was timing out...)