urbit / urbit

An operating function
https://urbit.org
MIT License

Certificate creation process fails continuously after boot #1209

Open cdelargy opened 5 years ago

cdelargy commented 5 years ago

After a successful network boot of a new ship, Let's Encrypt refuses to issue the ship's *.arvo.network domain certificate because of a rate limit:

Error finalizing order :: too many certificates already issued for arvo.network: see https://letsencrypt.org/docs/rate-limits/

This error repeats continuously because of the retry logic. A progressively longer retry timeout, a cap on the number of retries, or not retrying every kind of failure would cut down on the doomed requests to Let's Encrypt.
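To make the three suggestions above concrete, here is a minimal sketch in Python (not Hoon, and not the actual certificate-app code). Every name in it (`backoff_delays`, `run_with_backoff`, `TransientFailure`, `PermanentFailure`) is hypothetical, and the 20-second base delay is only borrowed from the repeat interval seen on the console.

```python
from typing import Callable, Iterable, Iterator, List


class TransientFailure(Exception):
    """Hypothetical: a failure that may succeed if retried later."""


class PermanentFailure(Exception):
    """Hypothetical: a failure that retrying cannot fix."""


def backoff_delays(base: float = 20.0, factor: float = 2.0,
                   cap: float = 3600.0, max_retries: int = 8) -> Iterator[float]:
    """Yield the wait in seconds before each retry, growing up to a cap."""
    delay = base
    for _ in range(max_retries):
        yield min(delay, cap)
        delay *= factor


def run_with_backoff(attempt: Callable[[], object],
                     sleep: Callable[[float], None],
                     delays: Iterable[float]) -> object:
    """Retry attempt() on transient failures only, waiting longer each time."""
    for delay in delays:
        try:
            return attempt()
        except TransientFailure:
            sleep(delay)  # wait, then try again
        # PermanentFailure is not caught, so it propagates immediately.
    return attempt()  # last try once the retry budget is spent


# Usage: an attempt that fails twice, then succeeds.
slept: List[float] = []
outcomes = iter([TransientFailure(), TransientFailure(), "ok"])


def flaky() -> object:
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result


result = run_with_backoff(flaky, slept.append, backoff_delays())
```

Passing `sleep` in as a parameter keeps the sketch testable without real waiting; a real client would pass a timer. A rate-limit response like the one above would most plausibly be treated as a long fixed wait or as a permanent failure, not as an ordinary transient one.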

vvisigoth commented 5 years ago

Interesting. @joemfb, so now we know what the rate-limit response looks like?

joemfb commented 5 years ago

Thanks for the report, @cdelargy! All those retries are supposed to back off exponentially, so something's not working quite right. Can you post some of the surrounding console output? Also, that rate-limit message was probably part of a notification from the :talk app, containing the full HTTP response body. Could you post that as well?

cdelargy commented 5 years ago

The following is repeated every 20s for some length of time. I did see it stop repeating on one process, but it starts again when the process is restarted.

[ %check-order-fail
  %invalid
  /~2019.2.20..04.49.03..ad37
  p=200
    q
  ~[
    [p='server' q='nginx']
    [p='content-type' q='application/json']
    [p='content-length' q='610']
    [p='x-frame-options' q='DENY']
    [p='strict-transport-security' q='max-age=604800']
    [p='expires' q='Wed, 20 Feb 2019 04:49:03 GMT']
    [p='cache-control' q='max-age=0, no-cache, no-store']
    [p='pragma' q='no-cache']
    [p='date' q='Wed, 20 Feb 2019 04:49:03 GMT']
    [p='connection' q='keep-alive']
  ]
    r
  [ ~
    [ p=610
        q
      \/'{\0a  "status": "invalid",\0a  "expires": "2019-02-27T04:48:54Z",\0a  "identifiers": [\0a    {\0a      "type": "dns",\0a      "value": "parser-firlux.arvo.net\/
        work"\0a    }\0a  ],\0a  "authorizations": [\0a    "https://acme-v02.api.letsencrypt.org/acme/authz/D34WAHQAGah-N16Vp-ZxBjZnci8D-g0A6-dDbfNzWQ8"\0a  ],\0a  "fi
        nalize": "https://acme-v02.api.letsencrypt.org/acme/finalize/51827800/323126400",\0a  "error": {\0a    "type": "urn:ietf:params:acme:error:rateLimited",\0a
        "detail": "Error finalizing order :: too many certificates already issued for: arvo.network: see https://letsencrypt.org/docs/rate-limits/",\0a    "status": 42
        9\0a  }\0a}'
      \/                                                                                                                                                               \/
    ]
  ]
]

joemfb commented 5 years ago

Thanks, this is helpful. This particular response is bypassing our rate-limit handling, which dispatches off of an HTTP 429 status code. I'm not sure yet why that cause-specific backoff didn't happen. The request that produced this response is also not retried when the status is "invalid". My best guess is that some other prior request is being wrongfully retried without backoff, but I don't see where yet. I'll keep digging.
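One possible reading of the output above is that the HTTP status is 200 (`p=200`) while the 429 appears only inside the JSON `error` object, which would explain why a dispatch on the HTTP status never fires. Assuming that reading is right, a check on the response body could catch the rate limit as well. The sketch below is in Python, not Hoon, and is not the app's actual logic; `classify_order_response` and its return labels are hypothetical, and the sample body is abridged from the posted output.

```python
import json

# Error type string as it appears in the response body posted above.
RATE_LIMITED = "urn:ietf:params:acme:error:rateLimited"


def classify_order_response(http_status: int, body: str) -> str:
    """Label an order response, looking at the JSON body as well as the HTTP status."""
    if http_status == 429:
        return "rate-limited"
    try:
        doc = json.loads(body)
    except ValueError:
        return "unknown"
    if not isinstance(doc, dict):
        return "unknown"
    error = doc.get("error") or {}
    # The rate limit can be reported inside the body even when HTTP says 200.
    if error.get("type") == RATE_LIMITED or error.get("status") == 429:
        return "rate-limited"
    if doc.get("status") == "invalid":
        return "invalid"
    return "ok"


# Abridged from the console output in this thread.
sample_body = json.dumps({
    "status": "invalid",
    "identifiers": [{"type": "dns", "value": "parser-firlux.arvo.network"}],
    "error": {
        "type": RATE_LIMITED,
        "detail": "Error finalizing order :: too many certificates already "
                  "issued for: arvo.network: see "
                  "https://letsencrypt.org/docs/rate-limits/",
        "status": 429,
    },
})

label = classify_order_response(200, sample_body)
```

With a label like this, the rate-limited case could be routed to the cause-specific backoff instead of falling through the generic "invalid" path.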