gomods / athens

A Go module datastore and proxy
https://docs.gomods.io
MIT License
4.38k stars 490 forks source link

Redis lock acquired prematurely leading to 500 internal server errors #1972

Open ngshiheng opened 1 week ago

ngshiheng commented 1 week ago

Describe the bug Background: we use Athens with Redis (as our SingleFlightMode) to manage concurrent write access to our storage (S3).

We have encountered an issue in our Athens setup where requests for certain modules and versions (e.g. github.com/aws/aws-sdk-go/@v/v1.12.42.info) has resulted in a 500 status code and a "redislock: not obtained" error in our logs.

Error Message

When users see 500 Internal Server Error. Our Athens log:

redislock: not obtained

When users see 404 Not Found. Our Athens log (expected):

github.com/aws/aws-sdk-go@v1.12.42: verifying go.mod: github.com/aws/aws-sdk-go@v1.12.42/go.mod: reading https://sum.golang.org/lookup/github.com/aws/aws-sdk-go@v1.12.42: 404 Not Found
   server response:
   not found: create zip: models/apis/sagemaker/2017-07-24/api-2.json: case-insensitive file name collision: "models/apis/SageMaker/2017-07-24/api-2.json" and "models/apis/sagemaker/2017-07-24/api-2.json"
   models/apis/sagemaker/2017-07-24/docs-2.json: case-insensitive file name collision: "models/apis/SageMaker/2017-07-24/docs-2.json" and "models/apis/sagemaker/2017-07-24/docs-2.json"

To Reproduce Steps to reproduce the behavior:

  1. Clone this repository
  2. Start a Redis locally: docker run -d --name some-redis -p 6379:6379 redis
  3. Update config.dev.toml to SingleFlightType = "redis"
  4. Start Athens: make run
  5. To reproduce, rapidly send few requests to time curl http://localhost:3000/github.com/aws/aws-sdk-go/@v/v1.12.42.info. This took me ~30s on average. Here, you should see redislock: not obtained getting logged

Expected behavior The expected behavior here is for Athens to return a 404 status code as this module does not exist. However, Athens is returning a 500 status code (with a Redis lock error logged on the server-side).

Environment

Additional context

  1. Sending the same request to the public proxy https://proxy.golang.org/github.com/aws/aws-sdk-go/@v/v1.12.42.info yields immediate 404 (probably because it caches non-200 responses too)
  2. /github.com/aws/aws-sdk-go/@v/v1.12.42 is part of our transitive dependencies so there isn't a way to avoid it
  3. /github.com/aws/aws-sdk-go/@v/v1.12.42 is just an example, we've seen similar cases with other Go modules like: /cloud.google.com/go/storage/@v/v1.35.0.info, /github.com/go-redis/redis/@v/v6.15.3+incompatible.info, /gopkg.in/go-playground/validator.v9/@v/v9.30.1.info (... and more). Generally they seemed to be 'problematic' modules to run go mod download -json on

Further investigation

The root cause appears to be related to Athens obtaining the Redis lock before fetching from the upstream source (code) at module.downloadModule. In this specific case, fetching from the upstream took a significant amount of time, and obtaining the Redis lock prematurely has lead to lock contention and timeouts.

Upon further investigation, it was discovered that the bottleneck occurred within the goGetFetcher.Fetch function, which calls module.downloadModule (reference)

The module.downloadModule function is responsible for executing the go mod download -json github.com/aws/aws-sdk-go@v1.12.42. Here, the go mod download -json command took on average about 30 seconds to complete

Proposal

matt0x6F commented 1 week ago

@ngshiheng thanks for the report!