wandb / terraform-aws-wandb

A terraform module for deploying Weights & Biases on AWS.
Apache License 2.0

Support for Team BYOB On-Prem stopped working (SUPPORTED_FILE_STORES) #272

Open vijay-wandb opened 1 week ago

vijay-wandb commented 1 week ago

@Flamarion @Al I was incorrect that scenario 2 worked without a code change. Without making any code changes, I could not connect from the local server instance to the minio bucket on my localhost. I again had a false positive: the UI said the connection worked, and at the end of the day on Thursday I forgot to check that the backend was actually able to connect.

I still do not know if the code is in a working state, and if it is, for what scenarios it works. I am going to note my observations here below, but keep in mind that they could be due to issues with my local setup.


I spin up the local server instance in one docker container and the minio bucket on my localhost in another docker container. Therefore, I still needed to use ngrok to get them to connect to each other. (I haven't yet had a chance to test spinning up the minio BYOB bucket in the same docker container.) So, for scenario 2, I needed to make the same code change that removes the AWS GetBucketRegion SDK call so that the local gorilla app can connect to the bucket and upload objects to it. With that change, the app can connect and create a team.

However, the SDK fails to upload. For example,

>>> run = wandb.init(project="pOP", entity='onpremb')
wandb: Currently logged in as: andrew-levin (onpremb). Use `wandb login --relogin` to force relogin
wandb: ERROR wandb version 0.16.4.dev1 has been retired!  Please upgrade.
wandb: wandb version 0.17.0 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.4.dev1
wandb: Run data is saved locally in /Users/andrew/repos/core/services/gorilla/wandb/run-20240523_001524-d2v2js48
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run kind-water-1
wandb: ⭐️ View project at <https://app.wandb.test/onpremb/pOP>
wandb: 🚀 View run at <https://app.wandb.test/onpremb/pOP/runs/d2v2js48>
>>> wandb: ERROR Error uploading "wandb-metadata.json": CommError, <Response [404]>
wandb: ERROR Error uploading "upstream_diff_a85b9ecffdd97c42324aee25dd46526a2dbd450e.patch": CommError, <Response [404]>
wandb: ERROR Error uploading "diff.patch": CommError, <Response [404]>

and then the logs show


2024-05-23 0026,769 INFO    SenderThread:50951 [sender.py1403] saving file wandb-metadata.json with policy now
2024-05-23 0026,822 ERROR   wandb-upload_0:50951 [internal_api.py2765] upload_file exception https://app.wandb.test/privb/onpremb/pOP/d2v2js48/wandb-metadata.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=uwIvWSLpkzpzywH5Z94q%2F20240523%2Fwandb-local%2Fs3%2Faws4_request&X-Amz-Date=20240523T071526Z&X-Amz-Expires=86400&X-Amz-SignedHeaders=host&X-User=andrew-levin&X-Amz-Signature=82bc2a4957c67265f2795faba2e130eaf42c1a9a8d721d4234f416988e7d9143: 404 Client Error: Not Found for url:...
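
One way to read the failing URL is to split it into its SigV4 parts. The sketch below uses only the Python standard library; `describe_presigned_url` is a hypothetical helper name, and the example URL is a placeholder shaped like the one in the log (the access key and signature are dummies).

```python
from urllib.parse import urlsplit, parse_qs


def describe_presigned_url(url: str) -> dict:
    """Break a SigV4 presigned URL into the parts that matter for debugging."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # percent-decodes values, e.g. %2F becomes /
    # X-Amz-Credential has the form <access-key>/<date>/<region>/<service>/aws4_request
    credential = query.get("X-Amz-Credential", [""])[0].split("/")
    expires = query.get("X-Amz-Expires")
    return {
        "host": parts.netloc,
        "path": parts.path,
        "region": credential[2] if len(credential) > 2 else None,
        "service": credential[3] if len(credential) > 3 else None,
        "expires_seconds": int(expires[0]) if expires else None,
    }


# Placeholder URL with the same shape as the logged one (dummy key and signature).
example = (
    "https://app.wandb.test/privb/onpremb/pOP/d2v2js48/wandb-metadata.json"
    "?X-Amz-Algorithm=AWS4-HMAC-SHA256"
    "&X-Amz-Credential=EXAMPLEKEY%2F20240523%2Fwandb-local%2Fs3%2Faws4_request"
    "&X-Amz-Date=20240523T071526Z&X-Amz-Expires=86400"
    "&X-Amz-SignedHeaders=host&X-Amz-Signature=deadbeef"
)
info = describe_presigned_url(example)
print(info)
```

For the logged URL this reports the host as app.wandb.test and the credential scope region as wandb-local, so the 404 comes from a request addressed to the app host with a path starting at /privb. Whether that host and path are what the BYOB setup should produce is not something the log alone settles.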

Issue created in Slack from a [message](https://weightsandbiases.slack.com/archives/C0123GDE0NM/p1716451617039009?thread_ts=1715198459.714839&cid=C0123GDE0NM).

https://github.com/user-attachments/assets/66231f7e-9332-4b2a-abff-a620ccbf7937
vijay-wandb commented 1 week ago

From Flamarion Jorge

Here is the repro

I tested Team BYOB using SUPPORTED_FILE_STORES with 3 different kinds of object storage (Ceph (RadosGW), Minio, and now AWS S3), and it doesn’t work consistently. This has worked in the past: I tested it myself with Minio, and that worked just fine. The configuration is exactly the same. There’s no difference, because the env var requires the content to be the same (although it doesn’t complain if you don’t set it correctly).

Here is the test using AWS S3 (you can replicate it with any other S3-compatible store):

$ export AWS_SECRET_ACCESS_KEY="FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya"
$ export AWS_ACCESS_KEY_ID="AKIA3G72DHZ4SFDSM5NK"
$ aws s3 ls flamarion-team-byob-test

There was no content in the bucket, so I uploaded a random file and listed it again:

$ aws s3 ls flamarion-team-byob-test
2024-06-03 15:20:57        370 list_buckets.py

This means the access key and secret key are valid (you can use them to test). Then I configured my local deployment:

 $ kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep SUPPORTED
Defaulted container "app" out of: app, init-db (init)
SUPPORTED_FILE_STORES=s3://AKIA3G72DHZ4SFDSM5NK:FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya@s3.eu-central-1.amazonaws.com/flamarion-team-byob-test
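
Since the variable is reportedly accepted even when it is set incorrectly, a quick local check of the value can rule out typos before testing in the UI. The sketch below is not how gorilla parses the variable; it assumes only the `s3://<access-key>:<secret-key>@<endpoint>/<bucket>` shape visible in the output above, and `parse_file_store` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, unquote


def parse_file_store(value: str) -> dict:
    """Split an s3://KEY:SECRET@endpoint/bucket string and reject incomplete values."""
    parts = urlsplit(value)
    if parts.scheme != "s3":
        raise ValueError(f"expected scheme 's3', got {parts.scheme!r}")
    if not parts.username or not parts.password:
        raise ValueError("access key and secret key are both required")
    if not parts.hostname:
        raise ValueError("endpoint host is missing")
    bucket = parts.path.strip("/")
    if not bucket:
        raise ValueError("bucket name is missing")
    return {
        # urlsplit does not percent-decode credentials, so decode them here
        "access_key": unquote(parts.username),
        "secret_key": unquote(parts.password),
        "endpoint": parts.hostname,
        "bucket": bucket,
    }


# Placeholder credentials; endpoint and bucket follow the example in this issue.
store = parse_file_store(
    "s3://EXAMPLEKEY:EXAMPLESECRET@s3.eu-central-1.amazonaws.com/flamarion-team-byob-test"
)
print(store["endpoint"], store["bucket"])
```

One thing this kind of check surfaces: a secret key containing an unencoded `/` ends the host part early under standard URL parsing, so the value no longer splits into the intended pieces. Whether the server expects such characters to be percent-encoded is an open question here, not something the issue confirms.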

In theory, I should be able to create a team and assign it the bucket flamarion-team-byob-test, but that doesn’t happen. There’s a video, and this is the log:

{"level":"INFO","time":"2024-06-03T13:14:32.447014485Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"appPath":"/home"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] from app path /home","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448078753Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"latencyNs":1066695,"statusCode":200,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"latencyStr":"1.066695ms"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] finished in 1.066695ms","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448653588Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13034541662550076139","duration":"9.059942ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13034541662550076139 in 9.059942ms with status %!s(int=200) on /graphql","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:36.013749232Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:57","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","path":"/graphql"},"message":"Starting request 13318444034683680619 on /graphql","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.022794835Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"appPath":"/home"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] from app path /home","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.03393013Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"latencyNs":11142999,"statusCode":200,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"latencyStr":"11.142999ms"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] finished in 11.142999ms","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.034779852Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","duration":"21.027626ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13318444034683680619 in 21.027626ms with status %!s(int=200) on /graphql","dd.trace_id":"13318444034683680619"}
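
To scan a longer capture of these logs, a short script can reduce each record to its timestamp, operation, and status code. This is a minimal sketch that assumes one JSON record per line (terminal wrapping removed); `summarize` is a hypothetical helper name, and the two sample lines are trimmed copies of records from the log in this comment.

```python
import json


def summarize(lines):
    """Return (time, operation or path, statusCode) for each JSON log line."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        data = record.get("data", {})
        # relay.go records carry operationName; logging.go records carry path
        label = data.get("operationName") or data.get("path")
        rows.append((record.get("time"), label, data.get("statusCode")))
    return rows


# Trimmed sample records; the field names match the gorilla log lines above.
sample = [
    '{"level":"INFO","time":"2024-06-03T13:14:36.03393013Z",'
    '"data":{"operationName":"testBucketStoreConnection","statusCode":200}}',
    '{"level":"INFO","time":"2024-06-03T13:14:36.034779852Z",'
    '"data":{"requestID":"13318444034683680619","statusCode":200,"path":"/graphql"}}',
]
rows = summarize(sample)
for row in rows:
    print(row)
```

In the log above, every finished record reports statusCode 200, including testBucketStoreConnection. GraphQL servers commonly return HTTP 200 even when the operation itself reports an error in the response body, so a 200 here does not by itself show that the bucket was reachable.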

To ensure the test is not biased by, or tied to, an unusual S3 API, I used AWS S3 itself, in a sandbox account in region eu-central-1, while the main bucket of my installation is on my local Ceph RadosGW storage:

kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep BUCKET
Defaulted container "app" out of: app, init-db (init)
BUCKET=s3://3IAXODZ870OCD6TCFIAD:wsnNA1Vq2RHbKdrXSTaw0a09h79QQBZk0AXRUMNY@pves3.home.lab/wandb
OVERFLOW_BUCKET_ADDR=s3://3IAXODZ870OCD6TCFIAD:wsnNA1Vq2RHbKdrXSTaw0a09h79QQBZk0AXRUMNY@pves3.home.lab/wandb
BUCKET_QUEUE=internal://

Please, let me know if there’s any other test I can do.

I have Minio too; if necessary, I can run the tests using Minio (the same setup that worked in the past and that I reported here: https://weightsandbiases.slack.com/archives/C0123GDE0NM/p1705056205042069?thread_ts=1704995296.871159&cid=C0123GDE0NM).

abhinavg6 commented 1 week ago

@flamarion @levinandrew - Is this a current issue? Sorry I hadn't heard of this until now. Should not be in the consultant queue anyways.