Open vijay-wandb opened 1 week ago
From Flamarion Jorge
Here is the repro
I tested TEAM BYOB using SUPPORTED_FILE_STORES with 3 different kinds of object storage (Ceph (RadosGW), Minio, and now AWS S3), and it doesn’t work consistently. This has worked in the past: I tested it myself with Minio, and that worked just fine. The configuration is exactly the same. There’s no difference, because the env var requires the content to be the same (although it doesn’t complain if you don’t set it correctly).
Here is the test using AWS S3 (you can replicate to any other S3)
$ export AWS_SECRET_ACCESS_KEY="FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya"
$ export AWS_ACCESS_KEY_ID="AKIA3G72DHZ4SFDSM5NK"
$ aws s3 ls flamarion-team-byob-test
There was no content in the bucket. Then I uploaded a random file and listed it again:
$ aws s3 ls flamarion-team-byob-test
2024-06-03 15:20:57 370 list_buckets.py
This means the access key and secret key are valid (you can use them to test). Then I configured my local deployment:
$ kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep SUPPORTED
Defaulted container "app" out of: app, init-db (init)
SUPPORTED_FILE_STORES=s3://AKIA3G72DHZ4SFDSM5NK:FZlP2t6F1dEI4syfmba0YDiVGOYQYV9rUlrVWWya@s3.eu-central-1.amazonaws.com/flamarion-team-byob-test
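Since the env var is accepted even when it is set incorrectly, a quick local check of the value's parts can rule out a formatting mistake before looking at the server. The following is a minimal sketch, assuming the value is a single URL of the form s3://ACCESS_KEY:SECRET_KEY@HOST/BUCKET as shown above. parse_store_url is a hypothetical helper (not part of W&B), and the values in the usage lines are placeholders.

```python
from urllib.parse import urlsplit, unquote


def parse_store_url(url):
    """Split an s3://KEY:SECRET@HOST/BUCKET value into its parts.

    Hypothetical helper for inspecting the value; it never contacts the store.
    """
    parts = urlsplit(url)
    if parts.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got scheme {parts.scheme!r}")
    if not parts.username or not parts.password:
        raise ValueError("access key or secret key is missing")
    if not parts.hostname:
        raise ValueError("endpoint host is missing")
    # The first path segment is taken as the bucket, the rest as a prefix.
    bucket, _, prefix = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError("bucket name is missing")
    return {
        "access_key": unquote(parts.username),
        "secret_key": unquote(parts.password),
        "host": parts.hostname,
        "bucket": bucket,
        "prefix": prefix,
    }


if __name__ == "__main__":
    # Placeholder credentials, not real ones.
    info = parse_store_url(
        "s3://AKIAEXAMPLE:examplesecret@s3.eu-central-1.amazonaws.com/my-team-bucket"
    )
    print(info["host"], info["bucket"])
```

Note that a secret key containing an unencoded "/" would make this parser report missing credentials, because the URL authority ends at the first slash. How the server itself treats such characters is not covered by this sketch.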
Theoretically, I can create a Team and assign the bucket flamarion-team-byob-test but it doesn’t happen. There’s a video and this is the log.
{"level":"INFO","time":"2024-06-03T13:14:32.447014485Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"appPath":"/home"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] from app path /home","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448078753Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"availableTeam","authUser":"flamarion","defaultEntityID":6,"latencyNs":1066695,"statusCode":200,"operationName":"availableTeam","authUser":"flamarion","variables":{"teamName":"team-a"},"latencyStr":"1.066695ms"},"message":"Graphql operation availableTeam for user flamarion with variables map[teamName:team-a] finished in 1.066695ms","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:32.448653588Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13034541662550076139","duration":"9.059942ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13034541662550076139 in 9.059942ms with status %!s(int=200) on /graphql","dd.trace_id":"13034541662550076139"}
{"level":"INFO","time":"2024-06-03T13:14:36.013749232Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:57","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","path":"/graphql"},"message":"Starting request 13318444034683680619 on /graphql","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.022794835Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:148","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"appPath":"/home"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] from app path /home","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.03393013Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/relay.go:243","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","userID":8,"operationName":"testBucketStoreConnection","authUser":"flamarion","defaultEntityID":6,"latencyNs":11142999,"statusCode":200,"operationName":"testBucketStoreConnection","authUser":"flamarion","variables":{"input":{"name":"flamarion-team-byob-test","provider":"AWS","organizationID":null}},"latencyStr":"11.142999ms"},"message":"Graphql operation testBucketStoreConnection for user flamarion with variables map[input:map[name:flamarion-team-byob-test organizationID:<nil> provider:AWS]] finished in 11.142999ms","dd.trace_id":"13318444034683680619"}
{"level":"INFO","time":"2024-06-03T13:14:36.034779852Z","info":{"program":"gorilla","source":"github.com/wandb/core/services/gorilla/api/handler/logging.go:62","pid":1058},"data":{"dd.service":"gorilla","dd.version":"c4f0b6c30787032000d967066a742c4a8c8e1c19","requestID":"13318444034683680619","duration":"21.027626ms","statusCode":200,"path":"/graphql"},"message":"Finished request 13318444034683680619 in 21.027626ms with status %!s(int=200) on /graphql","dd.trace_id":"13318444034683680619"}
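To make the log easier to scan, the entries can be reduced to one line per finished request. This is a minimal sketch, assuming each JSON entry sits on a single line (entries wrapped across lines by the terminal are skipped). summarize is a hypothetical helper, and the two sample entries are abbreviated from the log above with most fields trimmed.

```python
import json


def summarize(lines):
    """Return (time, operation-or-path, statusCode, latency) per finished request."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are wrapped or truncated
        if not isinstance(entry, dict):
            continue
        data = entry.get("data", {})
        if "statusCode" not in data:
            continue  # "Starting request" entries carry no status
        rows.append((
            entry.get("time"),
            data.get("operationName") or data.get("path"),
            data["statusCode"],
            data.get("latencyStr") or data.get("duration"),
        ))
    return rows


# Two entries abbreviated from the log above (fields trimmed for brevity).
SAMPLE = [
    '{"level":"INFO","time":"2024-06-03T13:14:36.03393013Z","data":{"operationName":"testBucketStoreConnection","statusCode":200,"latencyStr":"11.142999ms"}}',
    '{"level":"INFO","time":"2024-06-03T13:14:36.034779852Z","data":{"requestID":"13318444034683680619","duration":"21.027626ms","statusCode":200,"path":"/graphql"}}',
]

if __name__ == "__main__":
    for row in summarize(SAMPLE):
        print(*row)
```

In this excerpt every finished request, including testBucketStoreConnection, reports status 200, so the request log alone does not show why the bucket could not be assigned.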
To ensure the test is not biased or tied to a non-standard S3 API, I used AWS S3 in a sandbox account, region eu-central-1, while the main bucket of my installation is on my local Ceph RadosGW storage:
$ kubectl exec -ti wandb-app-85555d5c5b-tjqfw -- env | grep BUCKET
Defaulted container "app" out of: app, init-db (init)
BUCKET=s3://3IAXODZ870OCD6TCFIAD:wsnNA1Vq2RHbKdrXSTaw0a09h79QQBZk0AXRUMNY@pves3.home.lab/wandb
OVERFLOW_BUCKET_ADDR=s3://3IAXODZ870OCD6TCFIAD:wsnNA1Vq2RHbKdrXSTaw0a09h79QQBZk0AXRUMNY@pves3.home.lab/wandb
BUCKET_QUEUE=internal://
Please, let me know if there’s any other test I can do.
I have Minio too; if necessary I can run the tests using Minio (the same setup that worked in the past and that I reported here: https://weightsandbiases.slack.com/archives/C0123GDE0NM/p1705056205042069?thread_ts=1704995296.871159&cid=C0123GDE0NM).
@flamarion @levinandrew - Is this a current issue? Sorry I hadn't heard of this until now. Should not be in the consultant queue anyways.
@Flamarion @Al I was incorrect in that scenario 2 worked without a code change. I could not connect from local server instance -> minio bucket on my localhost without making any code changes. I again had a false positive where the UI said the connection worked, and at the end of the day Thursday I forgot to check that the backend actually was able to connect. I still do not know if the code is in a working state, and if it is, for what scenarios it works. I am going to note my observations here below, but keep in mind that they could be due to issues with my local setup.
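Because the UI can report a working connection when the backend cannot actually reach the bucket, one way to cross-check is to write, read back, and delete a small object with the same credentials. This is a minimal sketch, not what the gorilla backend does: round_trip_check is a hypothetical helper that takes any client exposing put_object, get_object, and delete_object with boto3-style keyword arguments. It only proves connectivity from wherever the script runs, so it is most meaningful when run from the same host or container as the server.

```python
import uuid


def round_trip_check(client, bucket, prefix="byob-check/"):
    """Write, read back, and delete a small object; return True if the data matches."""
    key = f"{prefix}{uuid.uuid4().hex}.txt"
    payload = b"byob connectivity check"
    client.put_object(Bucket=bucket, Key=key, Body=payload)
    try:
        body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    finally:
        # Clean up the probe object even if the read fails.
        client.delete_object(Bucket=bucket, Key=key)
    return body == payload


# Example wiring (assumes boto3 is installed; endpoint, keys, and bucket are placeholders):
#   import boto3
#   client = boto3.client(
#       "s3",
#       endpoint_url="http://localhost:9000",
#       aws_access_key_id="...",
#       aws_secret_access_key="...",
#   )
#   print(round_trip_check(client, "my-team-bucket"))
```

Any exception raised by the client (for example an access or endpoint error) propagates to the caller, which is usually more informative than a plain False.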
I spin up the local server instance in a docker container, and the minio bucket on my localhost in another docker container. Therefore, I still needed to use ngrok to get them to connect together. (I didn't yet have a chance to test spinning up the minio BYOB bucket in the same docker container.) So, for scenario 2 I needed to make the same code change that removes the AWS GetBucketRegion SDK call so that the local gorilla app can connect to the bucket and upload objects to it. With that change, the app can connect and create a team. However, the SDK fails to upload. For example,
and then the logs show