hasura / graphql-engine


upgrade from 1.0.3 to 2.0.6 fails on metadata insert #7411

Open · andycmaj opened this issue 3 years ago

andycmaj commented 3 years ago

hasura fails to start after changing the docker image from 1.0.3 -> 2.0.6

using cli 2.0.7, after upgrading metadata to v3

error seen in the hasura logs:

Attaching to botany_hasura
hasura_1     | {"type":"startup","timestamp":"2021-08-18T03:04:57.258+0000","level":"info","detail":{"kind":"server_configuration","info":{"live_query_options":{"batch_size":100,"refetch_delay":1},"transaction_isolation":"ISOLATION LEVEL READ COMMITTED","enable_maintenance_mode":false,"enabled_log_types":["http-log","websocket-log","startup","webhook-log","query-log"],"server_host":"HostAny","enable_allowlist":false,"remote_schema_permissions":false,"log_level":"info","auth_hook_mode":null,"use_prepared_statements":true,"unauth_role":null,"stringify_numeric_types":false,"v1-boolean-null-collapse":false,"graceful_shutdown_timeout":60,"enabled_apis":["metadata","graphql","config","pgdump"],"enable_telemetry":false,"enable_console":false,"auth_hook":null,"infer_function_permissions":true,"experimental_features":[],"events_fetch_batch_size":100,"jwt_secret":{"audience":null,"claims_format":"json","claims_namespace":"https://hasura.io/jwt/claims","key":"<JWK REDACTED>","header":null,"type":"<TYPE REDACTED>","issuer":null},"cors_config":{"allowed_origins":"*","disabled":false,"ws_read_cookie":null},"websocket_compression_options":"NoCompression","console_assets_dir":null,"admin_secret_set":true,"port":8080,"websocket_keep_alive":"KeepAliveDelay {unKeepAliveDelay = Seconds {seconds = 5s}}"}}}
hasura_1     | {"type":"startup","timestamp":"2021-08-18T03:04:57.258+0000","level":"info","detail":{"kind":"postgres_connection","info":{"retries":1,"database_url":"postgre
s://postgres:...@postgres:5432/botany"}}}

====== ERROR HERE =======
hasura_1     | {"type":"startup","timestamp":"2021-08-18T03:04:57.258+0000","level":"error","detail":{"kind":"catalog_migrate","info":{"internal":{"statement":"\n    INSERT INTO hdb_catalog.hdb_metadata(id, metadata)\n    VALUES (1, $1::json)\n    ","prepared":true,"error":{"exec_status":"FatalError","hint":null,"message":"duplicate key value violates unique constraint \"hdb_metadata_pkey\"","status_code":"23505","description":"Key (id)=(1) already exists."},"arguments":["(Oid 114,Just (\"{\\\"sources\\\":[{\\\"kind\\\":\\\"postgres\\\",\\\"name\\\":\\\"default\\\",\\\"tables\\\":[{\\\"select_permissions\\\":[{\\\"role\\\":\\\"user\\\",\\\"permission\\\":{\\\"columns\\\":[\\\"id\\\",\\\"actorId\\\",\\\"type\\\",\\\"subject\\\",\\\"source\\\",\\\"data\\\",\\\"created_at\\\",\\\"timestamp\\\",\\\"aboutUserId\\\",\\\"activityHash\\\",\\\"updated_at\\\",\\\"wasBackfilled\\\",\\\"projectName\\\"],\\\"filter\\\":{\\\"_or\\\":[{\\\"mappedProjects\\\":{\\\"organizationIntegration\\\":{\\\"organizationId\\\":{\\\"_eq\\\":\\\"x-hasura-org-id\\\"}}}},{\\\"actorId\\\":{\\\"_eq\\\":\\\"X-Hasura-User-Id\\\"}},{\\\"aboutUserId\\\":{\\\"_eq\\\":\\\"X-Hasura-User-Id\\\"}}]}}}],\\\"object_relationships\\\":[{\\\"using\\\":{\\\"foreign_key_constraint_on\\\":\\\"aboutUserId\\\"},\\\"name\\\":\\\"aboutUser\\\"},{\\\"using\\\":{\\\"manual_configuration\\\":{\\\"remote_table\\\":{\\\"schema\\\":\\\"public\\\",\\\"name\\\":\\\"user\\\"},\\\"insertion_order\\\":null,\\\"column_mapping\\\":{\\\"actorId\\\":\\\"id\\\"}}},\\\"name\\\":\\\"actor\\\"}],\\\"tabl:

repro steps (sketched as shell commands below):

starting with a remote postgres backup from a hasura 1.0.3 server in production...

  1. start local hasura server 1.0.3 + cli 2.0.7
  2. upgrade metadata 1->2, then 2->3
  3. kill hasura
  4. change docker image to 2.0.6
  5. start hasura

expect: starts on the new server version
actual: does not start. see error above.
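
For concreteness, here is a rough shell sketch of the steps above. The image tags come from this report; the docker run flags and the scripts update-project-v2/v3 subcommands are assumptions based on the Hasura docs, and the logs suggest docker-compose was actually in use, where steps 3-5 amount to editing the image: tag instead.

```sh
# steps 1-2: run server 1.0.3 locally, then upgrade the CLI project metadata
docker run -d --name hasura -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL="$DATABASE_URL" \
  hasura/graphql-engine:v1.0.3
hasura scripts update-project-v2   # metadata/config v1 -> v2
hasura scripts update-project-v3   # metadata/config v2 -> v3

# steps 3-5: kill the container, swap the image to 2.0.6, restart
docker rm -f hasura
docker run -d --name hasura -p 8080:8080 \
  -e HASURA_GRAPHQL_DATABASE_URL="$DATABASE_URL" \
  hasura/graphql-engine:v2.0.6
```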

when I truncate the hdb_metadata table, hasura startup can proceed and the metadata gets re-populated from the metadata files.
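
For reference, that workaround as a single command (a sketch: $DATABASE_URL stands in for the connection string redacted in the logs above; the table name comes from the error):

```sh
# clear the stored metadata row so the v2 catalog migration can insert its own;
# on the next startup the project's metadata files can then be applied cleanly
psql "$DATABASE_URL" -c 'TRUNCATE hdb_catalog.hdb_metadata;'
```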

what would have been the correct way to do this migration locally and then apply it to the remote production db?

rikinsk commented 3 years ago

@andycmaj I have a couple of questions here

I am curious how you updated the metadata from v2 -> v3 as you mentioned. The CLI project version update command uses the server as the source of truth, hence it shouldn't have been possible if you hadn't already updated your server to 2.0.

Also, the local metadata is not involved during the server version update. During an update from v1 to v2, the server runs catalog migrations on the metadata stored in the hdb_catalog schema to update it to the latest format. From the error you shared, it seems like the catalog migration had already been done and an entry was already present in the hdb_metadata table, which is also surprising.
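
One way to check for that state before upgrading (a sketch: $DATABASE_URL is a placeholder; the table and constraint names come from the error above):

```sh
# if this returns a row with id = 1, the v2 catalog migration's
#   INSERT INTO hdb_catalog.hdb_metadata(id, metadata) VALUES (1, ...)
# will hit the hdb_metadata_pkey unique constraint, as in the startup error
psql "$DATABASE_URL" -c 'SELECT id FROM hdb_catalog.hdb_metadata;'
```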

In any case, your solution to truncate the hdb_metadata table was probably the best way to recover from this state.

As it is not exactly clear how precisely we ended up in this situation, it would be helpful if you could try the process you mentioned once again to see if this error is reproducible.

PS: Sorry for the delayed response.

andycmaj commented 3 years ago

hey @rikinsk thanks for the response!

I think I was confused about the order of metadata updates here and you're right, this repro no longer makes sense.

however I think I did identify why the server was not restarting once upgraded.

I had a remote schema with a header based on an x-hasura- prefixed environment variable. those seem to be prohibited in version 2.
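
One way to scan an exported metadata project for such headers (a sketch assuming the standard config v3 metadata directory layout, where remote schema definitions live in metadata/remote_schemas.yaml):

```sh
# case-insensitive search for x-hasura-* header names in remote schema configs;
# permission rules elsewhere in metadata/ legitimately use X-Hasura-* session
# variables, so the search is scoped to the remote schema file
grep -in 'x-hasura-' metadata/remote_schemas.yaml
```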

so I removed that header and between that and dropping the catalog metadata, all went well.

fwiw I think the metadata upgrade had to be done locally while running server v2, then committed, so that the new metadata is present when upgrading the production server.
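
Roughly this flow, sketched with the standard CLI metadata commands (the endpoint and secret are placeholders):

```sh
# against a local server already running v2: pull the upgraded metadata
# into the project directory and commit it
hasura metadata export
git add metadata/ && git commit -m 'metadata upgraded for server v2'

# after switching the production image to v2, apply the committed metadata
hasura metadata apply \
  --endpoint https://hasura.example.com \
  --admin-secret "$ADMIN_SECRET"
```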

this was pretty confusing to me, so maybe we could use some documentation with advice on how to upgrade production servers...

anyway I think you can probably close this, unless you think the metadata can be handled more smoothly. or call it a documentation bug, maybe.