kubeflow / metadata

Repository for assets related to Metadata.
Apache License 2.0

metadata-deployment and metadata-grpc-deployment pods go to CrashLoopBackOff because of db issue #228

Closed SatwikBhandiwad closed 3 years ago

SatwikBhandiwad commented 4 years ago

/kind bug

What steps did you take and what happened:

# kubectl logs -f metadata-deployment-6478dd54dc-2xbmf -n kubeflow
E0603 11:29:09.014566       1 main.go:98] Failed to create ML Metadata Store: The database is corrupted. In the given db, there are multiple schema_verion numbers registered in the MLMDEnv table. This may have resulted from a data race condition caused by other concurrent  MLMD's migration procedures: .
Retry 1/10.
Sleep 1.052s
E0603 11:29:10.504307       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 2/10.
Sleep 2.44s
E0603 11:29:14.102374       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 3/10.
Sleep 4.329s
E0603 11:29:19.474763       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 4/10.
Sleep 7.75s
E0603 11:29:27.307838       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 5/10.
Sleep 15.397s
E0603 11:29:42.857666       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 6/10.
Sleep 34.989s
E0603 11:30:19.001099       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 7/10.
Sleep 50.1s
E0603 11:31:09.195791       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 8/10.
Sleep 1m46.017s
E0603 11:32:59.692003       1 main.go:98] Failed to create ML Metadata Store: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection..
Retry 9/10.
Sleep 3m24.412s
# kubectl logs -f metadata-grpc-deployment-d7cc996c5-kz5rq -n kubeflow
2020-06-03 11:30:46.470520: W ml_metadata/metadata_store/metadata_store_server_main.cc:214] Connection Aborted with error: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.
2020-06-03 11:30:46.470793: I ml_metadata/metadata_store/metadata_store_server_main.cc:215] Retry attempt 0
2020-06-03 11:30:46.571421: W ml_metadata/metadata_store/metadata_store_server_main.cc:214] Connection Aborted with error: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.
2020-06-03 11:30:46.571481: I ml_metadata/metadata_store/metadata_store_server_main.cc:215] Retry attempt 1
2020-06-03 11:30:46.633912: W ml_metadata/metadata_store/metadata_store_server_main.cc:214] Connection Aborted with error: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.
2020-06-03 11:30:46.633963: I ml_metadata/metadata_store/metadata_store_server_main.cc:215] Retry attempt 2
2020-06-03 11:30:46.697338: W ml_metadata/metadata_store/metadata_store_server_main.cc:214] Connection Aborted with error: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.
2020-06-03 11:30:46.697388: I ml_metadata/metadata_store/metadata_store_server_main.cc:215] Retry attempt 3
2020-06-03 11:30:46.817047: W ml_metadata/metadata_store/metadata_store_server_main.cc:214] Connection Aborted with error: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.
2020-06-03 11:30:46.817117: I ml_metadata/metadata_store/metadata_store_server_main.cc:215] Retry attempt 4
2020-06-03 11:30:47.248867: F ml_metadata/metadata_store/metadata_store_server_main.cc:219] Non-OK-status: status status: Aborted: In the given db, MLMDEnv table exists but no schema_version can be found. This may be due to concurrent connection to the empty database. Please retry connection.MetadataStore cannot be created with the given connection config.

issue-label-bot[bot] commented 4 years ago

Issue Label Bot is not confident enough to auto-label this issue. See dashboard for more details.

SatwikBhandiwad commented 4 years ago

@zhenghuiwang @jlewi

jlewi commented 4 years ago

@neuromage @jessiezcc is there someone working on metadata that can look into this?

jlewi commented 4 years ago

/cc @zhitaoli

neuromage commented 4 years ago

@jlewi I think we've fixed this problem for KFP's metadata server.

Should we deprecate the metadata server here in favour of the deployment from KFP which is maintained and updated?

jlewi commented 4 years ago

@neuromage I believe this is being discussed in kubeflow/metadata#225

ntakouris commented 4 years ago

Make sure that none of the other pods are still in ContainerCreating status, pulling the images they require. kubectl get pods -A and kubectl describe pods -A can be used to check this.

Once everything had downloaded, the issue resolved itself on my machines (both Windows and macOS, using the Kubernetes bundled with Docker).
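A minimal sketch of that check, for anyone who wants to script it. It assumes the default column layout of kubectl get pods -A --no-headers (namespace, name, ready, status, ...); pending_pods is a hypothetical helper name, not part of kubectl.

```shell
# Reads `kubectl get pods -A --no-headers` output on stdin and prints
# "namespace/name status" for every pod whose STATUS column (4th field)
# is neither Running nor Completed.
# pending_pods is a hypothetical helper name, not a kubectl command.
pending_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}
```

It could be used as kubectl get pods -A --no-headers | pending_pods. Empty output would suggest that no pod is still pulling images or creating containers, though a pod reported as Running may still be failing its readiness checks.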

EKami commented 4 years ago

I have exactly the same issue and I'm new to Kubernetes/Kubeflow. I really don't know where to start debugging it. Is there a workaround for this? Thanks!

issue-label-bot[bot] commented 4 years ago

Issue-Label Bot is automatically applying the labels:

Label Probability
area/front-end 0.95
area/backend 0.83

jlewi commented 4 years ago

@karlschriek Does anyone from the metadata group want to look into this?