mike-sul opened 3 years ago
This is possibly a libcurl bug; it would be useful to see if you can reproduce it with just the curl CLI too.
It's not reproducible with the curl CLI: curl exits with a timeout (~2m30s) at some point while the network connection is down.
6 1024M 6 69.9M 0 0 116k 0 2:29:36 0:10:13 2:19:23 0
6 1024M 6 69.9M 0 0 115k 0 2:30:50 0:10:18 2:20:32 0* OpenSSL SSL_read: Connection timed out, errno 110
Perhaps the difference in behavior is caused by the difference in how libostree and the curl CLI use libcurl. The latter calls curl_easy_perform -> easy_perform -> easy_transfer.
Setting CURLOPT_TIMEOUT helped us to overcome the issue: https://github.com/foundriesio/meta-lmp/commit/2b742b96341a245b2a1c66bb93a837345e533c63.
I suppose it makes sense to expose an API that lets libostree clients set the timeout value, regardless of the root cause.
Thanks for the report. Agreed, we should support timeouts. I don't understand why this issue manifests as persistent DNS failures though - that seems like a bug elsewhere. But this needs more investigation.
> I don't understand why this issue manifests as persistent DNS failures though - that seems like a bug elsewhere.
There are several outstanding FDs/connections at the moment the network connection goes down, and each of these FDs is in a different state (domain name resolution, connecting, reading/writing, etc.) at that moment. I don't think the issue is related to the Could not resolve host errors, because the corresponding FDs are actually removed from the epoll set just after the resolution error/timeout occurs. The other FDs stay there forever, so pull_termination_condition never becomes true (pull_data->n_outstanding_content_fetches never becomes zero).
We reproduced this issue in two entirely different environments.
I was briefly looking at this after an IRC discussion around a FCOS node that was apparently hung pulling updates.
I think the simplest thing here is to add e.g. a default 2-minute timeout (also configurable via the repo config and the pull GVariant API) in the main pull code that errors out if no data is transferred. We already have _ostree_fetcher_bytes_transferred(); we just need to cache the previous value in the pull data and compare.
I started on the patch below, but then I realized that the initial half of the pull code is synchronous. We'll probably need to drive this logic down into the fetcher.
From ee44caead70ff91f83f02def8e830396bc3656c4 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Fri, 29 Oct 2021 14:04:06 -0400
Subject: [PATCH] wip
---
src/libostree/ostree-repo-pull-private.h | 1 +
src/libostree/ostree-repo-pull.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/src/libostree/ostree-repo-pull-private.h b/src/libostree/ostree-repo-pull-private.h
index 59b72e88..762492ac 100644
--- a/src/libostree/ostree-repo-pull-private.h
+++ b/src/libostree/ostree-repo-pull-private.h
@@ -119,6 +119,7 @@ typedef struct {
/* Objects imported via hardlink/reflink/copying or --localcache-repo*/
guint n_imported_metadata;
guint n_imported_content;
+ guint64 previous_bytes_transferred;
gboolean timestamp_check; /* Verify commit timestamps */
char *timestamp_check_from_rev;
diff --git a/src/libostree/ostree-repo-pull.c b/src/libostree/ostree-repo-pull.c
index 6bb040a4..224eae7e 100644
--- a/src/libostree/ostree-repo-pull.c
+++ b/src/libostree/ostree-repo-pull.c
@@ -62,6 +62,9 @@
* `n-network-retries` pull option. */
#define DEFAULT_N_NETWORK_RETRIES 5
+// Abort a pull operation if no bytes are transferred in this many seconds by default.
+#define DEFAULT_ZERO_BYTES_TIMEOUT_SECS 120
+
typedef struct {
OtPullData *pull_data;
GVariant *object;
@@ -3707,6 +3710,9 @@ all_requested_refs_have_commit (GHashTable *requested_refs /* (element-type Ostr
* is specified, `summary-bytes` must also be specified. Since: 2020.5
* * `disable-verify-bindings` (`b`): Disable verification of commit bindings.
* Since: 2020.9
+ * * `zerodata-timeout-secs` (`t`): Default 120. Abort pull operation if no data
+ * is transferred in a continuous window of this number of seconds.
+ * Since: 2021.6
*/
gboolean
ostree_repo_pull_with_options (OstreeRepo *self,
@@ -3732,6 +3738,7 @@ ostree_repo_pull_with_options (OstreeRepo *self,
char **configured_branches = NULL;
guint64 bytes_transferred;
guint64 end_time;
+ guint64 zerodata_timeout_secs = DEFAULT_ZERO_BYTES_TIMEOUT_SECS;
guint update_frequency = 0;
OstreeRepoPullFlags flags = 0;
const char *dir_to_pull = NULL;
@@ -3740,6 +3747,7 @@ ostree_repo_pull_with_options (OstreeRepo *self,
g_autoptr(GVariantIter) collection_refs_iter = NULL;
g_autofree char **override_commit_ids = NULL;
g_autoptr(GSource) update_timeout = NULL;
+ g_autoptr(GSource) zerodata_timeout_source = NULL;
gboolean opt_per_object_fsync = FALSE;
gboolean opt_gpg_verify_set = FALSE;
gboolean opt_gpg_verify_summary_set = FALSE;
@@ -3801,6 +3809,7 @@ ostree_repo_pull_with_options (OstreeRepo *self,
(void) g_variant_lookup (options, "timestamp-check", "b", &pull_data->timestamp_check);
(void) g_variant_lookup (options, "timestamp-check-from-rev", "s", &pull_data->timestamp_check_from_rev);
(void) g_variant_lookup (options, "max-metadata-size", "t", &pull_data->max_metadata_size);
+ (void) g_variant_lookup (options, "zerodata-timeout-secs", "t", &zerodata_timeout_secs);
(void) g_variant_lookup (options, "append-user-agent", "s", &pull_data->append_user_agent);
opt_n_network_retries_set =
g_variant_lookup (options, "n-network-retries", "u", &pull_data->n_network_retries);
@@ -4905,6 +4914,11 @@ ostree_repo_pull_with_options (OstreeRepo *self,
g_source_attach (update_timeout, pull_data->main_context);
}
+ zerodata_timeout_source = g_timeout_source_new_seconds (zerodata_timeout_secs);
+ g_source_set_priority (zerodata_timeout_source, G_PRIORITY_HIGH);
+ g_source_set_callback (zerodata_timeout_source, zerodata_timeout_cb, pull_data, NULL);
+ g_source_attach (zerodata_timeout_source, pull_data->main_context);
+
/* Now await work completion */
while (!pull_termination_condition (pull_data))
g_main_context_iteration (pull_data->main_context, TRUE);
@@ -5150,6 +5164,8 @@ ostree_repo_pull_with_options (OstreeRepo *self,
g_main_context_unref (pull_data->main_context);
if (update_timeout)
g_source_destroy (update_timeout);
+ if (zerodata_timeout_source)
+ g_source_destroy (zerodata_timeout_source);
g_strfreev (configured_branches);
g_clear_object (&pull_data->fetcher);
g_clear_pointer (&pull_data->extra_headers, (GDestroyNotify)g_variant_unref);
--
2.31.1
Met the same issue on CentOS Stream 9 with rpm-ostree-2022.8-2.el9.x86_64
Met the same issue on ubuntu-16.04 with ostree-2022.4 (libostree Version: '2022.4', Git: v2022.4).
Context

Steps to reproduce

ostree pull, for example. The ostree pull command hangs forever after performing the steps listed above.

Logs

- Logs during pulling when the network connection is up
- Logs when the network connection is down
- Logs after the network connection is down for over ~16 minutes

Strace

- Strace before the hang
- Strace during the state change (after around 16 minutes of networking down)
- Strace after the transition to the "hang forever" state