ostreedev / ostree

Operating system and container binary deployment and upgrades
https://ostreedev.github.io/ostree/

ostree pull hangs forever if a network connection is down for more than ~16 minutes #2383

Open mike-sul opened 3 years ago

mike-sul commented 3 years ago

Context

curl --version
curl 7.72.0-DEV (x86_64-pc-linux-gnu) libcurl/7.78.0-DEV OpenSSL/1.1.1 zlib/1.2.11
ostree --version
libostree:
 Version: '2020.7'
 Git: v2020.7
 Features:
  - libcurl
  - gpgme
  - libarchive
  - selinux
  - systemd
  - release
  - p2p

Steps to reproduce

  1. Run ostree pull, for example
    ostree --repo=repo pull ostree-proxy e65818d7969c48a32235292b3eadf98b1163b901f5cd2b2e8b902592934b7472
  2. Wait until the metadata is pulled and fetching of content objects starts (requests for objects/xx/<checksum>.filez).
  3. Take the network connection down.
  4. Wait for more than 16 minutes.
  5. Bring the network connection back up.

The ostree pull command hangs forever after performing the steps listed above.

Logs

Logs during pulling when the network connection is up

(ostree pull:28521): OSTree-DEBUG: 11:29:28.230: starting fetch of 33de3cf9181e8342a29288aa300f87c0847710ad17b2eaef38b93b3462042e0f.file
* STATE: INIT => CONNECT handle 0x55f88cbe0c78; line 1761 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Re-using existing connection! (#2) with host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* Connected to 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io (34.134.229.197) port 8443 (#2)
* STATE: CONNECT => DO handle 0x55f88cbe0c78; line 1815 (connection #2)
> GET /ostree/objects/33/de3cf9181e8342a29288aa300f87c0847710ad17b2eaef38b93b3462042e0f.filez HTTP/1.1^M
Host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443^M
User-Agent: libostree/2020.7^M
Accept: */*^M
^M
* STATE: DO => DID handle 0x55f88cbe0c78; line 2049 (connection #2)
* STATE: DID => PERFORMING handle 0x55f88cbe0c78; line 2168 (connection #2)
(ostree pull:28521): OSTree-DEBUG: 11:29:28.233: write of f6057a884a4900493b185331449eccb3a5371186f35bf5a5dda40ee0866fca13.file complete
* Mark bundle as not supporting multiuse
* HTTP 1.1 or later with persistent connection
< HTTP/1.1 200 OK^M
< Content-Type: application/octet-stream^M
< Date: Tue, 22 Jun 2021 08:29:28 GMT^M
< Content-Length: 120^M
< ^M
* STATE: PERFORMING => DONE handle 0x55f88cbbd518; line 2367 (connection #1)
* multi_done
* Connection #1 to host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io left intact
* Expire cleared (transfer 0x55f88cbbd518)
(ostree pull:28521): OSTree-DEBUG: 11:29:28.289: fetch of 1aaa241a9b5161d92a739a38af7b92a80b84c3a999d157572dc9848865442065.file complete
(ostree pull:28521): OSTree-DEBUG: 11:29:28.289: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5

Logs when the network connection is down

(ostree pull:28521): OSTree-DEBUG: 11:29:28.290: write of 1aaa241a9b5161d92a739a38af7b92a80b84c3a999d157572dc9848865442065.file complete
* STATE: PERFORMING => DONE handle 0x55f88cbd8208; line 2367 (connection #0)
* multi_done
* Connection #0 to host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io left intact
* Expire cleared (transfer 0x55f88cbd8208)
(ostree pull:28521): OSTree-DEBUG: 11:29:28.291: fetch of 8f455359f2e051590e7c4f26e2c1852f3029b168c5826a8e73ab898ab55f86b5.file complete
(ostree pull:28521): OSTree-DEBUG: 11:29:28.291: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
(ostree pull:28521): OSTree-DEBUG: 11:29:28.291: starting fetch of 855e67e413992599a5fe80e1f60d8a94cc6bffa8fc21869b8beee18135b758f2.file
* STATE: INIT => CONNECT handle 0x55f88cbbd518; line 1761 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Re-using existing connection! (#0) with host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* Connected to 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io (34.134.229.197) port 8443 (#0)
* STATE: CONNECT => DO handle 0x55f88cbbd518; line 1815 (connection #0)
> GET /ostree/objects/85/5e67e413992599a5fe80e1f60d8a94cc6bffa8fc21869b8beee18135b758f2.filez HTTP/1.1^M
Host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443^M
User-Agent: libostree/2020.7^M
Accept: */*^M
^M
* STATE: DO => DID handle 0x55f88cbbd518; line 2049 (connection #0)
* STATE: DID => PERFORMING handle 0x55f88cbbd518; line 2168 (connection #0)
(ostree pull:28521): OSTree-DEBUG: 11:29:28.294: write of 8f455359f2e051590e7c4f26e2c1852f3029b168c5826a8e73ab898ab55f86b5.file complete
**<hangs here for about 16 minutes>**

Logs after the network connection is down for over ~16 minutes

* OpenSSL SSL_read: Network is unreachable, errno 101
* Connection died, retrying a fresh connect(retry count: 1)
* multi_done
* The cache now contains 7 members
* Closing connection 0
* Issue another request to this URL: 'https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/85/5e67e413992599a5fe80e1f60d8a94cc6bffa8fc21869b8beee18135b758f2.filez'
* STATE: PERFORMING => CONNECT handle 0x55f88cbbd518; line 2344 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Added connection 8. The cache now contains 8 members
* STATE: CONNECT => RESOLVING handle 0x55f88cbbd518; line 1807 (connection #8)
* OpenSSL SSL_read: Network is unreachable, errno 101
* Connection died, retrying a fresh connect(retry count: 1)
* multi_done
* The cache now contains 7 members
* Closing connection 3
* Issue another request to this URL: 'https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/08/386f6a5052f7739ec80bd791dd5f1906f980475e7dca02d4f2b24c61f165d0.filez'
* STATE: PERFORMING => CONNECT handle 0x55f88cbd6cf8; line 2344 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Added connection 9. The cache now contains 8 members
* STATE: CONNECT => RESOLVING handle 0x55f88cbd6cf8; line 1807 (connection #9)
* OpenSSL SSL_read: Network is unreachable, errno 101
* Connection died, retrying a fresh connect(retry count: 1)
* multi_done
* The cache now contains 7 members
* Closing connection 2
* Issue another request to this URL: 'https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/33/de3cf9181e8342a29288aa300f87c0847710ad17b2eaef38b93b3462042e0f.filez'
* STATE: PERFORMING => CONNECT handle 0x55f88cbe0c78; line 2344 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Added connection 10. The cache now contains 8 members
* STATE: CONNECT => RESOLVING handle 0x55f88cbe0c78; line 1807 (connection #10)
* OpenSSL SSL_read: Network is unreachable, errno 101
* Connection died, retrying a fresh connect(retry count: 1)
* multi_done
* The cache now contains 7 members
* Closing connection 1
* Issue another request to this URL: 'https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/f4/0d2aa2685cc4bc2c36b36d2e4beead64c317fa6493a2a11abebf4a9999c5b6.filez'
* STATE: PERFORMING => CONNECT handle 0x55f88cbd9898; line 2344 (connection #-5000)
* Found bundle for host 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io: 0x55f88cbd4078 [serially]
* Added connection 11. The cache now contains 8 members
* STATE: CONNECT => RESOLVING handle 0x55f88cbd9898; line 1807 (connection #11)
* Could not resolve host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* The cache now contains 7 members
* Closing connection 8
* Expire cleared (transfer 0x55f88cbbd518)
(ostree pull:28521): OSTree-DEBUG: 11:46:41.829: _ostree_fetcher_should_retry_request: error: 126:0 While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/85/5e67e413992599a5fe80e1f60d8a94cc6bffa8fc21869b8beee18135b758f2.filez: [6] Couldn't resolve host name, n_retries_remaining: 5
(ostree pull:28521): OSTree-DEBUG: 11:46:41.830: Request caught error: While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/85/5e67e413992599a5fe80e1f60d8a94cc6bffa8fc21869b8beee18135b758f2.filez: [6] Couldn't resolve host name
* Could not resolve host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* The cache now contains 6 members
* Closing connection 9
* Expire cleared (transfer 0x55f88cbd6cf8)
(ostree pull:28521): OSTree-DEBUG: 11:46:45.920: _ostree_fetcher_should_retry_request: error: 126:0 While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/08/386f6a5052f7739ec80bd791dd5f1906f980475e7dca02d4f2b24c61f165d0.filez: [6] Couldn't resolve host name, n_retries_remaining: 5
(ostree pull:28521): OSTree-DEBUG: 11:46:45.920: Request caught error: While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/08/386f6a5052f7739ec80bd791dd5f1906f980475e7dca02d4f2b24c61f165d0.filez: [6] Couldn't resolve host name
* Could not resolve host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* The cache now contains 5 members
* Closing connection 10
* Expire cleared (transfer 0x55f88cbe0c78)
(ostree pull:28521): OSTree-DEBUG: 11:46:45.924: _ostree_fetcher_should_retry_request: error: 126:0 While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/33/de3cf9181e8342a29288aa300f87c0847710ad17b2eaef38b93b3462042e0f.filez: [6] Couldn't resolve host name, n_retries_remaining: 5
(ostree pull:28521): OSTree-DEBUG: 11:46:45.925: Request caught error: While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/33/de3cf9181e8342a29288aa300f87c0847710ad17b2eaef38b93b3462042e0f.filez: [6] Couldn't resolve host name
* Could not resolve host: 84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io
* The cache now contains 4 members
* Closing connection 11
* Expire cleared (transfer 0x55f88cbd9898)
(ostree pull:28521): OSTree-DEBUG: 11:46:45.928: _ostree_fetcher_should_retry_request: error: 126:0 While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/f4/0d2aa2685cc4bc2c36b36d2e4beead64c317fa6493a2a11abebf4a9999c5b6.filez: [6] Couldn't resolve host name, n_retries_remaining: 5
(ostree pull:28521): OSTree-DEBUG: 11:46:45.929: Request caught error: While fetching https://84d6f7e1-837c-4b06-b43a-bc110f368aff.ostree.foundries.io:8443/ostree/objects/f4/0d2aa2685cc4bc2c36b36d2e4beead64c317fa6493a2a11abebf4a9999c5b6.filez: [6] Couldn't resolve host name

**<hang forever>**

Strace

Strace before the hang

poll([{fd=7, events=POLLIN}, {fd=12, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 9, 998) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=12, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 9, 999) = 0 (Timeout)

Strace during the state change (after around 16 minutes of networking down)

write(7, "\1\0\0\0\0\0\0\0", 8)         = 8
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0
read(12, 0x55f88d361993, 5)             = -1 ENETUNREACH (Network is unreachable)
write(2, "* ", 2)                       = 2
write(2, "OpenSSL SSL_read: Network is unr"..., 52) = 52
write(2, "* ", 2)                       = 2
write(2, "Connection died, retrying a fres"..., 58) = 58
write(2, "* ", 2)                       = 2
write(2, "multi_done\n", 11)            = 11
write(2, "* ", 2)                       = 2
write(2, "The cache now contains 7 members"..., 33) = 33
write(2, "* ", 2)                       = 2
write(2, "Closing connection 0\n", 21)  = 21
read(12, "", 5)                         = 0
write(12, "\27\3\3\0\23\324\"\23\262\0263\34\303zg\265\17\316?z\0251j\6", 24) = -1 EPIPE (Broken pipe)
--- SIGPIPE {si_signo=SIGPIPE, si_code=SI_USER, si_pid=28521, si_uid=1000} ---
close(12)                               = 0
write(2, "* ", 2)                       = 2
write(2, "Issue another request to this UR"..., 194) = 194
write(2, "* ", 2)                       = 2
write(2, "STATE: PERFORMING => CONNECT han"..., 82) = 82
write(2, "* ", 2)                       = 2
write(2, "Found bundle for host 84d6f7e1-8"..., 106) = 106
write(2, "* ", 2)                       = 2
write(2, "Added connection 8. The cache no"..., 53) = 53
socketpair(AF_UNIX, SOCK_STREAM, 0, [12, 25]) = 0
clone(child_stack=0x7f03139e2df0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f03139e39d0, tls=0x7f03139e3700, child_tidptr=0x7f03139e39d0) = 30444
write(2, "* ", 2)                       = 2
write(2, "STATE: CONNECT => RESOLVING hand"..., 77) = 77
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0
write(7, "\1\0\0\0\0\0\0\0", 8)         = 8
poll([{fd=7, events=POLLIN}, {fd=12, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 9, 1) = 1 ([{fd=7, revents=POLLIN}])
read(7, "\2\0\0\0\0\0\0\0", 16)         = 8
poll([{fd=7, events=POLLIN}, {fd=12, events=POLLIN}, {fd=17, events=POLLIN}, {fd=18, events=POLLIN}, {fd=19, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 9, 1) = 0 (Timeout)
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0

Strace after the transition to the "hang forever" state

write(1, "(ostree pull:28521): OSTree-DEBU"..., 275) = 275
sendmsg(12, {msg_name={sa_family=AF_UNIX, sun_path="/run/systemd/journal/socket"}, msg_namelen=29, msg_iov=[{iov_base="CODE_FILE=src/libostree/ostree-f"..., iov_len=45}, {iov_base="\n", iov_len=1}, {iov_base="CODE_LINE=175", iov_len=13}, {iov_base="\n", iov_len=1}, {iov_base="CODE_FUNC=_ostree_fetcher_journa"..., iov_len=41}, {iov_base="\n", iov_len=1}, {iov_base="MESSAGE=libostree HTTP error fro"..., iov_len=244}, {iov_base="\n", iov_len=1}, {iov_base="MESSAGE_ID=f02bce89a54e4efab3a94"..., iov_len=43}, {iov_base="\n", iov_len=1}, {iov_base="OSTREE_REMOTE=ostree-proxy", iov_len=26}, {iov_base="\n", iov_len=1}, {iov_base="OSTREE_URL=https://84d6f7e1-837c"..., iov_len=167}, {iov_base="\n", iov_len=1}, {iov_base="PRIORITY=3", iov_len=10}, {iov_base="\n", iov_len=1}, {iov_base="SYSLOG_IDENTIFIER=", iov_len=18}, {iov_base="ostree", iov_len=6}, {iov_base="\n", iov_len=1}], msg_iovlen=19, msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 622
rt_sigaction(SIGPIPE, NULL, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0
rt_sigaction(SIGPIPE, {sa_handler=SIG_IGN, sa_mask=[], sa_flags=SA_RESTORER, sa_restorer=0x7f03207438a0}, NULL, 8) = 0
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 959) = 1 ([{fd=7, revents=POLLIN}])
read(7, "\1\0\0\0\0\0\0\0", 16)         = 8
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 958) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 0) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 1000) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 998) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 998) = 0 (Timeout)
poll([{fd=7, events=POLLIN}, {fd=20, events=POLLIN}, {fd=21, events=POLLIN}, {fd=22, events=POLLIN}, {fd=23, events=POLLIN}], 5, 999) = 0 (Timeout)
cgwalters commented 3 years ago

This is possibly a libcurl bug; it would be useful to see whether you can reproduce it with just the curl CLI too.

mike-sul commented 3 years ago

It's not reproducible with the curl CLI; curl times out and exits (after ~2m30s) at some point while the network connection is down.

  6 1024M    6 69.9M    0     0   116k      0  2:29:36  0:10:13  2:19:23     0
  6 1024M    6 69.9M    0     0   115k      0  2:30:50  0:10:18  2:20:32     0* OpenSSL SSL_read: Connection timed out, errno 110

Perhaps the difference in behavior is caused by the difference in how libostree and the curl CLI use libcurl. The latter does curl_easy_perform -> easy_perform -> easy_transfer.
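
A rough sketch of the two usage patterns, for reference (illustrative code only, not the actual sources of either the curl tool or libostree; libostree's real fetcher integrates the multi interface with the GLib main loop rather than the simple wait loop shown here):

```c
#include <curl/curl.h>

/* Blocking "easy" pattern, roughly what the curl CLI drives: the whole
 * transfer runs inside one curl_easy_perform() call, so a transfer-wide
 * CURLOPT_TIMEOUT (or --max-time on the CLI) can bound it. */
static CURLcode
fetch_blocking (const char *url)
{
  CURL *h = curl_easy_init ();
  curl_easy_setopt (h, CURLOPT_URL, url);
  curl_easy_setopt (h, CURLOPT_TIMEOUT, 150L); /* give up after 150s total */
  CURLcode rc = curl_easy_perform (h);
  curl_easy_cleanup (h);
  return rc;
}

/* Non-blocking "multi" pattern: the caller drives many easy handles from
 * its own loop, so nothing bounds an individual transfer unless a timeout
 * is set on each handle or the caller implements its own stall detection. */
static void
fetch_multi (const char *url)
{
  CURLM *multi = curl_multi_init ();
  CURL *h = curl_easy_init ();
  curl_easy_setopt (h, CURLOPT_URL, url);
  curl_multi_add_handle (multi, h);

  int running = 1;
  while (running)
    {
      curl_multi_perform (multi, &running);
      /* Wait (up to 1s) for socket activity or an internal curl timeout. */
      curl_multi_wait (multi, NULL, 0, 1000, NULL);
    }

  curl_multi_remove_handle (multi, h);
  curl_easy_cleanup (h);
  curl_multi_cleanup (multi);
}
```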

mike-sul commented 3 years ago

Setting CURLOPT_TIMEOUT helped us work around the issue: https://github.com/foundriesio/meta-lmp/commit/2b742b96341a245b2a1c66bb93a837345e533c63. I suppose it makes sense to expose an API that lets libostree clients set the timeout value, regardless of the root cause.
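
The workaround boils down to putting a per-request cap on each curl easy handle before it is handed to the multi interface. A minimal, hypothetical sketch of the idea (the helper name and the timeout values are illustrative, not the actual change):

```c
#include <curl/curl.h>

/* Hypothetical helper: bound every request driven through the multi
 * interface with a whole-transfer timeout and a connect-phase timeout,
 * so a dead connection eventually surfaces as an error instead of a hang. */
static void
add_request_with_timeout (CURLM *multi, const char *url)
{
  CURL *h = curl_easy_init ();
  curl_easy_setopt (h, CURLOPT_URL, url);
  curl_easy_setopt (h, CURLOPT_TIMEOUT, 120L);       /* whole-transfer cap */
  curl_easy_setopt (h, CURLOPT_CONNECTTIMEOUT, 30L); /* connect-phase cap */
  curl_multi_add_handle (multi, h);
}
```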

cgwalters commented 2 years ago

Thanks for the report. Agree, we should support timeouts. I don't understand why this issue manifests as persistent DNS failures though - that seems like a bug elsewhere. But this needs more investigation.

mike-sul commented 2 years ago

> I don't understand why this issue manifests as persistent DNS failures though - that seems like a bug elsewhere.

There are several in-flight FDs/connections at the moment the network connection goes down, and each of them is in a different state (domain name resolution, connecting, reading/writing, etc.). I don't think the issue is related to the "Could not resolve host" errors, because the corresponding FDs are removed from the epoll set right after the resolution error/timeout occurs. The other FDs stay there forever, so pull_termination_condition never becomes true (pull_data->n_outstanding_content_fetches never drops to zero). We reproduced this issue in two completely different environments.
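
For context, the wait loop in ostree_repo_pull_with_options() (also visible in the patch further down in this thread) is essentially the following, which is why a request that neither completes nor errors keeps the pull spinning forever:

```c
/* Simplified view of the pull wait loop: iterate the main context until
 * pull_termination_condition() reports that all outstanding fetches have
 * finished or an error was caught.  A fetch that is stuck inside the
 * fetcher with no completion and no error keeps
 * n_outstanding_content_fetches above zero, so this never exits. */
while (!pull_termination_condition (pull_data))
  g_main_context_iteration (pull_data->main_context, TRUE);
```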

cgwalters commented 2 years ago

I was briefly looking at this after an IRC discussion around an FCOS node that was apparently hung pulling updates.

I think the simplest thing here is to add e.g. a default 2-minute timeout (also configurable via the repo config and the pull GVariant API) in the main pull code that errors out if no data is transferred. We already have _ostree_fetcher_bytes_transferred(); we just need to cache the previous value in the pull data and compare.
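
A minimal sketch of what that comparison might look like as a periodic callback (assumptions: the previous_bytes_transferred field from the WIP patch in the next comment, and an imaginary ot_pull_data_set_timeout_error() helper standing in for the real error propagation):

```c
/* Hypothetical periodic stall check: compare the fetcher's byte counter
 * against the value recorded on the previous tick.  If fetches are still
 * outstanding and no bytes moved, record a timeout error so that
 * pull_termination_condition() observes it and the main loop exits. */
static gboolean
zerodata_timeout_cb (gpointer user_data)
{
  OtPullData *pull_data = user_data;
  guint64 now = _ostree_fetcher_bytes_transferred (pull_data->fetcher);

  if (pull_data->n_outstanding_content_fetches > 0
      && now == pull_data->previous_bytes_transferred)
    {
      /* Imaginary helper standing in for the real error path. */
      ot_pull_data_set_timeout_error (pull_data,
                                      "No data transferred within the timeout window");
      return G_SOURCE_REMOVE;   /* stop checking; the pull aborts */
    }

  pull_data->previous_bytes_transferred = now;
  return G_SOURCE_CONTINUE;     /* keep checking */
}
```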

cgwalters commented 2 years ago

I started on the below, but then I realized that the initial half of the pull code is synchronous. We'll probably need to drive this logic down into the fetcher.

From ee44caead70ff91f83f02def8e830396bc3656c4 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Fri, 29 Oct 2021 14:04:06 -0400
Subject: [PATCH] wip

---
 src/libostree/ostree-repo-pull-private.h |  1 +
 src/libostree/ostree-repo-pull.c         | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/src/libostree/ostree-repo-pull-private.h b/src/libostree/ostree-repo-pull-private.h
index 59b72e88..762492ac 100644
--- a/src/libostree/ostree-repo-pull-private.h
+++ b/src/libostree/ostree-repo-pull-private.h
@@ -119,6 +119,7 @@ typedef struct {
   /* Objects imported via hardlink/reflink/copying or  --localcache-repo*/
   guint             n_imported_metadata;
   guint             n_imported_content;
+  guint64           previous_bytes_transferred;

   gboolean          timestamp_check; /* Verify commit timestamps */
   char             *timestamp_check_from_rev;
diff --git a/src/libostree/ostree-repo-pull.c b/src/libostree/ostree-repo-pull.c
index 6bb040a4..224eae7e 100644
--- a/src/libostree/ostree-repo-pull.c
+++ b/src/libostree/ostree-repo-pull.c
@@ -62,6 +62,9 @@
  * `n-network-retries` pull option. */
 #define DEFAULT_N_NETWORK_RETRIES 5

+// Abort a pull operation if no bytes are transferred in this many seconds by default.
+#define DEFAULT_ZERO_BYTES_TIMEOUT_SECS 120
+
 typedef struct {
   OtPullData  *pull_data;
   GVariant    *object;
@@ -3707,6 +3710,9 @@ all_requested_refs_have_commit (GHashTable *requested_refs /* (element-type Ostr
  *     is specified, `summary-bytes` must also be specified. Since: 2020.5
  *   * `disable-verify-bindings` (`b`): Disable verification of commit bindings.
  *     Since: 2020.9
+ *   * `zerodata-timeout-secs` (`t`): Default 120.  Abort pull operation if no data 
+ *      is transferred in a continuous window of this number of seconds.
+ *     Since: 2021.6
  */
 gboolean
 ostree_repo_pull_with_options (OstreeRepo             *self,
@@ -3732,6 +3738,7 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
   char **configured_branches = NULL;
   guint64 bytes_transferred;
   guint64 end_time;
+  guint64 zerodata_timeout_secs = DEFAULT_ZERO_BYTES_TIMEOUT_SECS;
   guint update_frequency = 0;
   OstreeRepoPullFlags flags = 0;
   const char *dir_to_pull = NULL;
@@ -3740,6 +3747,7 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
   g_autoptr(GVariantIter) collection_refs_iter = NULL;
   g_autofree char **override_commit_ids = NULL;
   g_autoptr(GSource) update_timeout = NULL;
+  g_autoptr(GSource) zerodata_timeout_source = NULL;
   gboolean opt_per_object_fsync = FALSE;
   gboolean opt_gpg_verify_set = FALSE;
   gboolean opt_gpg_verify_summary_set = FALSE;
@@ -3801,6 +3809,7 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
       (void) g_variant_lookup (options, "timestamp-check", "b", &pull_data->timestamp_check);
       (void) g_variant_lookup (options, "timestamp-check-from-rev", "s", &pull_data->timestamp_check_from_rev);
       (void) g_variant_lookup (options, "max-metadata-size", "t", &pull_data->max_metadata_size);
+      (void) g_variant_lookup (options, "zerodata-timeout-secs", "t", &zerodata_timeout_secs);
       (void) g_variant_lookup (options, "append-user-agent", "s", &pull_data->append_user_agent);
       opt_n_network_retries_set =
         g_variant_lookup (options, "n-network-retries", "u", &pull_data->n_network_retries);
@@ -4905,6 +4914,11 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
       g_source_attach (update_timeout, pull_data->main_context);
     }

+  zerodata_timeout_source = g_timeout_source_new_seconds (zerodata_timeout_secs);
+  g_source_set_priority (zerodata_timeout_source, G_PRIORITY_HIGH);
+  g_source_set_callback (zerodata_timeout_source, zerodata_timeout_cb, pull_data, NULL);
+  g_source_attach (zerodata_timeout_source, pull_data->main_context);
+
   /* Now await work completion */
   while (!pull_termination_condition (pull_data))
     g_main_context_iteration (pull_data->main_context, TRUE);
@@ -5150,6 +5164,8 @@ ostree_repo_pull_with_options (OstreeRepo             *self,
   g_main_context_unref (pull_data->main_context);
   if (update_timeout)
     g_source_destroy (update_timeout);
+  if (zerodata_timeout_source)
+    g_source_destroy (zerodata_timeout_source);
   g_strfreev (configured_branches);
   g_clear_object (&pull_data->fetcher);
   g_clear_pointer (&pull_data->extra_headers, (GDestroyNotify)g_variant_unref);
-- 
2.31.1
cheese commented 2 years ago

Met the same issue on CentOS Stream 9 with rpm-ostree-2022.8-2.el9.x86_64

cheng0214 commented 1 year ago

Met the same issue on ubuntu-16.04 with ostree-2022.4

libostree:
 Version: '2022.4'
 Git: v2022.4
 Features: