prefix-dev / pixi


Failed to download package / connection closed before message completed #876

Closed: william-alhant closed this issue 1 month ago

william-alhant commented 7 months ago

Reproducible example

Environment/reproduction:
os: Linux RHEL7
pixi version: 0.14

pixi.toml:

[project]
[...]
channels = ["https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge"]
platforms = ["linux-64", "linux-aarch64"]

[dependencies]
sysroot_linux-64 = "~=2.17.0"
python = "~=3.8.0"
mypy = "~=0.910.0"
flake8 = "~=4.0.0"
pylint = "~=2.14.0"
prettytable = "~=0.7.0"
ipython = "~=7.14.0"
pandas = "~=1.4.0"
requests = "~=2.26.0"
gcc = "~=11.3.0"
gxx = "~=11.3.0"
zlib = "~=1.2.0"
sqlite = "~=3.35.0"
scons = "~=4.6.0"

[system-requirements]
linux = "3.10.0"

Issue description

Hello,

Thanks for developing pixi; I think it is quite nice, and I would like to use it at work on the project I'm working on. That said, I am encountering an error when downloading packages.

The error looks like this:

  × failed to download package https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda
  ├─▶ Request error: error sending request for url (https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda): connection closed
  │   before message completed
  ├─▶ error sending request for url (https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda): connection closed before message
  │   completed
  ╰─▶ connection closed before message completed

I ran this a few times; it does not always fail on the same package. As you can see, I am using a JFrog Artifactory mirror (set up as a channel in pixi.toml). I can't tell whether this is a client-side or a server-side issue. From my perspective, it could be either an abnormal connection termination and/or an oversight in error handling. The error occurs roughly a couple of minutes after the download/installation starts when it runs on a loaded machine; I have had more success on a less loaded machine.

Here are some logs I gathered by setting 'RUST_LOG=hyper::proto::h1::conn=debug,hyper::client=debug':

DEBUG hyper::client::pool: pooling idle connection for ("https", artifactory.eu02.arm.com)
DEBUG hyper::client::pool: reuse idle connection for ("https", artifactory.eu02.arm.com)
DEBUG hyper::client::pool: reuse idle connection for ("https", artifactory.eu02.arm.com)
DEBUG hyper::client::pool: reuse idle connection for ("https", artifactory.eu02.arm.com)
DEBUG hyper::client::pool: reuse idle connection for ("https", artifactory.eu02.arm.com)
DEBUG hyper::client::connect::dns: resolving host="artifactory.eu02.arm.com"
DEBUG hyper::client::connect::dns: resolving host="artifactory.eu02.arm.com"
DEBUG hyper::client::connect::dns: resolving host="artifactory.eu02.arm.com"
DEBUG hyper::proto::h1::conn: incoming body is content-length (50124 bytes)
DEBUG hyper::proto::h1::conn: incoming body is content-length (53242 bytes)
DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)
DEBUG hyper::client::connect::http: connecting to [...]:443
DEBUG hyper::client::connect::dns: resolving host="artifactory.eu02.arm.com"
DEBUG hyper::client::connect::http: connecting to [...]:443
  × failed to download package https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda
  ├─▶ Request error: error sending request for url (https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda): connection closed
  │   before message completed
  ├─▶ error sending request for url (https://artifactory.eu02.arm.com/artifactory/api/conda/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda): connection closed before message
  │   completed
  ╰─▶ connection closed before message completed

Expected behavior

Expected behavior is to handle the error more gracefully and continue the download/installation if possible.
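
For illustration, here is a minimal sketch of the kind of retry-on-transient-failure behavior I have in mind, written against plain reqwest/tokio rather than pixi's actual download code; the helper name, URL, attempt count, and backoff below are placeholders.

use std::time::Duration;

// Hypothetical retry wrapper (not pixi's actual code): retry a download a few
// times with a short backoff instead of failing the whole installation on the
// first "connection closed before message completed" error.
async fn download_with_retry(url: &str, attempts: u32) -> Result<Vec<u8>, reqwest::Error> {
    let mut last_err = None;
    for attempt in 1..=attempts {
        match reqwest::get(url).await {
            Ok(response) => match response.bytes().await {
                Ok(body) => return Ok(body.to_vec()),
                Err(err) => last_err = Some(err),
            },
            Err(err) => last_err = Some(err),
        }
        // Simple linear backoff before the next attempt.
        tokio::time::sleep(Duration::from_millis(500 * u64::from(attempt))).await;
    }
    Err(last_err.expect("at least one attempt was made"))
}

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // Placeholder package URL; any of the failing .conda URLs above would do.
    let body = download_with_retry(
        "https://conda.anaconda.org/conda-forge/noarch/typing_extensions-4.9.0-pyha770c72_0.conda",
        3,
    )
    .await?;
    println!("downloaded {} bytes", body.len());
    Ok(())
}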

zen-xu commented 7 months ago

I encountered the same issue: the first download attempt would fail, but the second would succeed.

william-alhant commented 7 months ago

I did a bit more testing. I downloaded packages from the same server with curl and then extracted them with tar/unzip: it is pretty fast, and nothing bad happens. So now I would lean towards an issue in pixi/rattler or reqwest/hyper...

baszalmstra commented 7 months ago

That's weird. I have seen this as well, but I haven't been able to figure out why you get:

DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
DEBUG hyper::proto::h1::conn: error shutting down IO: Broken pipe (os error 32)
DEBUG hyper::client::client: client connection error: error shutting down connection: Broken pipe (os error 32)

william-alhant commented 6 months ago

I also don't know why I get this. Is there any other log or test that could help?

william-alhant commented 6 months ago

I tried to sample a call stack at random during download/install with the pstack tool, on a recompiled debug build, and got this:

Thread 9 (Thread 0x2adc0daf4700 (LWP 18931)):
#0  0x00002adc0bfe3d47 in flock () from /lib64/libc.so.6
#1  0x0000557b03a077d3 in fslock::unix::lock::h7a6e969c7a5dd50c (fd=13) at src/unix.rs:322
#2  0x0000557b03a07f80 in fslock::LockFile::lock::ha7b3f0a00b46253b (self=0x2adc0dae3ef0) at src/lib.rs:167
#3  0x0000557b03a0802a in fslock::LockFile::lock_with_pid::hc0667a16a00d9df6 (self=0x2adc0dae3ef0) at src/lib.rs:208
#4  0x0000557b0353af18 in rattler_networking::authentication_storage::backends::file::FileStorage::lock::heb4a62236029d1cf (self=0x557b07403700) at src/authentication_storage/backends/file.rs:55
#5  0x0000557b0353cfc2 in _$LT$rattler_networking..authentication_storage..backends..file..FileStorage$u20$as$u20$rattler_networking..authentication_storage..StorageBackend$GT$::get::hab54fcc510bbb6a5 (self=0x557b07403700, host=...) at src/authentication_storage/backends/file.rs:102
#6  0x0000557b03534865 in rattler_networking::authentication_storage::storage::AuthenticationStorage::get::h004c48059d2c2053 (self=0x557b075d5b20, host=...) at src/authentication_storage/storage.rs:88
#7  0x0000557b03535dd4 in rattler_networking::authentication_storage::storage::AuthenticationStorage::get_by_url::h2bd86cb3a1935aab (self=0x557b075d5b20, url=...) at src/authentication_storage/storage.rs:136
#8  0x0000557b03526e08 in _$LT$rattler_networking..authentication_middleware..AuthenticationMiddleware$u20$as$u20$reqwest_middleware..middleware..Middleware$GT$::handle::_$u7b$$u7b$closure$u7d$$u7d$::h1923ea75e8e75464 () at src/authentication_middleware.rs:29
#9  0x0000557b03a0bd2e in _$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h5865df82401df94a (self=..., cx=0x2adc0daf21a0) at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125
#10 0x0000557b03a0bd2e in _$LT$core..pin..Pin$LT$P$GT$$u20$as$u20$core..future..future..Future$GT$::poll::h5865df82401df94a (self=..., cx=0x2adc0daf21a0) at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/future/future.rs:125
#11 0x0000557b02923822 in reqwest_middleware::client::ClientWithMiddleware::execute_with_extensions::_$u7b$$u7b$closure$u7d$$u7d$::h95b8bea2d2aa0552 () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-middleware-0.2.4/src/client.rs:160
#12 0x0000557b02923360 in reqwest_middleware::client::RequestBuilder::send::_$u7b$$u7b$closure$u7d$$u7d$::hf89aaa89b10acd4f () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/reqwest-middleware-0.2.4/src/client.rs:314
#13 0x0000557b01fb7018 in rattler_package_streaming::reqwest::tokio::get_reader::_$u7b$$u7b$closure$u7d$$u7d$::hb5c5c819077f717f () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rattler_package_streaming-0.18.0/src/reqwest/tokio.rs:35
#14 0x0000557b01fb769a in rattler_package_streaming::reqwest::tokio::extract_conda::_$u7b$$u7b$closure$u7d$$u7d$::ha861a16ffa2bc992 () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rattler_package_streaming-0.18.0/src/reqwest/tokio.rs:100
#15 0x0000557b01fb894c in rattler_package_streaming::reqwest::tokio::extract::_$u7b$$u7b$closure$u7d$$u7d$::h24e32e6b0fcf6afe () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rattler_package_streaming-0.18.0/src/reqwest/tokio.rs:132
#16 0x0000557b021a2e9f in rattler::package_cache::PackageCache::get_or_fetch_from_url_with_retry::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::hce10fa620d924b68 () at /home/wilalh01/.cargo/registry/src/index.crates.io-6f17d22bba15001f/rattler-0.18.0/src/package_cache.rs:213
[...]

There might be some sort of locking issue.
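
For reference, here is a small standalone program showing the blocking pattern in frames #1-#4 above: it uses the same fslock crate to take an exclusive lock on a file, and LockFile::lock() blocks until the lock is granted. The lock file path is just an illustration, not necessarily the exact file rattler locks.

use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let started = Instant::now();

    // Roughly what frames #1-#4 of the trace do: open a lock file and take an
    // exclusive flock on it. lock() blocks until the lock is granted, which
    // can take seconds when the file lives on a contended or NFS-mounted
    // filesystem.
    let mut lock = fslock::LockFile::open("credentials.json.lock")?;
    lock.lock()?;

    println!("acquired lock after {:?}", started.elapsed());
    lock.unlock()?;
    Ok(())
}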

william-alhant commented 6 months ago

I checked with strace and it seems to confirm my suspicion:

44609 flock(12, LOCK_EX <unfinished ...>
44608 flock(13, LOCK_EX <unfinished ...>
44607 flock(15, LOCK_EX <unfinished ...>
44614 flock(16, LOCK_EX <unfinished ...>
44613 flock(17, LOCK_EX <unfinished ...>
44612 flock(14, LOCK_EX <unfinished ...>
44610 flock(18, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000422>
44610 flock(18, LOCK_EX <unfinished ...>
44611 <... flock resumed> )             = 0 <6.045419>
44611 flock(19, LOCK_UN)                = 0 <0.000261>
44609 <... flock resumed> )             = 0 <6.044859>
44609 flock(12, LOCK_UN <unfinished ...>
44611 flock(19, LOCK_EX|LOCK_NB <unfinished ...>
44609 <... flock resumed> )             = 0 <0.000421>
44611 <... flock resumed> )             = -1 EAGAIN (Resource temporarily unavailable) <0.000267>
44611 flock(19, LOCK_EX <unfinished ...>
44609 flock(12, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000278>
44609 flock(12, LOCK_EX <unfinished ...>
44618 +++ exited with 0 +++
44608 <... flock resumed> )             = 0 <12.140519>
44608 flock(13, LOCK_UN)                = 0 <0.000296>
44608 flock(13, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000275>
44608 flock(13, LOCK_EX <unfinished ...>
44607 <... flock resumed> )             = 0 <15.201206>
44607 flock(15, LOCK_UN)                = 0 <0.000319>
44607 flock(15, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000258>
44607 flock(15, LOCK_EX <unfinished ...>
44614 <... flock resumed> )             = 0 <18.213597>
44614 flock(16, LOCK_UN)                = 0 <0.000410>
44613 <... flock resumed> )             = 0 <18.215742>
44613 flock(17, LOCK_UN)                = 0 <0.000271>
44612 <... flock resumed> )             = 0 <18.217978>
44614 flock(16, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000044>
44614 flock(16, LOCK_EX <unfinished ...>
44612 flock(14, LOCK_UN <unfinished ...>
44613 flock(17, LOCK_EX|LOCK_NB <unfinished ...>
44612 <... flock resumed> )             = 0 <0.000388>
44613 <... flock resumed> )             = -1 EAGAIN (Resource temporarily unavailable) <0.000250>
44613 flock(17, LOCK_EX <unfinished ...>
44612 flock(14, LOCK_EX|LOCK_NB)        = -1 EAGAIN (Resource temporarily unavailable) <0.000196>
44612 flock(14, LOCK_EX <unfinished ...>
44610 <... flock resumed> )             = 0 <30.000863>

Some of the flock syscalls take seconds to finish.

Actually, it might be an issue with my environment. My home directory (where pixi/rattler stores and locks credentials.json) is mounted on NFS; in my environment NFS is often under heavy load, and this would not be the first NFS-related issue I have seen.

Is there a way to use another location for credentials.json? I would like to see what happens when it is on regular local storage; maybe it would solve my issues...

Edit: I tried recompiling with a change that replaces the default AuthenticationStorage with a different path for credentials.json, outside my NFS home directory. It really does seem to solve my issues. The download part of the installation goes really fast now.

I think it might be nice to make the path configurable (unless it is already and I missed it).

william-alhant commented 6 months ago

I just saw #994 is already implementing that part :)

wolfv commented 6 months ago

Hi @william-alhant, thanks for debugging this!

I think we should solve this by reading the entire authentication file once and then keeping it in cache.

wolfv commented 6 months ago

@william-alhant I've implemented a cache here: https://github.com/mamba-org/rattler/pull/573

I hope that will help with the issues you were seeing. In the future we should only read the file once, and lock it only once. File locks are very slow over NFS.
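
Conceptually the caching approach looks something like this simplified sketch (not the actual rattler code; the flat host-to-token map and the JSON shape are assumptions for illustration):

use std::collections::HashMap;
use std::path::Path;
use std::sync::OnceLock;

// Process-wide cache: the credentials file is read at most once per process
// instead of once per request (the real code would likewise take the file
// lock only once).
static AUTH_CACHE: OnceLock<HashMap<String, String>> = OnceLock::new();

fn credentials(path: &Path) -> &'static HashMap<String, String> {
    AUTH_CACHE.get_or_init(|| {
        // Read and parse the file exactly once; every later lookup is served
        // from memory and never touches the (possibly NFS-mounted) file again.
        let text = std::fs::read_to_string(path).unwrap_or_default();
        serde_json::from_str(&text).unwrap_or_default()
    })
}

fn main() {
    let creds = credentials(Path::new("credentials.json"));
    println!("cached {} host entries", creds.len());
}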

baszalmstra commented 1 month ago

Since both suggestions have been implemented, I'm closing this for now.

Feel free to reopen if the issue persists.