nroi / flexo

a central pacman cache
MIT License

serving databases (core.db, extra.db, community.db, ...) #82

Open bernhard-da opened 2 years ago

bernhard-da commented 2 years ago

hi @nroi , first off, thx a lot for providing flexo. it is a really great and very useful piece of software!

i have however experienced one issue; I am working on a fully updated arch-system with the following flexo.toml in which I changed the path of the cache-directory to /storage/...

flexo.toml

cache_directory = "/storage/flexo/pkg"
low_speed_time_secs = 3
connect_timeout = 3000
mirrorlist_fallback_file = "/storage/flexo/state/mirrorlist"
mirrorlist_latency_test_results_file = "/storage/flexo/state/latency_test_results.json"
listen_ip_address = "0.0.0.0"
port = 7878
mirror_selection_method = "auto"
mirrors_predefined = []
num_versions_retain = 1
[mirrors_auto]
    mirrors_status_json_endpoint_fallbacks = [
        "https://raw.githubusercontent.com/nroi/archlinux-mirrors-status-fallback/main/mirrorlist.json",
    ]
    mirrors_blacklist = [ ]
    https_required = true
    ipv4 = true
    ipv6 = false
    max_score = 2.5
    num_mirrors = 8
    mirrors_random_or_sort = "sort"
    timeout = 350
    refresh_latency_tests_after = "8 days"
    allowed_countries = ["DE", "AT", "NL", "CZ"]

serving cached packages to all clients in my lan works flawlessly. however, i see the following entries in the server-log for all enabled repos when I do a pacman -Syu on a client.

log

{timestamp} {server} flexo[8289]: [{timestamp} INFO  flexo] Request served [CACHE MISS]: "core/os/x86_64/core.db"
{timestamp} {server} flexo[8289]: [{timestamp} INFO  flexo] core/os/x86_64/core.db.sig is not available at https://mirror.f4st.host/archlinux/
{timestamp} {server} flexo[8289]: [{timestamp} INFO  flexo] core/os/x86_64/core.db.sig was unavailable at all remote mirrors.
{timestamp} {server} flexo[8289]: [{timestamp} INFO  flexo] Request served [NO PAYLOAD]: "core/os/x86_64/core.db.sig"

i have tried with different mirrors but I cannot get flexo to serve the databases from its cache as well. As I have quite a large number of internal clients the traffic (e.g from community.db) adds up over time. Do I have to set a specific config-setting to make this work or do you have an idea where I could start looking?

nroi commented 2 years ago

Hi @bernhard-da, although the logs may look like there is some kind of problem, Flexo works as intended here: Notice that the messages saying "xxx is not available" and "xxx was unavailable at all remote mirrors" only appear for those files that end with .db.sig, but .db files are served just fine. Flexo does not find db.sig files because they are simply not available at the remote mirror. Have a look at this thread where one of the Arch Linux maintainers explains:

Because the databases are not signed yet. The process for doing that is still being worked out...

So, the current status (even if you don't use Flexo) is that Pacman requests those files, receives a 404 response and then just silently ignores the response.

As I have quite a large number of internal clients the traffic (e.g from community.db) adds up over time.

Files ending with .db are another story: Flexo serves the .db files, but it does not cache them. This is intentional, and it cannot be changed at this moment. If Flexo cached database files like normal files, clients would eventually receive outdated database files. Of course, one could implement some special caching logic for database files and only cache them for a configurable duration (e.g., so you can configure Flexo to serve the database from cache if the cached version is not more than one hour old). But I decided against this because I found that the benefit does not justify the added complexity. The community.db file is currently just ~6 MB, so I never saw an issue in downloading this file a couple of times.

May I ask how fast your internet connection is? Did you notice this behavior because pacman was slow to download the database files, or did you notice this just by inspecting Flexo's logs?

bernhard-da commented 2 years ago

hi @nroi thx a lot for your detailed answer; indeed I was not really wondering about the .sig files but the [CACHE MISS] for the .db files;

your explanation does make perfect sense. to answer your question:

May I ask how fast your internet connection is? 
Did you notice this behavior because pacman was slow to download the database files, or did you notice this just by inspecting Flexo's logs?

yes, i have an unreliable internet-connection which is often slow too (max around 20mbit down) and also my isp throttles speeds after a specific amount of downloaded data; so i realized that pacman was slow (on many clients) downloading the same .db files and I also saw in my monitoring that the (total) size of downloaded .db files was quite high.

nroi commented 2 years ago

i have an unreliable internet-connection which is often slow too (max around 20mbit down) and also my isp throttles speeds after a specific amount of downloaded data; so i realized that pacman was slow (on many clients) downloading the same .db files and I also saw in my monitoring that the (total) size of downloaded .db files was quite high.

I see. I guess there are other users with similar issues. In that case, I might reconsider whether it makes sense to implement some caching mechanism for database files. This should probably be disabled by default, and the duration after which locally stored database files are considered stale and redownloaded should be configurable.
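
A minimal sketch of such a staleness check, written in Python purely for illustration and not taken from Flexo's code. The function name is_cached_db_fresh and the setting max_age_secs are hypothetical; no such option exists in flexo.toml. Passing None models the "disabled by default" behaviour, in which database files are never served from cache.

```python
import time
from typing import Optional


def is_cached_db_fresh(
    cached_at: float,
    max_age_secs: Optional[float],
    now: Optional[float] = None,
) -> bool:
    """Decide whether a cached database file may still be served.

    cached_at     -- Unix time at which the file was stored in the cache
    max_age_secs  -- hypothetical setting; None means caching is disabled
    now           -- current Unix time (injectable to keep the check testable)
    """
    if max_age_secs is None:
        # Caching of database files is disabled: always go to the mirror.
        return False
    if now is None:
        now = time.time()
    return (now - cached_at) <= max_age_secs


ONE_HOUR = 3600
# Cached 30 minutes ago with a one-hour limit: still fresh.
print(is_cached_db_fresh(cached_at=0, max_age_secs=ONE_HOUR, now=1800))
# Cached two hours ago: stale, so the file would be downloaded again.
print(is_cached_db_fresh(cached_at=0, max_age_secs=ONE_HOUR, now=7200))
# Feature disabled: never served from cache.
print(is_cached_db_fresh(cached_at=0, max_age_secs=None, now=1800))
```

The trade-off of a purely time-based check is that clients can receive a database that is up to max_age_secs out of date.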

But don't expect this to be implemented very soon: I'm currently prioritizing changes that improve code maintainability over new features.

bernhard-da commented 2 years ago

@nroi fair enough. thx again for your comments and working on flexo :)

Zebradil commented 2 years ago

I also see an opportunity for improvement here. Maybe it makes sense to check how pacman handles this, because, when I don't use flexo, database files are cached somehow.

sudo pacman -Sy
:: Synchronizing package databases...
 core is up to date
 extra is up to date
 community is up to date
 multilib is up to date

But when I use flexo, the database files are always being downloaded.

I can't check how pacman works right now, but I'll try to figure this out later.

nroi commented 2 years ago

@Zebradil Thanks for pointing this out. pacman sends the If-Modified-Since header, for example:

If-Modified-Since: Sun, 30 Jan 2022 10:17:26 GMT

Which means that the mirror may respond with a 304 Not Modified instead of sending the entire payload.

The timestamp seems to be set according to the Modify or Change timestamp of the file in /var/lib/pacman/sync. If you run sudo touch -m /var/lib/pacman/sync/core.db, then pacman sends a new If-Modified-Since timestamp.

It makes sense for flexo to behave like pacman, so this is something that should change in flexo.
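
To illustrate the exchange described above, here is a rough Python sketch (not pacman's or Flexo's actual code). It formats a file modification time as an If-Modified-Since value and shows how a mirror could choose between 200 and 304. The function names are made up for this example, and the timestamp is hard-coded to match the header shown above; a real client would read the mtime of the file in /var/lib/pacman/sync instead.

```python
from email.utils import formatdate, parsedate_to_datetime


def if_modified_since_value(local_mtime: float) -> str:
    """Format a Unix mtime as an HTTP date, e.g. for If-Modified-Since."""
    return formatdate(local_mtime, usegmt=True)


def mirror_status(if_modified_since: str, remote_mtime: float) -> int:
    """Return 200 if the mirror's file is newer than the client's copy, else 304."""
    client_time = parsedate_to_datetime(if_modified_since).timestamp()
    # HTTP dates only have one-second resolution, so compare whole seconds.
    return 200 if int(remote_mtime) > int(client_time) else 304


# 1643537846 is the Unix time for Sun, 30 Jan 2022 10:17:26 GMT.
local_mtime = 1643537846
header = if_modified_since_value(local_mtime)
print("If-Modified-Since:", header)

# Mirror has the same version: nothing to download.
print(mirror_status(header, remote_mtime=local_mtime))
# Mirror has a version that is one minute newer: full payload is sent.
print(mirror_status(header, remote_mtime=local_mtime + 60))
```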

nroi commented 1 year ago

Feature draft

This post is intended to summarize all information required to implement this feature, as well as information about what value this feature adds to Flexo.

Problem description:

Database files are currently not cached. With a large number of clients, this can add up in traffic. This is relevant especially for users with a slow internet connection or an ISP that throttles speed after a given amount of data has been downloaded (see also: https://github.com/nroi/flexo/issues/82#issuecomment-974785049).

Background information:

Originally, no caching of database files was planned, so that Flexo would never serve outdated files. However, it turns out that it should actually be possible to implement some kind of caching: Consider the case when pacman is used without Flexo. When pacman requests a database file, it sends the If-Modified-Since header. The remote mirror then either serves this file as usual if the database file on the remote mirror is more recent than the timestamp in the header, or it just returns 304 Not Modified if no more up-to-date file is available. We therefore aim to implement something comparable for Flexo: If a new database file is available at the remote mirror, then Flexo should always serve this file instead of a stale, cached version. On the other hand, if Flexo already has the database file in a version that is more recent or just as recent as the version on the remote mirror, then no new download from a remote mirror should be required.
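
The goal stated above can be written down as a small decision function. This is only an illustrative Python sketch under assumptions, not the proposed implementation: the name decide and the three outcome strings are hypothetical, all timestamps are Unix seconds, and it assumes Flexo can learn the modification time of the file on the remote mirror cheaply (for example via a conditional request of its own).

```python
from typing import Optional


def decide(
    cached_mtime: Optional[int],
    remote_mtime: int,
    client_mtime: Optional[int],
) -> str:
    """Pick one of "not_modified", "serve_cached" or "download".

    cached_mtime -- mtime of the copy in the cache, None if not cached
    remote_mtime -- mtime of the file on the remote mirror (assumed known)
    client_mtime -- timestamp from the client's If-Modified-Since header,
                    None if the header was not sent
    """
    cache_is_current = cached_mtime is not None and cached_mtime >= remote_mtime
    newest_available = cached_mtime if cache_is_current else remote_mtime

    # The client already has the newest version: answer 304, no download.
    if client_mtime is not None and client_mtime >= newest_available:
        return "not_modified"
    # The cached copy is at least as recent as the mirror's: serve it.
    if cache_is_current:
        return "serve_cached"
    # Otherwise the mirror has something newer (or nothing is cached yet).
    return "download"


print(decide(cached_mtime=None, remote_mtime=100, client_mtime=None))
print(decide(cached_mtime=100, remote_mtime=100, client_mtime=50))
print(decide(cached_mtime=100, remote_mtime=200, client_mtime=50))
print(decide(cached_mtime=100, remote_mtime=100, client_mtime=100))
```

With this ordering, a stale cached version is never served, and a client that is already up to date triggers no download at all.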

Proposed solution: