> Download job
>
> Since the download is decoupled from the metadata processing, we can define multiple download strategies to try to speed up the download of the most important packages.
> We can add two new fields, and modify the constraint in one:
> * Add a field for download priority, as an integer
> * Add a status field that indicates whether the package is already downloaded
> * ~Change the package path field to allow nulls~
>
> As soon as each package is inserted in the database, it should be assigned a download priority number. This would allow us to control which packages should be downloaded first.
> The UI should show which packages are already downloaded for a channel and how many are missing. It should also allow forcing a download (or increasing the download priority of a package).
Is the download priority an attribute of a package? To me that sounds like we're complecting two different things. A package is just that, a package. When to download it is a different type of information, and there can be more than one answer to the question "when do we want to download package foo?", depending on the context in which we ask the question.
For example, packages can be added to different channels. Let's imagine that `foo-2.1` is in channel A and channel B. Let's say in channel A, package `foo-2.1` has a moderate priority because there are also versions `foo-3.0` and `foo-4.0`, and channel A wants to download latest first. Channel B also wants to download latest first, but imagine in channel B we only have `foo<3.0` (e.g. `foo-2.0` and `foo-2.1`). Suddenly, `foo-2.1` is high-priority for channel B. To me it sounds like a nightmare to organize all packages with global priorities. I think a better approach is to decouple the priority from the package, and add it to specific download jobs.

For example, a missing package requested by a client needs to be downloaded asap, irrespective of any priority decided at the time of syncing metadata. The same would be true if we had a force-download button on the WebUI. What follows is that it needs to be possible to send a download job with a priority to the downloader (e.g. `{"priority": "10", "url": "https://download.opensuse.org/repositories/systemsmanagement:/Uyuni:/Stable:/openSUSE_Leap_15-Uyuni-Client-Tools/openSUSE_Leap_15.0/noarch/ansible.rpm", "local_path": "/var/spacewalk/packages/1/.../ansible.rpm"}`).
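For illustration, a minimal sketch of such a job as a Python structure; the `DownloadJob` name and its fields simply mirror the JSON example above and are not an existing Uyuni API:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DownloadJob:
    priority: int     # higher value = more urgent (assumption; the strategies below use 0 for "do not download")
    url: str
    local_path: str

    def to_json(self) -> str:
        """Serialize the job so it can be handed to the downloader."""
        return json.dumps(asdict(self))


# Example: an urgent job created because a client requested a missing package.
job = DownloadJob(
    priority=10,
    url=(
        "https://download.opensuse.org/repositories/systemsmanagement:/Uyuni:/Stable:"
        "/openSUSE_Leap_15-Uyuni-Client-Tools/openSUSE_Leap_15.0/noarch/ansible.rpm"
    ),
    local_path="/var/spacewalk/packages/1/.../ansible.rpm",
)
print(job.to_json())
```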
That leads me to the downloader itself. My current idea is to have two high-level components: one that receives download jobs and takes care of the download, let's call it the downloader. The second component monitors packages that have not been downloaded yet and creates download jobs for them, let's call it the watcher.
The downloader has an internal priority queue, detects duplicate download jobs and, if needed, re-prioritizes them. It controls the download threads that do the actual downloading. The watcher is concerned with the different strategies, e.g. the ones you describe.
With these two separate components, we open the door for other components to take the place of the watcher. E.g. Tomcat can also send a download job to the downloader, e.g. when a user clicks a button or calls a specific API endpoint.
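To make that concrete, here is a rough sketch of the downloader's queueing logic (priority queue, de-duplication by URL, re-prioritization). The class and method names are invented, and download threads, persistence and error handling are left out:

```python
import heapq
import threading


class Downloader:
    """Sketch of the downloader: priority queue plus de-duplication by URL.

    Convention taken from this issue: a higher number means a more urgent
    download, and 0 means "not scheduled for eager download".
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._heap = []     # min-heap of (-priority, url, local_path)
        self._wanted = {}   # url -> highest priority requested so far

    def submit(self, priority: int, url: str, local_path: str) -> None:
        """Queue a job; a duplicate URL only re-prioritizes if it is more urgent."""
        with self._lock:
            current = self._wanted.get(url)
            if current is not None and current >= priority:
                return                              # duplicate, not more urgent: drop it
            self._wanted[url] = priority            # new job or bumped priority
            heapq.heappush(self._heap, (-priority, url, local_path))

    def next_job(self):
        """Pop the most urgent job, skipping stale entries left by re-prioritization."""
        with self._lock:
            while self._heap:
                neg_priority, url, local_path = heapq.heappop(self._heap)
                if self._wanted.get(url) == -neg_priority:
                    del self._wanted[url]
                    return -neg_priority, url, local_path
            return None     # queue is empty
```

Download worker threads would call `next_job()` in a loop and perform the actual HTTP transfer, while the watcher (or Tomcat) only ever calls `submit()`.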
> Open questions
>
> How to deal with missing packages
>
> If a package is not yet synchronized, we would need to define the strategy to deal with it. The simplest approach is to return a 404 when a client tries to access a package that is not synchronized yet, and schedule the download of that package as fast as possible from the upstream repository.
Maybe we can have a timeout before we return 404, i.e. request comes in -> package is missing -> create download job -> check periodically whether the package exists, until a timeout -> serve the package / return 404. That only makes sense if the zypper/dnf/apt package managers wait for some time.
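A rough sketch of that flow, assuming the downloader sketched above and hypothetical `serve_file()`/`not_found()` helpers standing in for whatever actually serves packages to the client:

```python
import os
import time


def handle_package_request(downloader, url, local_path, timeout=30.0, poll=1.0):
    """Serve the package if it appears within `timeout` seconds, else return 404."""
    if os.path.exists(local_path):
        return serve_file(local_path)

    # Package missing: queue an urgent download job and wait for the file to land.
    downloader.submit(priority=10, url=url, local_path=local_path)

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(local_path):
            return serve_file(local_path)
        time.sleep(poll)

    # The client has likely given up by now; the job stays queued, so the
    # package should be available on the next request.
    return not_found()
```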
> A more user-friendly approach, but also the most difficult one. Possible problems:
>
> * Multiple minions requesting the package at the same time. We don't want to start the download multiple times, so we need a central control mechanism.
See above: the downloader could be a daemon capable of de-duplicating requests.
> * During the download, one thread would be blocked on the server side, waiting for the download to finish. This thread will probably also need a database connection. This can render the server unusable.
This depends on how much information we want to save in the database. If we immediately return 404, the database connection can be used to read the information required to create a download job and closed right after. Of course, if we want to write back to the database after the download finishes, we'll need another connection later, or keep the connection alive the whole time.
> Add a status column to `rhnchannel`
>
> Should we add a new column to `rhnchannel` to control when the first metadata sync has run, and when all packages from the download strategy have finished downloading? This field could then be used in channel assignment control, to make sure the channel cannot be fully used until the first reposync has run.
Such an attribute might be useful for CLM, where we need to ensure that all packages are present when we freeze projects. We can't have "unfulfilled futures" in promoted CLM environments.
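As a sketch of how such a flag could gate promotion, assuming a proposed `sync_status` column on `rhnchannel` and a proposed `downloaded` flag on the package row (neither exists today; the query is illustrative only):

```python
def channel_is_ready(cursor, channel_id) -> bool:
    """Ready = first metadata sync finished and no channel package is still missing."""
    cursor.execute("SELECT sync_status FROM rhnchannel WHERE id = %s", (channel_id,))
    row = cursor.fetchone()
    if row is None or row[0] != 'done':
        return False
    # Count packages of the channel whose proposed "downloaded" flag is not set yet.
    cursor.execute(
        """SELECT COUNT(*)
             FROM rhnchannelpackage cp
             JOIN rhnpackage p ON p.id = cp.package_id
            WHERE cp.channel_id = %s AND p.downloaded IS NOT TRUE""",
        (channel_id,),
    )
    return cursor.fetchone()[0] == 0


def promote_clm_environment(cursor, channel_id):
    """Refuse to freeze/promote a CLM project while packages are still missing."""
    if not channel_is_ready(cursor, channel_id):
        raise RuntimeError("channel still has missing packages; promotion blocked")
    # ... proceed with promotion (omitted)
```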
Another open question is what happens when a package can no longer be downloaded from the upstream repository.
That can happen when upstream repos are rebuilt and they drop old versions (e.g. OBS repos, but also Alma/Rocky etc.). IMO the downloader could return an error, and either the watcher or Tomcat or whatever process initiates the download needs to handle it, e.g. by triggering a new lzreposync metadata refresh.
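A sketch of that error path, assuming the downloader reports "gone upstream" separately from transient failures; `trigger_metadata_refresh()` is a placeholder, not an existing lzreposync entry point:

```python
import requests


class PackageGoneError(Exception):
    """Raised when the upstream repository no longer serves the package."""


def download(url, local_path, timeout=60):
    response = requests.get(url, stream=True, timeout=timeout)
    if response.status_code in (404, 410):
        raise PackageGoneError(url)
    response.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=1 << 20):
            f.write(chunk)


def run_job(job, trigger_metadata_refresh):
    try:
        download(job["url"], job["local_path"])
    except PackageGoneError:
        # The upstream repo was rebuilt and dropped this version: the stored
        # metadata is stale, so ask for a fresh metadata run of the channel.
        trigger_metadata_refresh(job)
```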
That's all I can think of so far, and I hope these ideas help with the discussion.
---

This issue aims to share some ideas about the lazy repo-sync project that is being developed in the context of Google Summer of Code.
This is my personal view, and it should not be taken as hard requirements. What will be implemented should be part of an RFC.
The main goal of this feature is to decouple the metadata processing from the package download.
To achieve it, it will use the available repository metadata, process it, and insert all the channel and package data into the database, without downloading any packages. This will skip importing changelogs and some of the advanced properties of the package file lists. After that process, the SUMA package metadata can be generated and users can even start to prepare CLM projects, since they only depend on the channel metadata saved in the database.
Download can be done by a new scalable process (a Taskomatic job, or something similar to what we have for CoCo attestation).
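To illustrate the decoupling, here is a rough sketch of a metadata-only sync step that parses an rpm-md `primary.xml.gz` and produces package rows without fetching any RPM; `insert_package()` and the overall flow are placeholders, not the real lzreposync code:

```python
import gzip
import urllib.request
import xml.etree.ElementTree as ET

NS = {"c": "http://linux.duke.edu/metadata/common"}


def iter_packages(primary_xml_gz_url):
    """Yield (name, epoch, version, release, arch, location) without downloading RPMs."""
    with urllib.request.urlopen(primary_xml_gz_url) as resp:
        tree = ET.parse(gzip.GzipFile(fileobj=resp))
    for pkg in tree.getroot().findall("c:package", NS):
        name = pkg.findtext("c:name", namespaces=NS)
        arch = pkg.findtext("c:arch", namespaces=NS)
        ver = pkg.find("c:version", NS).attrib
        location = pkg.find("c:location", NS).attrib["href"]
        yield name, ver.get("epoch"), ver["ver"], ver["rel"], arch, location


def lazy_sync(channel_id, primary_xml_gz_url, insert_package):
    # Store only the metadata; the actual RPM is fetched later by the downloader.
    for row in iter_packages(primary_xml_gz_url):
        insert_package(channel_id, *row, downloaded=False)
```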
Download job
Since the download is decoupled from the metadata processing, we can define multiple download strategies to try to speed up the download of the most important packages.
We can add two new fields, and modify the constraint in one:
* Add a field for download priority, as an integer
* Add a status field that indicates whether the package is already downloaded
* ~Change the package path field to allow nulls~

As soon as each package is inserted in the database, it should be assigned a download priority number. This would allow us to control which packages should be downloaded first.
The UI should show which packages are already downloaded for a channel and how many are missing. It should also allow forcing a download (or increasing the download priority of a package).
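A sketch of the two proposed per-package fields, with names chosen only for illustration (in practice these would be new columns on the package table):

```python
from dataclasses import dataclass
from enum import Enum


class DownloadStatus(Enum):
    MISSING = "missing"        # metadata imported, RPM not on disk yet
    DOWNLOADED = "downloaded"  # RPM present in /var/spacewalk


@dataclass
class PackageRow:
    name: str
    version: str
    download_priority: int = 0                       # 0 = not scheduled for eager download
    download_status: DownloadStatus = DownloadStatus.MISSING
```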
Next are some examples of download priority strategies that could or should be available; a sketch of how they might map to per-package priorities follows them.
Strategy: all
Download all packages with the same priority, except the client tools channels, which should have a higher priority.
Strategy: Latest first
Download all packages, but the client tools and the latest version of each package should be downloaded first.
Strategy: Latest only
Only download the latest version of each package. All other packages should have their download priority set to 0. This kind of strategy can only be implemented once the issue of missing package downloads is handled.
Strategy: On-demand
Only download packages when a minion asks for them. All packages should have their download priority set to 0. This kind of strategy can only be implemented once the issue of missing package downloads is handled.
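A sketch of how these strategies could translate into per-package priorities, reusing the `PackageRow` fields sketched above; the numeric values and function name are arbitrary choices, not part of the proposal:

```python
def assign_priority(pkg, strategy, is_latest, is_client_tools):
    """Set pkg.download_priority according to the chosen strategy."""
    if strategy == "all":
        pkg.download_priority = 100 if is_client_tools else 50
    elif strategy == "latest_first":
        pkg.download_priority = 100 if (is_client_tools or is_latest) else 50
    elif strategy == "latest_only":
        pkg.download_priority = 50 if is_latest else 0   # 0 = do not download eagerly
    elif strategy == "on_demand":
        pkg.download_priority = 0                        # only fetched when a minion asks
    return pkg
```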
Open questions
How to deal with missing packages
If a package is not yet synchronized, we would need to define the strategy to deal with it. The simplest approach is to return a 404 when a client tries to access a package that is not synchronized yet, and schedule the download of that package as fast as possible from the upstream repository.
A more user-friendly approach, but also the most difficult one. Possible problems:
* Multiple minions requesting the package at the same time. We don't want to start the download multiple times, so we need a central control mechanism.
* During the download, one thread would be blocked on the server side, waiting for the download to finish. This thread will probably also need a database connection. This can render the server unusable.
Add a status column to `rhnchannel`
Should we add a new column to `rhnchannel` to control when the first metadata sync has run, and when all packages from the download strategy have finished downloading? This field could then be used in channel assignment control, to make sure the channel cannot be fully used until the first reposync has run.