AMDmi3 opened this issue 4 years ago
Further development that this change would unlock, even with partial implementation:
TODO: decide on handling circular data dependencies and asynchronous updates. For instance, links:
Other cases include:
For the record, the first phase of delta updates is currently being deployed. The development resulted in major optimizations in other places, namely in the repoproc deserializer, the `repositories` table update (a bad SQL execution plan led to 40x overhead, 25% of overall database update time), and the `repo_metapackages` table update (thrashing due to inserting unordered items led to excess I/O and several extra minutes per update).
Unfortunately, some inevitable pessimizations were introduced too, which affect update time when everything is updated (such as the first update on an empty database). The extra overhead is about 20%: +2 minutes for hashing (the current package hash implementation, which involves JSON, may be improved) and +2 minutes for extra database queries. Of course, these are outweighed by the partial update performance improvement (~2.5x currently).
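As a minimal sketch of the JSON-based hashing scheme mentioned above (not Repology's actual implementation, and the field names are illustrative): serializing a package's fields as canonical JSON, with sorted keys and no whitespace, yields a hash that is stable under field reordering.

```python
import hashlib
import json

def package_hash(package: dict) -> str:
    """Return a stable hash of a package's fields.

    Canonical JSON (sorted keys, compact separators) makes the digest
    independent of field insertion order.
    """
    canonical = json.dumps(package, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Reordering fields does not change the hash
a = package_hash({"name": "zsh", "version": "5.9"})
b = package_hash({"version": "5.9", "name": "zsh"})
assert a == b
```

The JSON round-trip is a likely source of the extra hashing time; a faster canonical binary serialization could replace it without changing the scheme.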
Status update: the first phase of delta updates brought the update cycle to under 50 minutes (most of the time is spent in fetch and parse).
The next step would be performance testing of the update process, which has to fetch the previous state of packages from the database; there seems to be no way to avoid this if we want to generate history properly. If it turns out to be too slow, stored procedures may be investigated.
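To illustrate why the previous state is needed: history events can only be produced by diffing the old and new states. A toy sketch, with hypothetical event names and with packages reduced to a `{name: version}` map (the real process would compare full package records):

```python
def diff_packages(old: dict, new: dict):
    """Yield history events from previous vs. current package state."""
    for name in new.keys() - old.keys():
        yield ("added", name, new[name])
    for name in old.keys() - new.keys():
        yield ("removed", name, old[name])
    for name in old.keys() & new.keys():
        if old[name] != new[name]:
            yield ("version_changed", name, new[name])

events = sorted(diff_packages({"foo": "1.0", "bar": "2.0"},
                              {"foo": "1.1", "baz": "3.0"}))
```

Without the `old` map fetched from the database, the "removed" and "version_changed" events cannot be derived at all, which is why the fetch looks unavoidable.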
After that, we'll need to convert derived table updates without introducing a huge pessimization for the case when most packages change. There are two distinct kinds of such derived tables:
The former impose no limitations and may be updated in parallel with packages, at the same granularity. The latter lead to write amplification and O(n²) patterns when done naively, e.g. when many project updates all modify a single maintainer/repository/link.
The most important case is when e.g. maintainer IDs need to be known before updating binding tables or the packages themselves (e.g. when we switch to storing maintainer/link IDs instead of verbatim text).
The solution here is to cache this data in memory and avoid updating referenced objects multiple times. That is, store the maintainer→ID mapping in memory, and use it both to look up IDs and to decide whether maintainers need to be created. If it grows too big for memory, we can flush it periodically (completely, or better, only the least recently used entries).
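The caching scheme above can be sketched as a small LRU-evicting cache; class and parameter names here are hypothetical, and `create` stands in for whatever call inserts the row and returns its ID:

```python
from collections import OrderedDict

class IdCache:
    """In-memory name -> id cache with least-recently-used flushing.

    `create` is a caller-supplied function that inserts a new row into
    the database and returns its id; the cache decides whether it needs
    to be called at all.
    """
    def __init__(self, create, maxsize=100_000):
        self._create = create
        self._maxsize = maxsize
        self._entries = OrderedDict()

    def get_id(self, name):
        if name in self._entries:
            self._entries.move_to_end(name)    # mark as recently used
            return self._entries[name]
        id_ = self._create(name)               # insert into DB, get id
        self._entries[name] = id_
        while len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # flush least recently used
        return id_

# usage sketch with a fake "insert" returning sequential ids
calls = []
def insert_maintainer(name):
    calls.append(name)
    return len(calls)

cache = IdCache(insert_maintainer, maxsize=2)
assert cache.get_id("foo@example.com") == 1
assert cache.get_id("foo@example.com") == 1   # cached, no second insert
assert len(calls) == 1
```

Evicting only LRU entries (rather than flushing everything) keeps hot maintainers cached across many project updates, which is exactly the case that otherwise causes repeated updates of the same referenced object.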
Regarding tables indexed by project name, the important thing is not to duplicate code between bulk and discrete updates. This may be achieved with a temporary table which lives for the duration of the update transaction and holds the names of updated projects. With it, we can use the same bulk queries, but limit them to a subset of projects. If the subset is big enough, it may be ignored and a plain bulk update used.
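A sketch of that single-query scheme, using sqlite and an illustrative schema (the table and column names are not Repology's actual ones): the temporary table holds changed project names, and the same aggregate query is limited to it by a semi-join; dropping the semi-join turns it back into a full bulk update.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE packages (effname TEXT, version TEXT);
    CREATE TABLE metapackages (effname TEXT PRIMARY KEY, num_packages INTEGER);
    INSERT INTO packages VALUES ('foo', '1.0'), ('foo', '1.1'), ('bar', '2.0');
    INSERT INTO metapackages VALUES ('foo', 0), ('bar', 0);
""")

# lives for the duration of the update transaction
db.execute("CREATE TEMPORARY TABLE changed_projects (effname TEXT PRIMARY KEY)")
db.execute("INSERT INTO changed_projects VALUES ('foo')")

# the same bulk query, limited to the changed subset
db.execute("""
    UPDATE metapackages SET num_packages =
        (SELECT count(*) FROM packages
         WHERE packages.effname = metapackages.effname)
    WHERE effname IN (SELECT effname FROM changed_projects)
""")

rows = dict(db.execute("SELECT effname, num_packages FROM metapackages"))
assert rows == {"foo": 2, "bar": 0}
```

Only `foo` is recomputed; `bar` is left untouched, while the aggregate query itself is shared between the bulk and partial paths.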
Binding tables done: -30% database update time. The next big time consumer is `url_relations`. There are ways to optimize both `url_relations` table construction and the updating of the `has_related` flag in `metapackages`.
By the way, the `metapackages` table is around 4x its minimal size (as can be seen after a vacuum), which is caused by multiple consecutive complete updates, each leaving N dead copies of every tuple. Apart from only updating affected projects, another optimization is to add conditions to `UPDATE` queries so they do not needlessly rewrite each row on each pass.
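In PostgreSQL, an `UPDATE` rewrites a row (leaving a dead tuple behind) even when the new value equals the old one, so guarding the query to skip already-correct rows is what prevents the bloat. A sketch with an illustrative schema, using sqlite purely to count the rows each variant touches:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE metapackages (effname TEXT PRIMARY KEY, has_related INTEGER)")
db.executemany("INSERT INTO metapackages VALUES (?, ?)",
               [("foo", 1), ("bar", 0), ("baz", 0)])

# guarded: touches only the rows whose value actually changes
guarded = db.execute(
    "UPDATE metapackages SET has_related = 1 "
    "WHERE effname IN ('foo', 'bar') AND NOT has_related").rowcount

# naive: rewrites every matching row, even those already correct
naive = db.execute(
    "UPDATE metapackages SET has_related = 1 "
    "WHERE effname IN ('foo', 'bar')").rowcount

assert (guarded, naive) == (1, 2)
```

The guarded variant touched one row (`bar`) where the naive one touched two; on PostgreSQL that difference translates directly into fewer dead tuples per pass.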
Another shortcoming of the current database schema is the use of `last_seen` fields (in `metapackages`, `maintainers`, `links`). For projects and maintainers, there should instead be an `orphaned_at` field which is set when all related packages disappear. For links, there could be a similar flag which we update weekly by discovering all unreferenced links. After some time, these entries could be removed.
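The `orphaned_at` scheme could look as follows; this is a sketch against a hypothetical schema, not the actual one. The stamp is set when a maintainer loses its last package, cleared if packages reappear, and rows past a grace period can then be deleted:

```python
import sqlite3
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE maintainers (name TEXT PRIMARY KEY, orphaned_at TEXT);
    CREATE TABLE packages (maintainer TEXT);
    INSERT INTO maintainers VALUES ('alice@example.com', NULL),
                                   ('bob@example.com', NULL);
    INSERT INTO packages VALUES ('alice@example.com');
""")

now = datetime.now(timezone.utc).isoformat()

# mark maintainers whose packages have all disappeared
db.execute("""
    UPDATE maintainers SET orphaned_at = ?
    WHERE orphaned_at IS NULL
      AND NOT EXISTS (SELECT 1 FROM packages
                      WHERE packages.maintainer = maintainers.name)
""", (now,))

# un-orphan if packages reappear later
db.execute("""
    UPDATE maintainers SET orphaned_at = NULL
    WHERE orphaned_at IS NOT NULL
      AND EXISTS (SELECT 1 FROM packages
                  WHERE packages.maintainer = maintainers.name)
""")

orphans = [r[0] for r in db.execute(
    "SELECT name FROM maintainers WHERE orphaned_at IS NOT NULL")]
assert orphans == ["bob@example.com"]
```

Unlike `last_seen`, which must be rewritten on every update for every live row, `orphaned_at` is only written on the rare transition into or out of the orphaned state.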
Delta updates have been planned for years, but are still not implemented, as this has proven to be too huge a task to do in one go. We need a new plan to implement it gradually, in smaller steps. Here it is:
- Move `database_update` in `repology_update.py` into a dedicated module
- Split the `update_finish` query into a lot of smaller subtasks. This would also help to profile them
- `maintainer_repo_metapackages` table and SQL code for its update, and drop version information from the projects/metapackages table
- (… `packages` table to get the number of packages)
- (… `links` table first)