openlawlibrary / taf

The Archive Framework
GNU Affero General Public License v3.0
10 stars 7 forks source link

fix: Make Updater Faster #434

Closed ronantakizawa closed 1 month ago

ronantakizawa commented 1 month ago

Description (e.g. "Related to ...", etc.)

Make the Updater faster by running the first 2 functions concurrently.

Code review checklist (for code reviewer to complete)

dgreisen commented 1 month ago

To elaborate on Renata's comment, my expectation is that we should be able to get an "empty" update to take not much more time than a git fetch. I.e. Well under a second.

Possible reasons I can imagine it is so slow:

renatav commented 1 month ago

Repositories which are cloned to temp are bate repositories, but we fetch all branches, even thought that is not necessary. What'd I'd like us to do first is determine how much time is spent running individual steps. We don't mind you taking a stab at any of the discovered issues if you want to, but for now, the idea is to just figure out what we need to focus on the most. I was planning to do some work on the updater this month

ronantakizawa commented 1 month ago

Here's the runtime of each function after testing the updater on one of the repos with changed events and unchanged events (mohicanlaw):

Changed

(base) ronantakizawa@temps-MacBook-Pro-6 law % taf repo update Updating repository

Updating repository mohicanlaw/law

Checking if local repositories are clean...

Step check_if_local_repositories_clean took 0.00 seconds Step set_existing_repositories took 0.00 seconds

Cloning repository and running TUF updater...

Validation repository: cloning repository

Validation repository: trying to clone from git@github.com:mohicanlaw/law.git

Validation repository: successfully cloned from git@github.com:mohicanlaw/law.git

Running TUF validation of the authentication repository...

WARNING: Could not validate authentication repository at revision 3b42bb70de3e62658d1c9b01e11bbac621932eba due to error: Final root.json is expired

Successfully validated authentication repository mohicanlaw/law

Step clone_remote_and_run_tuf_updater took 8.01 seconds

Validating out of band commit and update type

Step validate_out_of_band_and_update_type took 0.00 seconds

Cloning or updating user's authentication repository...

Step clone_or_fetch_users_auth_repo took 2.44 seconds Step load_target_repositories took 0.13 seconds

Finished cloning target repositories

Step clone_target_repositories_to_temp took 16.77 seconds

Validating initial state of target repositories...

Checking initial state of repositories

Step determine_start_commits took 0.01 seconds Step get_targets_data_from_auth_repo took 0.00 seconds Step check_if_local_target_repositories_clean took 0.39 seconds Step get_target_repositories_commits took 8.34 seconds

Validating target repositories...

Validation of target repositories finished

Step validate_target_repositories took 0.00 seconds Step validate_and_set_additional_commits_of_target_repositories took 0.00 seconds Step update_users_target_repositories took 0.05 seconds

Merging commits into target repositories...

Step merge_commits took 0.33 seconds Step remove_temp_repositories took 0.01 seconds Step set_target_repositories_data took 0.00 seconds

Repository mohicanlaw/law-xml: found commits succeeding the last authenticated commit on branch master: f0d0cf4225. These commits were not merged into master

Step print_additional_commits took 0.00 seconds

mohicanlaw/law Merging commit 3b42bb70de into branch master

Called LifecycleStage.REPO handler. Event: changed

WARNING: Config file /Users/ronantakizawa/Documents/projects/nyu/407/config.json not found. Scripts might not be executed successfully

WARNING: No config data. Scripts might not be executed successfully

Called LifecycleStage.UPDATE handler. Event: changed

completed in 36 seconds

Unchanged

(base) ronantakizawa@temps-MacBook-Pro-6 law % taf repo update --profile Profiling... Updating repository

Updating repository mohicanlaw/law

Checking if local repositories are clean...

Step check_if_local_repositories_clean took 0.00 seconds Step set_existing_repositories took 0.00 seconds

Cloning repository and running TUF updater...

Validation repository: cloning repository

Validation repository: trying to clone from git@github.com:mohicanlaw/law.git

Validation repository: successfully cloned from git@github.com:mohicanlaw/law.git

Running TUF validation of the authentication repository...

Successfully validated authentication repository mohicanlaw/law

Step clone_remote_and_run_tuf_updater took 8.01 seconds

Validating out of band commit and update type

Step validate_out_of_band_and_update_type took 0.00 seconds

Cloning or updating user's authentication repository...

Step clone_or_fetch_users_auth_repo took 3.40 seconds Step load_target_repositories took 0.13 seconds

Finished cloning target repositories

Step clone_target_repositories_to_temp took 18.68 seconds

Validating initial state of target repositories...

Checking initial state of repositories

Step determine_start_commits took 0.02 seconds Step get_targets_data_from_auth_repo took 0.00 seconds Step check_if_local_target_repositories_clean took 0.27 seconds Step get_target_repositories_commits took 8.29 seconds

Validating target repositories...

Validation of target repositories finished

Step validate_target_repositories took 0.00 seconds Step validate_and_set_additional_commits_of_target_repositories took 0.00 seconds Step update_users_target_repositories took 0.05 seconds

Merging commits into target repositories...

Step merge_commits took 0.33 seconds Step remove_temp_repositories took 0.01 seconds Step set_target_repositories_data took 0.00 seconds

Repository mohicanlaw/law-xml: found commits succeeding the last authenticated commit on branch master: f0d0cf4225. These commits were not merged into master

Step print_additional_commits took 0.00 seconds

Called LifecycleStage.REPO handler. Event: unchanged

WARNING: Config file /Users/ronantakizawa/Documents/projects/nyu/407/config.json not found. Scripts might not be executed successfully

WARNING: No config data. Scripts might not be executed successfully

Called LifecycleStage.UPDATE handler. Event: unchanged

completed in 39 seconds Profiling completed

ronantakizawa commented 1 month ago

clone_remote_and_run_tuf_updater, clone_remote_and_run_tuf_updater, and get_target_repositories_commits have the longest runtimes.

Adding parallel processing within each function significantly reduces runtime.

dgreisen commented 1 month ago

for an "unchanged" update I'd expect something like this:

Some big items that I don't think we need to be doing:

what is this doing?

to be clear - parallelization helps a ton when there are changes to be pulled down, but if we are running this script every 15 minutes, vast majority of time there will be no changes, so we want to short-circuit when that happens.

renatav commented 1 month ago

I think that the biggest improvement improvement here would be to finish to stop the update process if there are no new authentication commits. One problem that I see with that is this:

I think that what we could do here is add a flag, which would allow you to specify if all steps should be run or if we can exit early.

get_target_repositories_commits is what fetches new target commits, which explains why it's so slow. I would like to see if only fetching specific branches would help. I ran git fetch --all and fetch of specific branches and it does not seem to me that there is a huge difference. We should properly test.

In terms of how much the update being faster would speed up the development, it would not have a big impact on the development of the updater, especially because that would actually make it more difficult to test steps which are executed later in the pipeline. By far the biggest issue here is lack of a testing framework which allows us to recreate any state in which the repositories can be. I would think that the faster update would make it way easier to work on codify. @n-dusan can confirm this.

This parallelization did actually speed up the update when there are no changes. On my machine, it's now about 25% faster, so provided that there are no regression, we can release this change.

dgreisen commented 1 month ago

I think that what we could do here is add a flag, which would allow you to specify if all steps should be run or if we can exit early.

Default should be to exit early. I don't see a need for a flag for end users, but if you need one for testing, that is fine.

  • if there are no new authentication commits, there could still be new pushes to target repositories

I'm pretty ok with not knowing if there are unauthenticated target commits at the end of the validated target commits. Definitely not worth pulling target commits just for that purpose.

  • if these target repositories are not allowed to contain unauthenticated repositories, that could be indicative of malicious pushes, or an error on our end (like incomplete merge) which should be detected

I think it's actually better that we don't know about these. that way a malicious person without access to key would be unable to break the update.

  • if these repositories are allowed to contained unauthenticated commits, this is not an error. The updater used to merge this commits, but I believe that that is no longer the case, so it's only the previous scenario that is problematic

correct it no longer does, now it just tells you there are commits we can merge. those are really just for the convenience of the developer, an end user should never be using them.

renatav commented 1 month ago

If there are invalid commits following valid ones, the updater will now partially update the repositiies. If we were to ignore invalid additional commits compared to how the updater is currently handling them, only the update status would be different and the updater would execute different scripts. When running the build command, a user has to pull down new changes first. They can still run build on top of invalid new commits that got pulled down. If we were to run validation from the first commit preceeding the first newly built auth commit, the validation would fail. If we ran it from the first new commit, I think that it would be successful. If an advanced user like Tom ran the uodater and then buld, with the uodater reporting no new changes and build failing to start, they might just manually pull. This can all easily happen if our CI for some reason partially merged commits. Maybe the answer here is to move some of those responsibilities to platform if we do not want the updater to provide such information to all end-users. We've had too many issues with the repositories caused by invalid commits, especially if the are not the newest ones