simplesurance / baur

An incremental task runner for mono repositories.
GNU General Public License v2.0

inputfiles: use git object IDs as input file content digests #502

Closed fho closed 8 months ago

fho commented 8 months ago

If the baur repository is also a git repository and the git binary is available, baur now uses git object IDs as file content digests. The object IDs of all tracked files are read from the git repository. If an input file is tracked and unmodified, its existing git object ID is used. This does not work for files that are part of a git submodule. Depending on the number of input files and the hardware, this can be much faster. It saves CPU time because the digest does not need to be calculated, and disk I/O because the input files no longer have to be read. Only "git ls-files" needs to be executed and its output parsed. The digest of an InputFile is still a SHA384. It is the hash of the aggregation of the git object ID of the content and the SHA384 of its relative file path.
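The following is a minimal sketch of the two steps described above, not the code from this PR. It assumes that the object IDs are obtained with `git ls-files --stage -z` (the PR only names "git ls-files", the flags are an assumption) and that the aggregation is the object ID followed by the SHA384 of the relative path (the exact order and encoding are also assumptions). The function names `parseLsFiles` and `inputFileDigest` are hypothetical. Detecting modified files is not covered here.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
	"strings"
)

// parseLsFiles parses the output of `git ls-files --stage -z`.
// Each NUL-terminated record has the form "<mode> <object-id> <stage>\t<path>".
// It returns a map from repository-relative path to git object ID.
func parseLsFiles(out string) (map[string]string, error) {
	ids := make(map[string]string)
	for _, rec := range strings.Split(out, "\x00") {
		if rec == "" {
			continue // trailing NUL produces an empty last element
		}
		meta, path, ok := strings.Cut(rec, "\t")
		if !ok {
			return nil, fmt.Errorf("malformed record: %q", rec)
		}
		fields := strings.Fields(meta)
		if len(fields) != 3 {
			return nil, fmt.Errorf("malformed record: %q", rec)
		}
		ids[path] = fields[1]
	}
	return ids, nil
}

// inputFileDigest combines a git object ID with the SHA384 of the relative
// path and returns the SHA384 of both as a hex string.
// Assumption: the aggregation order (object ID first, then path digest)
// is chosen for illustration only.
func inputFileDigest(gitObjectID, relPath string) string {
	pathDigest := sha512.Sum384([]byte(relPath))
	h := sha512.New384()
	h.Write([]byte(gitObjectID))
	h.Write(pathDigest[:])
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Example record as `git ls-files --stage -z` would print it.
	out := "100644 ce013625030ba8dba906f756967f9e9ca394464a 0\thello.txt\x00"
	ids, err := parseLsFiles(out)
	if err != nil {
		panic(err)
	}
	for path, id := range ids {
		fmt.Println(path, inputFileDigest(id, path))
	}
}
```

Including the path digest means that two files with identical content but different paths get different InputFile digests, while the file content itself is never read.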

If a file is untracked, tracked but modified, or part of a git submodule, "git hash-object" is executed to calculate the object ID. This ensures that the stored baur object ID for the file stays the same if the file is later added to the repository and the task is rerun.
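The fallback rule can be sketched as follows. This is an illustration under assumptions, not baur's implementation: `resolveObjectID`, `gitHashObject` and the `tracked`/`modified` maps are hypothetical names, and how baur determines the modified set is not stated in this PR. Files inside a submodule are assumed to be absent from the `tracked` map, so they take the fallback path as well.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// hashFunc computes a git object ID for the file at path.
type hashFunc func(path string) (string, error)

// gitHashObject shells out to `git hash-object` to compute the object ID
// of a file without adding it to the object database.
func gitHashObject(path string) (string, error) {
	out, err := exec.Command("git", "hash-object", "--", path).Output()
	if err != nil {
		return "", fmt.Errorf("git hash-object %s: %w", path, err)
	}
	return strings.TrimSpace(string(out)), nil
}

// resolveObjectID returns the known object ID for tracked, unmodified files
// and falls back to hash for everything else (untracked, modified, or not
// listed at all, such as files inside a submodule).
func resolveObjectID(path string, tracked map[string]string, modified map[string]bool, hash hashFunc) (string, error) {
	if id, ok := tracked[path]; ok && !modified[path] {
		return id, nil
	}
	return hash(path)
}

func main() {
	tracked := map[string]string{"main.go": "ce013625030ba8dba906f756967f9e9ca394464a"}
	modified := map[string]bool{}

	// main.go is tracked and unmodified, so git is not executed here.
	id, err := resolveObjectID("main.go", tracked, modified, gitHashObject)
	fmt.Println(id, err)
}
```

Passing the hasher as a parameter keeps the decision logic independent of the external command, so it could later be replaced by an in-process implementation.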

Executing "git hash-object" is slower than calculating the SHA384 digest in baur itself. The typical baur use case is running in CI on a checked-out git repository, where all files are tracked. Therefore the overhead is negligible. This can be improved in a follow-up by calculating the object ID in baur without running an external command.

If the baur repository is not a git repository, file digests are calculated as before.

Closes #466

nocive commented 8 months ago

@FranciscoKurpiel please review