microsoft / azure-pipelines-agent

Azure Pipelines Agent πŸš€

Working directory needs more efficient re-use or cleanup #1506

Open bergmeister opened 6 years ago

bergmeister commented 6 years ago

Agent Version and Platform

VSTS Type and Version

VSTS but agent is on-premise

What's not working?

We created new agent machines in Azure. For fast builds, the working directory is on the D drive, which is only around 30 GB. We have a big monolithic repo with multiple build definitions, and the agent keeps a separate checkout on disk for each build definition and branch even though they all point to the same repository. Git was designed for fast branch switching, so I do not see why this is necessary. As a result, the agent runs out of disk space after a few hours of builds. Adding cleanup steps to our builds, as proposed in #708, is not a reasonable solution. Therefore the agent either needs a setting to clean up the working directory afterwards or needs to be more efficient at re-using the same repository across different branches and build definitions.

TingluoHuang commented 6 years ago

@bergmeister Just to let you know, we do have a feature in our backlog to share the .git folder across definitions.

bergmeister commented 6 years ago

@TingluoHuang Thanks for the clarification. What is the timeline for it? It would be great to have public tracking for that. And why does the agent create a separate folder for each build definition, even if it is from the same repo? From my point of view, all one needs is one folder per git remote for all build definitions.

ericsciple commented 6 years ago

@bergmeister Would a shared .git folder help in your case? The way we are thinking about the feature, each definition would still get its own build directory for the checkout and other folders.

Also have you considered the shallow fetch option?

How many build definitions do you have?

bergmeister commented 6 years ago

@ericsciple It would help but not solve the problem (I am already using shallow fetch). The repo is a couple of gigabytes, its size triples when fully compiled, and we have around 10 build definitions to build sub-components. Appropriately sized Azure VMs have only 16-32 GB of disk space on the temp disk (which I should be able to use because it is much more performant and gives me free cleanup when the machine shuts down overnight). The way I think about it is that you only need one folder per repository. Git was designed to switch branches very fast. One agent can only run one build at a time, so there are no concurrency issues, and I believe every build runs something like git clean -dfx at the start anyway. Why would you want/need a separate folder for each build definition?

ericsciple commented 6 years ago

Note, the sources directory is a subdirectory within the build directory. git clean only cleans the sources directory.

bryanmacfarlane commented 6 years ago

Yeah, as long as we keep it down to per agent, then we assure an agent runs one job at a time so no concurrency. Very important. The shared .git would basically allow many definitions to share the .git on disk. Technically we don't even have to repoint .git via config - we just need to ensure it's keyed by the git repo location instead of definition + repo.

ericsciple commented 6 years ago

@bryanmacfarlane it sounds like we may want to consider sharing the checkout folder too, or the entire build directory. Sharing the entire build directory introduces more challenges for multi-checkout, so I would rather not go that far. Note the cloud build scenario is also the entire build directory, although I never got an answer whether they need to share the entire build directory, or whether .git folder is sufficient.

ericsciple commented 6 years ago

@bergmeister how many megabytes are all of the checked-out files? (exclude the .git folder from the calculation)

bergmeister commented 6 years ago

@ericsciple I would need to check when I am back in the office, but it is around 1-2 GB. However, this issue is not about how big the repo is or how many build definitions I use; it is about the agent keeping multiple clones of the same repo, and I do not see a reason why it should. Also, there is no built-in setting to do a cleanup along the lines of git clean -dfx after each build, which is not a problem for me but could be a problem for someone with lots of small repos instead of one monolithic one.

ericsciple commented 6 years ago

@bergmeister I'm also curious how big the .git folder is. I do understand that, ideally, you want to reuse the same build directory across multiple definitions. Simply sharing the .git folder (the cloned repo) gets a lot of customers a long way. Sharing the entire build directory would need to be opt-in, or use a tag to control which definitions share the same folder, or something similar. It would break compat if every existing build definition started sharing the entire build directory - other directories exist inside the build directory, not just the sources folder.

The only way our infrastructure supports this today is with the Don't sync sources option. You would also need to check the enable-scripts-access-to-oauth-token checkbox. The get-sources step prints the command lines it runs (fetch/checkout). You would need to run similar command lines yourself; the environment variable SYSTEM_ACCESSTOKEN contains the credential (masked as *** in the logs). The step could be wrapped up in a script in your repo, or a task group, and reused across multiple definitions.
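
A minimal sketch of what such a reusable step could look like (PowerShell; the shared path C:\shared\MyRepo is an illustrative assumption, and the auth header mirrors the pattern the get-sources step prints in its logs):

# Shared clone reused across definitions (illustrative path, not an agent convention).
$repo = 'C:\shared\MyRepo'
# SYSTEM_ACCESSTOKEN is only populated when scripts are allowed to access the OAuth token.
$auth = "AUTHORIZATION: bearer $env:SYSTEM_ACCESSTOKEN"

if (-not (Test-Path (Join-Path $repo '.git'))) {
    git -c "http.extraheader=$auth" clone $env:BUILD_REPOSITORY_URI $repo
}

git -C $repo -c "http.extraheader=$auth" fetch origin
git -C $repo checkout --force $env:BUILD_SOURCEVERSION
git -C $repo clean -dfx   # approximate the agent's usual clean between builds

Later build steps would then need to consume sources from that shared path rather than the default Build.SourcesDirectory.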

bergmeister commented 6 years ago

@ericsciple I do not see how sharing the same folder is breaking back-compat:

And even if there is a special/hairy case, then it should at least be a configurable option of the vsts build agent to have a centrally shared checkout, which would help many people.

ericsciple commented 6 years ago

I think we are saying similar things using different terminology. By build directory, I am referring to the directory specified by the AGENT_BUILDDIRECTORY variable:

AGENT_BUILDDIRECTORY=D:\a\1
BUILD_BINARIESDIRECTORY=D:\a\1\b
BUILD_SOURCESDIRECTORY=D:\a\1\s
COMMON_TESTRESULTSDIRECTORY=D:\a\1\TestResults
SYSTEM_ARTIFACTSDIRECTORY=D:\a\1\a

alexdrl commented 6 years ago

We're having the same problem with our repository. The code is in a Git repository with a size of 100 MB (including the .git folder, which is 45 MB). The problem is that we have a lot of builds (almost 50) that point to the same repository, which means each agent alone holds about 5 GB of source code; multiplied by 8 agents on one machine, that is 40 GB of duplicated data. We would be happy if the agent shared each source repository folder, as long as the following assumption is met:

> Yeah, as long as we keep it down to per agent, then we assure an agent runs one job at a time so no concurrency.

If the sources directory were shared between builds that map to the same Git repository, each agent would only hold 100 MB of source, with the artifacts kept in each build directory.

Also, to save some space, we needed to run git clean -fdx after each build. Because the sources folder is not shared, agents that are not currently running builds sit on a lot of unused DLL files, which only get removed when the agent runs the same build again, and that might not happen for a long time.

While writing this post, I tried to change the $(Build.SourcesDirectory) variable to $(Agent.WorkFolder)\s, with no luck, as it seems that the path is fixed in SourceFolder.json. Is there any workaround that does not involve the Don't sync sources option?

oskarm93 commented 6 years ago

We have the same issue at my company. We use private Azure VMs as build servers, and many of them have to be sized up to the bigger Dv2 series rather than Bms, because they require 100 GB+ temporary drives for work directories. Clean-up steps are well and good, but they seem like workarounds for agent inefficiency. When I look at an agent's work folder, it usually contains ~15 folders with the same content, eating up space like it's candy. I don't see a reason not to have just one folder per Git repo per agent.

jsheetzati commented 6 years ago

Also running into similar issues with a 1GB legacy git repo + multiple build definitions against the same repo. Shallow fetch helps but does not fully solve the problem.

alexdrl commented 6 years ago

Any timeline on this? Our agents are getting bigger and bigger with new build definitions, and growth of the repository.

glaenzesch commented 6 years ago

Hi πŸ‘‹

We had the same issue and solved it with the "Maintenance" tab of the agent pools. Important: this setting is only available at the "collection level".

(screenshot: the Maintenance tab in the agent pool settings)

With this setting, TFS sends a maintenance job to the agent and it will clean up the working folder.

We use Team Foundation Server 2018.1 on-premise. Hope this helps πŸ˜‰

alexdrl commented 6 years ago

@glaenzesch This is only half a solution, as the working directory still fills up with new folders, one per build definition, as builds get queued...

ppejovic commented 6 years ago

To get around this in our on-prem instance of TFS, I've written a custom build task, typically added as the last step of the build, which moves the repo into a shared location (if it is not already there) and then updates the build sources directory path for the definition (in SourceFolder.json) with the new location. The next time the build runs, it uses the repo in the shared location.

The shared folder the repo is added to is a hash of the following (in PowerShell):

"$agentId\$collectionId\$teamProject\$repository"

This means there is one shared repo per agent, so there are no concurrency issues. I'm sure fiddling with SourceFolder.json isn't a supported scenario, but we have a huge repo to contend with and it has worked nicely for us so far.
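
A minimal sketch of how such a per-agent shared path might be derived (PowerShell; the hashing approach and the D:\_shared root are illustrative assumptions, not the task's actual implementation):

# Key the shared clone on agent + collection + project + repository so two agents
# never share a working copy (keeps the one-job-per-agent guarantee intact).
$key = "$env:AGENT_ID\$env:SYSTEM_COLLECTIONID\$env:SYSTEM_TEAMPROJECT\$env:BUILD_REPOSITORY_NAME"

# Hash the key so the folder name stays short and filesystem-safe.
$md5   = [System.Security.Cryptography.MD5]::Create()
$bytes = [System.Text.Encoding]::UTF8.GetBytes($key)
$hash  = ($md5.ComputeHash($bytes) | ForEach-Object { $_.ToString('x2') }) -join ''

# One shared clone per agent/repo pair, e.g. D:\_shared\3fa0c1...
$sharedRepoDir = Join-Path 'D:\_shared' $hash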

bergmeister commented 6 years ago

@TingluoHuang Any progress/timeline on this?

alexdrl commented 6 years ago

@ppejovic As Microsoft does not seem to have this marked as urgent or with priority, could you share the code of the custom build task?

Thank you in advance.

littleninja commented 6 years ago

@alexdrl I tried writing my own script but ran into the problem of another process (the build/release?) using files in the build directory. We only just started using a free extension, Post Build Cleanup.

Helpful resources:

MichelZ commented 5 years ago

Would you mind updating us if this is still on the radar for the agent, and if you might be able to share a rough timeline? (2019H1? 2019? Next 3 years?) @TingluoHuang @ericsciple @bryanmacfarlane

alexdrl commented 5 years ago

This would be ideal. We "solved" this problem by activating deduplication in Windows Server 2016, which in turn is a mess, because when the agents' maintenance job executes, the free space drops to 0 and errors are sometimes thrown. We tried to modify the builds as @littleninja and @ppejovic suggested, but had no luck.

It is curious: if Microsoft uses this agent code itself, why has this optimization not been implemented?

alexdrl commented 5 years ago

Is https://github.com/Microsoft/azure-pipelines-yaml/pull/113 going to help with repository caching? Builds on each agent do not cache the cloned repository, which also hurts performance, because every fresh git clone slows the build down a lot. We think package restore caching is great, but improving repository download times (which also affects people who do not use packages) is necessary.

bergmeister commented 5 years ago

The caching feature will not help my case (agents on Azure VMs) because as long as each build definition clones its own repository I still have to pay for the disk space of repositories that are lying around.

oskarm93 commented 5 years ago

Today I ran out of disk space on the build servers AGAIN. Using self-hosted build servers on Azure IaaS VMs. B4ms series with 32 GB temp drive. We have 8 build agents' work folders on the temp drive. Here's a visual representation of what my problem is:

(screenshot: disk usage of the agents' work folders on the 32 GB temp drive)

Why do many agents have separate folders called 1, 2, 3 under them? I would understand if they stored different git repositories, so you wouldn't have to clone each from scratch every time. But in my case they do not. All of these directories contain the same git repository. Even if I build different branches, I would want the same git folder to be used, with just different branches checked out there. This would be the first step to massively reduce duplication.

Interestingly enough, releases are already well behaved when it comes to cleaning up after themselves. A release definition, no matter where the artifacts are coming from, will always go into the same r1, r2 folders, and will clean up the contents before re-downloading required artifacts. This is why we never have to clean up our test machines, because they just cycle through the same amount of storage space.

Only then do we need to talk about caching the _tool and _tasks folders. I understand that agents may be running under different user accounts, but if they aren't, can the tools not be stored under the user profile folder instead? It would save my steps from re-downloading .NET Core, NuGet, VSTest, Helm etc. every time I clean the temp drive.

bergmeister commented 5 years ago

@xenalite I agree with you, but the D series are much better in terms of how much temporary disk space you get for the money (the B4ms has only 32 GB for $128.71).

balchen commented 5 years ago

Fully support any action on this initiative. Our repo is 9 GB after it's been built and we currently have 22 build definitions running on the same repo. Sharing the repo would be a huge boost, even if each definition has a separate build folder.

mtrobbin commented 5 years ago

We too are running into the same problems. If each build job just reused the same working dir and git repo, that would solve it. Instead we end up with a 10-minute checkout on every new branch, and we have to clean up old jobs so aggressively that every build is a fresh checkout.

I believe this is a standard feature in Jenkins too... which makes this even more annoying.

ArcanoxDragon commented 5 years ago

I'm running into the same problem with the build server at my employer. We have 5 agents running on the same machine, and upwards of 20 build definitions for various subprojects/environments/etc, but only one repository. Our repo is more than 2 GB on a fresh clone, so our build server filled up very fast. I found out I was able to use directory junctions to temporarily alleviate this issue while a proper fix is being worked on. Our agents run in working directories C:\a0\_work through C:\a4\_work, so I used the following PowerShell script to set up directory junctions supporting 50 unique build definitions per agent:

foreach ($agent in 0..4) {
    $agentDir = "C:\a$agent"

    foreach ($buildId in 1..50) {
        $buildDir = "$agentDir\_work\$buildId"

        if ($buildId -eq 1) {
            # Build directory 1 is the real folder that ends up holding the single clone.
            if (!(Test-Path $buildDir)) {
                New-Item -ItemType Directory $buildDir
            }
        } else {
            # Every other build directory is removed and replaced with a symbolic link
            # pointing back at build directory 1, so all definitions share one checkout.
            if (Test-Path $buildDir) {
                Remove-Item -Force -Recurse $buildDir
            }

            New-Item -ItemType SymbolicLink $buildDir -Target "$agentDir\_work\1"
        }
    }
}

The source code is always cloned into the C:\a#\_work\1 folder because of the symbolic links. Each agent has its own unique copy of the repo, so concurrent builds still work with this setup (at least with the configuration we have, an agent will only be building one build definition at any given time).

(edit: I changed the script to use symbolic links; I was observing what I thought might be performance issues with directory junctions. I'm not positive on that, but symbolic links work just as well and I have not noticed issues with them)

AlberTajuelo commented 5 years ago

@bergmeister Any time estimate on this? We are thinking of moving back to Jenkins if this problem persists. :(

bergmeister commented 5 years ago

I do not work for Microsoft or on this repository. Maybe @ericsciple, @bryanmacfarlane or @TingluoHuang can give a timeline for this issue, which is the most upvoted one in this repo.

AlberTajuelo commented 5 years ago

Thanks @bergmeister ! :)

Do you have any idea about when this issue could be resolved? @ericsciple

Thanks a lot!

ppejovic commented 5 years ago

I previously left a comment - https://github.com/microsoft/azure-pipelines-agent/issues/1506#issuecomment-395883867 - about how we have been using a custom build task to stub in the more efficient cloning behavior on our self-hosted Windows pipeline agents for the past couple of years.

The build task has now been open sourced here:

https://github.com/OrbisInvestments/azure-pipelines-custom-tasks/

This task works if you are running a Windows agent version earlier than v2.149.2, which was released on Mar 29 this year.

v2.149.2 introduced this change - https://github.com/microsoft/azure-pipelines-agent/pull/2132 - which allows the clone path to be overridden by the user. Once I get more information on this change, and assuming MS doesn't do anything to address the actual issue in the meantime, the task can hopefully be updated to utilize this feature. Using the feature OOTB would mean users would have to manage individual clone paths but this is something the task could handle automatically.

Any feedback or comments on the above repo would be most welcome.

ADD-Juan-Perez commented 4 years ago

@ppejovic

Thank you very much for sharing the custom build task.

At the moment we cannot use it because our agents are on version 2.149.2.

Any news from Microsoft about this change #2132?

ppejovic commented 4 years ago

@juanperezADD I tried to see if I could use the feature introduced in #2132, but it's not going to work. The path override only allows you to relocate the checkout to a path under the agent build directory (e.g. c:\_work\1). That is still specific to each pipeline, so again you are left cloning the repo for every pipeline. All that override seems to offer is the option to place the clone in a directory other than the default one named s.

mtrobbin commented 4 years ago

I was able to implement a workaround here. In our pipeline, the first step sets up a symlink (mklink with the /d option) from BUILD_SOURCESDIRECTORY to a common location that already contains a cloned git repository. The second step in the pipeline does the git checkout operations. This has been working quite well for a couple of weeks and has massively reduced the checkout time and the disk space needed on the build servers.
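
A minimal sketch of such a first step (written in PowerShell instead of raw mklink, and assuming a shared clone already exists at the illustrative path C:\shared\MyRepo):

$sharedRepo = 'C:\shared\MyRepo'            # existing shared clone (assumed path)
$sources    = $env:BUILD_SOURCESDIRECTORY   # per-definition sources folder created by the agent

if (Test-Path $sources) {
    $item = Get-Item $sources
    if ($item.LinkType -eq 'SymbolicLink') {
        $item.Delete()                       # drop only the stale link, never the shared clone
    } else {
        Remove-Item -Force -Recurse $sources # replace the agent's real folder with a link
    }
}

# Equivalent to: cmd /c mklink /d "%BUILD_SOURCESDIRECTORY%" "C:\shared\MyRepo"
New-Item -ItemType SymbolicLink -Path $sources -Target $sharedRepo | Out-Null

The second step can then run its git fetch/checkout commands against Build.SourcesDirectory, which now resolves to the shared clone.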

ppejovic commented 4 years ago

@juanperezADD I've refactored the task to do something similar to what @mtrobbin is doing. I'll try to get it tested and released in the next day or two.

ppejovic commented 4 years ago

> @juanperezADD I've refactored the task to do something similar to what @mtrobbin is doing. I'll try to get it tested and released in the next day or two.

I've released a pre-release version of the task that uses symlinking so it should be compatible with all agent versions now (I've tested with v2.155.1):

https://github.com/OrbisInvestments/azure-pipelines-custom-tasks/releases/tag/v1.0.0-preview

Symlinking requires the account the agent runs under to have admin privileges on the local machine. I believe PS v5 is also required for some of the cmdlets that now provide symlink info (i.e. Get-Item).

If you have an opportunity to test, feedback is welcome. If you've already tried an earlier version of the task, I'd recommend starting with a fresh agent working directory, as the changes are breaking.

It may be worth looking at providing this as a pipeline decorator so that it applies to all pipelines without having to add it explicitly as a step. PRs welcome!

balchen commented 4 years ago

> Symlinking requires the account the agent runs under to have admin privileges on the local machine. I believe PS v5 is also required for some of the cmdlets that now provide symlink info (i.e. Get-Item).

PS >= v6 has been required the whole time since the script crashes on parsing its own JSON config file because of this bug: https://github.com/PowerShell/PowerShell/issues/3284

ADD-Juan-Perez commented 4 years ago

@ppejovic

The pre-release version works like a charm.

> PS >= v6 has been required the whole time since the script crashes on parsing its own JSON config file because of this bug: PowerShell/PowerShell#3284

Our build machines have PowerShell version 5.1.

Thank you very much for your work.

ppejovic commented 4 years ago

> Symlinking requires the account the agent runs under to have admin privileges on the local machine. I believe PS v5 is also required for some of the cmdlets that now provide symlink info (i.e. Get-Item).

> PS >= v6 has been required the whole time since the script crashes on parsing its own JSON config file because of this bug: PowerShell/PowerShell#3284

@balchen in our case all the testing and usage of this task has been with PS v5. We can continue this conversation on an issue on the repo if you want to raise one.

@juanperezADD Happy to hear it’s working for you!

balchen commented 4 years ago

> @balchen in our case all the testing and usage of this task has been with PS v5. We can continue this conversation on an issue on the repo if you want to raise one.

> @juanperezADD Happy to hear it's working for you!

No need to raise an issue, the problem isn't with this repo. PS has a bug (referenced above) that caused the build task to fail on reading the configuration file using ConvertFrom-JSON (that it had previously written using ConvertTo-JSON). The issue was only present on our Windows 2012 build servers with old Powershell, and a Powershell upgrade resolved it.

But you're right. The upgrade was to PS v5, not to PS v6, so they must have included the bug fix in a maintenance release of v5 or something.

Either way, you're not going to run this build task on PS v4.

mgenware commented 4 years ago

We ran into this; I wrote a script and published it to npm so we can do something like "delete the _work folder if free space is less than 50 GB".

Add a run command task at the end of your job:

# Delete the `_work` folder if free space is less than 50 GB
npx oh-my-disk@1 50gb "rm -rf $(agent.workfolder)" || true

It worked fine for a while. But recently it triggered some new issues: some types of Azure tasks automatically queue certain built-in tasks; in our case, the Xcode build task queues a task called "Post-job ruby", which tries to touch the _work folder and fails every time the script above runs.

japj commented 4 years ago

@mgenware You can make your own task that executes your oh-my-disk script in a postjobexecution. See also https://mitchdenny.com/cleaning-up-from-cancelled-vsts-builds/ for background. The postjobexecution steps should execute in reverse order of the tasks in the build definition.
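
For reference, a rough sketch of the task.json keys involved (a trimmed, illustrative fragment rather than a complete manifest; the GUID, names, and script files are placeholders):

{
  "id": "00000000-0000-0000-0000-000000000000",
  "name": "DiskSpaceGuard",
  "friendlyName": "Disk space guard",
  "instanceNameFormat": "Guard free disk space",
  "execution": {
    "PowerShell3": { "target": "noop.ps1" }
  },
  "postjobexecution": {
    "PowerShell3": { "target": "cleanup.ps1" }
  }
}

The script named under postjobexecution runs after the job's other steps finish, which is where a check like oh-my-disk could live.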

mgenware commented 4 years ago

Hi @japj, thanks for the information, that looks awesome! Sadly we're still using classic UI pipelines; does anyone know if postjobexecution is possible in a classic UI pipeline?

ppejovic commented 4 years ago

@mgenware postjobexecution is a behavior of a build task so it's possible in classic UI pipeline or YAML. You just need to write a custom task to package up your script.

This is exactly what our task does: https://github.com/OrbisInvestments/azure-pipelines-custom-tasks. This task takes a different approach to reducing disk usage and works by sharing a single clone between pipelines that build from the same repository. It is currently a Windows-only task due to a dependency on PowerShell.

balchen commented 4 years ago

We use OrbisInvestments' task (ported to JavaScript to run cross-platform) with great success. We've combined it with a RAM disk on macOS to speed up the build process, which helped tremendously. The RAM disk is of course destroyed when the machine restarts, and the repo needs to be cloned on the first subsequent build, but other than that first time, the combination of the two has dramatically decreased our build times and disk usage.

jespergustinmsft commented 4 years ago

It's been a while since there was any action here. Is there any work going on to address this issue?

We're also running into the same problem (and more so now) after splitting our build definition into multiple parts using the -template property in our yml files to improve performance.