timja / jenkins-gh-issues-poc-06-18


[JENKINS-9104] Visual studio builds started by Jenkins fail with "Fatal error C1090" because mspdbsrv.exe gets killed #6408

Closed timja closed 6 years ago

timja commented 13 years ago

I run into errors when using a customized build system that uses Visual Studio's devenv.exe under the hood to compile Visual Studio 2005 projects (with the VC++ compiler). When two parallel builds are started with Jenkins (on different code bases), the second job always fails with "Fatal error C1090: PDB API call failed, error code '23' : '(" in exactly the same second the first job finishes processing. Running both jobs outside Jenkins does not produce the error.
This has also been reported for builds executed by MSBuild on the Jenkins user mailing list [1].

I analysed this issue thoroughly and can trace the problem to the usage of mspdbsrv.exe. This program is spawned automatically when building a Visual Studio project. All Visual Studio instances normally share one common PDB server, which shuts itself down after an idle period (the default is 10 minutes). "It ensures access to .pdb files is properly serialized in parallel builds when multiple instances of the compiler try to access the same .pdb file" [2].
I assume that Jenkins cleans up its build environment when an automatically started job finishes (as described at http://wiki.jenkins-ci.org/display/JENKINS/Aborting+a+build). I checked mspdbsrv.exe with Process Explorer, and the process indeed has the variable JENKINS_COOKIE/HUDSON_COOKIE set in its environment if it was started through Jenkins. Killing mspdbsrv.exe while projects are still connected breaks compilation.

Jenkins mustn't kill mspdbsrv.exe to be able to build more than one Visual Studio project at the same time.


[1] http://jenkins.361315.n4.nabble.com/MSBuild-fatal-errors-when-build-triggered-by-timer-td385181.html
[2] http://social.msdn.microsoft.com/Forums/en-US/vcgeneral/thread/b1d1bceb-06b6-47ef-a0ea-23ea752e0c4f/


Originally reported by gordin, imported from: Visual studio builds started by Jenkins fail with "Fatal error C1090" because mspdbsrv.exe gets killed
  • assignee: danielweber
  • status: Resolved
  • priority: Major
  • resolution: Fixed
  • resolved: 2018-05-07T15:04:36+00:00
  • imported: 2022/01/10
timja commented 13 years ago

markewaite:

You might refer to JENKINS-3105 for a workaround or alternative technique. I assume the process tree killer is what is killing the mspdbsrv.exe process, and disabling the process tree killer for these jobs may avoid the problem.

timja commented 13 years ago

gordin:

Yes, you are right. Thanks a lot for pointing this out. I set BUILD_ID=dontKillMe globally in the Jenkins configuration and can confirm that mspdbsrv.exe is not killed any more which works around the problem.

timja commented 13 years ago

gordin:

changed priority to Minor because an easy work around is available

timja commented 12 years ago

chantivlad:

"Jenkins mustn't kill mspdbsrv.exe to be able to build more than one Visual Studio project at the same time."

Although I will try the indicated workaround, I think it would be good to have this as a feature somewhere, maybe in the node configuration?

timja commented 12 years ago

danielweber:

How about adding a configurable list of process names which shall not be killed by the "process tree killer"?
The workaround described above leaves mspdbsrv.exe running, but also any other "dangling" processes a build
might have left behind. I'd still like to use the process tree killer in general, it should just not kill
mspdbsrv.exe.
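
To make the proposal concrete, here is a minimal Python sketch of how such an exclusion list could filter the kill candidates. It is purely illustrative: `filter_kill_candidates`, the `(pid, name)` tuples, the PIDs and the default list are all hypothetical and do not correspond to any existing Jenkins option or API.

```python
# Hypothetical sketch: none of these names exist in Jenkins.
DEFAULT_EXCLUDED = {"mspdbsrv.exe"}


def filter_kill_candidates(candidates, excluded=DEFAULT_EXCLUDED):
    """Return the (pid, name) pairs that may still be killed.

    Names are compared case-insensitively because Windows executable
    names are not case-sensitive.
    """
    excluded_lower = {name.lower() for name in excluded}
    return [(pid, name) for pid, name in candidates
            if name.lower() not in excluded_lower]


# Made-up leftovers of a finished build.
leftovers = [(4120, "cl.exe"), (4388, "MSPDBSRV.EXE"), (5012, "link.exe")]
print(filter_kill_candidates(leftovers))
```

With a list like this, dangling compiler or linker processes would still be cleaned up, while the shared PDB server would be left alone.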

timja commented 12 years ago

fnx:

Visual Studio is so ubiquitous that I personally think this could warrant a special case built into Jenkins, but anyway this is what we have just started doing (before calling devenv but after the MSC setup.bat has been called) to work around this (for all Windows builds):

:: PITA to keep MSPDBSRV alive
set ORIG_BUILD_ID=%BUILD_ID%
set BUILD_ID=DoNotKillMe
start mspdbsrv -start -spawn
set BUILD_ID=%ORIG_BUILD_ID%
set ORIG_BUILD_ID=

It seems, from reading around, that the Jenkins process “tree” killer rummages through the whole process tree looking for processes with the environment variable BUILD_ID set to what Jenkins originally set it to, and killing any it finds. The above temporarily changes that to something else (anything, you could even just blank it), launches mspdbsrv manually (so it has the altered env. var.) and then puts it back (to restore the usual no-resource-leak fix).

When you run mspdbsrv, it seems to immediately exit if there's already an appropriate version running, so it doesn't matter having multiple jobs trying to kick it off.
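
The matching described above can be modelled in a few lines of Python. This is a simplified sketch based only on the description in this comment (selection by the BUILD_ID environment variable), not Jenkins's actual implementation; `processes_to_kill`, the process dictionaries and the BUILD_ID value are made up for illustration.

```python
def processes_to_kill(processes, job_build_id):
    """Select the processes whose BUILD_ID matches the finished job's value."""
    return [p["name"] for p in processes
            if p["env"].get("BUILD_ID") == job_build_id]


job_build_id = "2011-03-21_10-15-00"  # hypothetical value set by Jenkins
processes = [
    {"name": "devenv.exe", "env": {"BUILD_ID": job_build_id}},
    # Started inside the block where BUILD_ID was temporarily changed:
    {"name": "mspdbsrv.exe", "env": {"BUILD_ID": "DoNotKillMe"}},
    # Unrelated process that never saw the variable:
    {"name": "explorer.exe", "env": {}},
]
print(processes_to_kill(processes, job_build_id))  # ['devenv.exe']
```

In this model, mspdbsrv.exe survives only because its environment was captured while BUILD_ID held the altered value; an instance spawned later by the compiler itself would carry the job's real BUILD_ID and be selected.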

timja commented 12 years ago

danielweber:

We adopted Neil's workaround, but found a small issue. The problem is that checking whether mspdbsrv is running does not reset its timeout. Consider this scenario, where the compiler is first used 5 min after build start:

00:00 job A starts -> mspdbsrv not yet running -> mspdbsrv is started with the changed BUILD_ID
01:00 job A finishes (mspdbsrv default timeout of 10 min starts here)
01:07 job B starts -> mspdbsrv still running -> mspdbsrv is not restarted
01:10 mspdbsrv timeout elapses, shuts down
01:12 job B tries to use mspdbsrv for the first time. As mspdbsrv is not running, it starts a new
instance with unchanged BUILD_ID, which is what we wanted to avoid in the first place.

I suggest changing the build script as follows:

start mspdbsrv -start -spawn -shutdowntime 2147483647

This sets the shutdown time to the maximum value of 2^31-1. This timeout (~68 years) should be long enough.
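
The scenario and the timeout arithmetic can be checked with a small Python sketch. It is a simplified model under stated assumptions: times are minutes since 00:00 as in the timeline above, the idle timer starts when job A finishes and is not reset when job B starts, and the `-shutdowntime` value is taken to be in seconds (which the ~68 years figure implies). The function name and its parameters are hypothetical.

```python
MAX_SHUTDOWN_SECONDS = 2**31 - 1  # 2147483647


def instance_used_by_job_b(idle_timeout_min, job_a_end=60, job_b_start=67,
                           first_compile_delay=5):
    """Tell which mspdbsrv instance job B's first compile reaches.

    Returns "protected" if it is an instance started with the changed
    BUILD_ID, or "unprotected" if the compiler has to spawn a new one.
    """
    expires = job_a_end + idle_timeout_min
    if expires <= job_b_start:
        # The server is already gone when job B starts, so job B's
        # pre-build step launches a fresh instance with the changed BUILD_ID.
        expires = job_b_start + idle_timeout_min
    first_compile = job_b_start + first_compile_delay
    return "protected" if first_compile < expires else "unprotected"


print(instance_used_by_job_b(10))                          # default timeout
print(instance_used_by_job_b(MAX_SHUTDOWN_SECONDS // 60))  # maximum timeout
print(round(MAX_SHUTDOWN_SECONDS / (365.25 * 24 * 3600), 1), "years")
```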

timja commented 12 years ago

gordin:

I would not recommend setting the shutdown time to such a high value. mspdbsrv.exe tends to hang after some time because of handle leaks (at least the version shipped with Visual Studio 2005), causing the compiler to exit with an error when trying to access pdb files. The mspdbsrv process must then be killed manually.

timja commented 12 years ago

yimin:

Is it possible to add the workaround to the next Jenkins release and solve the problem later?

timja commented 12 years ago

peteboyrocket:

Seconded Yimin Li's comment...

timja commented 11 years ago

laro:

Still happening with Visual Studio 2010 (under Windows 7), BTW.

timja commented 11 years ago

sweavo:

Just want to add my own "me too" to this. (Visual Studio 2010, Win7x64, MSBUILD plugin) My thanks to the thread contributors so far for saving me a lot of detective work.

timja commented 11 years ago

lukast_dev:

Just want to add my own "me too" to this. I see failed builds with fatal error C1090: PDB API call failed, error code '23'.
Visual Studio 2008, Windows Server 2003, MSBuild plugin, Jenkins 1.527.
I tried the workaround of setting BUILD_ID to dontKillMe, but that did not help.

timja commented 10 years ago

sweavo:

A (mostly unsatisfactory) workaround for me has been to use the Throttle Concurrent Builds plugin and make all MSBUILD projects members of the same category, with a concurrency limit of 1. This means that my builds are no longer failing for arbitrary reasons, but it also means that jobs like build_release_config_and_run_10_hour_integration_tests block the build_head_on_branch_x_and_barf_if_unit_tests_fail jobs.

timja commented 10 years ago

sweavo:

Hi all,

I've written a python script that basically implements reference counting and resetting of timeouts as a wrapper around MSPDBSRV.EXE.

I'm using this locally as of today and would love it if others could try it, improve it, give feedback, etc.

https://github.com/sweavo/pdbsrv_srv

I have an Execute Shell build step before my MSBUILD build step that contains:

set -e
# We must be in the script's dir because it may try to execute itself again
cd Tools/bin
python pdbsrv_srv.py -l ../../pdbsrv.log & 
echo pdbsrv_srv is pid $!
sleep 5

Note that the path supplied to -l should not be in the workspace because a lock will be held after the end of the job, and a subsequent build might then fail to delete the workspace if required.
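
For readers unfamiliar with the idea, reference counting around a shared service can be sketched as below. This is a generic illustration only, not the code from pdbsrv_srv: the class `RefCountedService` and its `start`/`stop` callbacks are hypothetical, and the real script additionally has to coordinate across separate processes and reset timeouts.

```python
class RefCountedService:
    """Keep a shared service alive while at least one client is using it."""

    def __init__(self, start, stop):
        self._start = start   # called when the first client arrives
        self._stop = stop     # called when the last client leaves
        self._clients = 0

    def acquire(self):
        if self._clients == 0:
            self._start()
        self._clients += 1

    def release(self):
        if self._clients == 0:
            raise RuntimeError("release() without matching acquire()")
        self._clients -= 1
        if self._clients == 0:
            self._stop()


# Two overlapping "builds" share one service instance.
events = []
service = RefCountedService(lambda: events.append("start"),
                            lambda: events.append("stop"))
service.acquire()   # build A begins, service starts
service.acquire()   # build B begins, service is reused
service.release()   # build A ends, service keeps running for B
service.release()   # build B ends, service stops
print(events)
```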

timja commented 10 years ago

leedega:

We have been using Jenkins LTS edition for over a year now without error. We recently updated to v1.532.3, also without error. However last night we just upgraded to v1.554.1 to get a couple of minor bug fixes and now we are experiencing this issue constantly. We have a fairly large CI farm with a dozen agents and ~700 jobs, and we are getting this new compilation error across the board. This suggests that whatever is causing this issue was somehow introduced in the v1.554.1 update.

That being said, I've noticed the comment thread above predates this release, so perhaps the problem was only affecting the latest non-LTS edition until now. Maybe this can help isolate the problem further.

timja commented 10 years ago

leedega:

Also, I can confirm that this bug is most likely caused by something killing the mspdbsrv.exe service while it is in use. We were able to reproduce this problem a long time ago when we first adopted Visual Studio 2008, well before we started using Jenkins. The way we reproduced the problem was independent of any tool, as follows:

1. Set up a background process / system service to run as a local user profile which will perform the build. Let's call this user 'MyUser'. This may be a scheduled task, a Jenkins agent, or any number of other service-oriented processes available on Windows.
2. Log in to the system locally using the MyUser profile.
3. Launch a compile using the background process.
4. Open Process Explorer or Task Manager and look at the processes launched by your active user. You'll see mspdbsrv.exe in that list.
5. Log the MyUser user out of the system.
6. The background compilation will fail.

Cause: when the background process is launched, Visual Studio spawns an independent process for mspdbsrv.exe, which is apparently used to synchronize file accesses across parallel builds. When the user profile associated with this background service is also logged in to the local machine, this new process is launched in the active user's thread space. When this local user then logs out from the local machine, the process is terminated, causing any other processes (such as those which continue compiling in the background) to fail because they depend on this service.

At the end of the day this is just a horrible design flaw in Visual Studio which has been in place since the introduction of their multi-threaded builds many years ago, and from what I've read on the forums it is considered a "feature by design" and is not expected to be fixed - ever. Consequently, the workaround we decided upon at the time was just to adopt the convention of never logging out of the user profile associated with our automated builds. In this way we avoid accidental termination of this critical process.

So now enter Jenkins. If what I have read is true and Jenkins tries to be smart by scanning the agent systems for "rogue" threads after the completion of each build and terminates them at will, then I have even more cause for concern. Those concerns aside, assuming this is being done for a good reason, I concur with an earlier comment that recommended that this termination logic have an explicit exception to prevent killing this particular process. Given that the information I have gathered and stated above is correct, this seems to be the only reasonable solution to this problem. Finally, if I am correct and this 'bug' was just introduced by the latest update to the LTS edition, then conceivably it should be easy to isolate and promptly fix.

timja commented 10 years ago

leedega:

I also just noticed that the severity of this issue has been set to "minor"; however, I would recommend increasing it to "production stopped". In our case, with ~700 jobs spread across a dozen powerful build servers, this bug is causing dozens of superfluous build failures per hour, making this latest update to the LTS edition completely unusable.

timja commented 10 years ago

danielbeck:

Aren't you able to launch that service manually instead of having it launched by the first build to come along?

timja commented 10 years ago

danielbeck:

Recursive process killing on Windows was added between versions 1.16 (1.532.x) and 1.19 (1.554.x) of that library, see here.

Workaround could be to block loading of winp in some way.

timja commented 10 years ago

gordin:

I set the severity to minor because an easy workaround is available. What is the reason you can't use "BUILD_ID=dontKillMe" environment variable? This disables the process killer for the job (or globally if set for the whole Jenkins instance). Generally I think the process killer is a good thing, but normally it shouldn't be needed.

timja commented 10 years ago

danielbeck:

Christoph: Have you tried this on Windows with winp doing the killing? I think it works differently there.

timja commented 10 years ago

gordin:

Sorry, no, I haven't tried this with winp (to be honest, I don't even know what winp is). I will try with a current Jenkins setup.

timja commented 10 years ago

danielbeck:

Winp is the library doing the recursive killing on Windows (if available), and that is what was fixed between 1.532.x and 1.554.x, so Kevin is correct that this was changed between these versions.

There are a few reported issues related to winp not working reliably; maybe one of them can be exploited as a workaround to prevent it from killing mspdbsrv.

timja commented 10 years ago

sweavo:

"Aren't you able to launch that service manually instead of having it launched by the first build to come along?"

This is covered in the comments. The service times out, which can still happen mid-build. If you set the timeout long, then you risk memory leaks.

timja commented 10 years ago

gordin:

Setting the BUILD_ID to "dontKillMe" still works as expected with Jenkins 1.554.1 LTS. Even though I'm not able to test the original set up (as I don't use VS and mspdbsrv.exe any longer) a new process spawned during run with python subprocess.Popen() will not be killed by the process tree killer. Running without setting the BUILD_ID will kill the subprocess as expected.
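
A minimal Python sketch of the kind of test described above: the child gets an overridden BUILD_ID through the `env` argument of `subprocess.Popen`, while the parent's own environment stays untouched. The helper name `spawn_with_build_id` is hypothetical, and whether the process tree killer actually spares the child depends on the Jenkins version and platform, as discussed in this thread.

```python
import os
import subprocess
import sys


def spawn_with_build_id(cmd, build_id="dontKillMe", **popen_kwargs):
    """Start cmd with BUILD_ID overridden in the child's environment only."""
    env = dict(os.environ)      # copy, so the parent's BUILD_ID is unchanged
    env["BUILD_ID"] = build_id
    return subprocess.Popen(cmd, env=env, **popen_kwargs)


# Demo: the child simply reports the BUILD_ID it sees.
child = spawn_with_build_id(
    [sys.executable, "-c", "import os; print(os.environ['BUILD_ID'])"],
    stdout=subprocess.PIPE,
)
output, _ = child.communicate()
print(output.decode().strip())
```

Passing a modified copy of the environment avoids the save-and-restore dance needed in a batch script, because the build's own BUILD_ID is never altered.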

timja commented 10 years ago

danielbeck:

Great! In that case, this is not a defect but behaves as intended.

What would be a good location to document setting BUILD_ID to prevent process killing? Obviously, there's a need there...

timja commented 10 years ago

gordin:

It is already documented at https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller

timja commented 10 years ago

danielbeck:

Random wiki pages aren't exactly discoverable. Unless you know it's there you wouldn't even bother searching.

Maybe add to the description of shell/batch build steps that launched processes are cleaned up after the script exits, and that this can be disabled?

timja commented 10 years ago

gordin:

I don't know if the build step is the right scope for the documentation. The descriptions of the different build steps are provided by the plug-ins, aren't they? Would be hard to have a consistent message across all plug-ins. At least python, msbuild, shell, windows batch come to my mind. Maybe also groovy, qmake, cmake and others that provide an api to spawn a process.
In order to solve this issue it would be nice to have some sort of process name white-list with processes that will never be killed by the tree killer. This could then be configured globally (master/per slave). What do you think?

timja commented 10 years ago

danielbeck:

Christoph: Shell and Batch are both in core and the most straightforward choices for launching programs. Specialist plugins might not even allow this much flexibility.

I'd just like to get more discoverability, and that doesn't need annotating at every conceivable location. If users know this solution from the Batch/Shell descriptions, and maybe transfer it over to similar plugin-provided builders, that's perfect.


Tree killer configuration might be helpful, but that should be filed in a new issue. AFAICT this needs to touch a lot of parts, so this would be a rather large project.

timja commented 10 years ago

leedega:

I have a few follow up questions:
1. If I understand correctly, this "process tree killer" feature was pre-existing in earlier Jenkins releases, but only in the latest update was it "changed" to add recursive killing of processes, correct?

2. That being the case, does setting "BUILD_ID=dontKillMe" disable termination of all processes or just this new "recursive" behavior? If it disables all process terminations I'd say this proposal would not be a viable workaround since it could risk leaving other rogue processes orphaned on a build machine, which has many adverse side effects (which, I'm guessing you already know since I suspect this feature was implemented to resolve these exact problems)

3. Won't setting the "BUILD_ID=dontKillMe" affect other parts of the build? The BUILD_ID env var is used as a unique identifier throughout the job after all. Changing it from the unique identifier it is meant to be, to a statically defined character string seems fragile at best.

timja commented 10 years ago

leedega:

So far, based on the recent comment threads, my admittedly superficial understanding of the root cause, and some quick Googling, it seems there are only a few viable options to resolve this issue:

1. A python script was written by an earlier commenter, which leverages the BUILD_ID env var to strategically control the lifetime of the pdbsrv process itself without affecting other parts of the build.

2. Roll back the version of this "process tree killer" used by Jenkins LTS to v1.16, before this new "recursive" behavior was added according to an earlier comment.

3. Provide some kind of workaround within the "process tree killer" or the Jenkins core libraries to compensate for this newly discovered problem.

4. Accept the fact that Visual Studio users will likely never use Jenkins versions that include this "new feature", forcing them to use versions of Jenkins that predate this change.

timja commented 10 years ago

leedega:

Aside
I probably should say that I truly believe the real root cause of this problem is an underlying architectural issue with Visual Studio and its use of this pdbsrv process in their newer compilers, but numerous forum posts and bug reports to Microsoft appear to fall on deaf ears (i.e. they claim it's working this way by design). Given the fact that this has been a problem in Visual Studio for several releases spread across many years, it's unlikely to change any time soon, so you may be forced to compensate for it here in your tool. To do otherwise will simply make it more difficult (and, by extension, less likely) for Visual Studio users to adopt / continue using your tool.

timja commented 10 years ago

danielbeck:

Does this also happen with MSBuild, or only Devenv? Can you switch to the former? What about systems without Visual Studio installed, instead using only MSBuild/Windows SDK?

(I'm not too familiar with Visual Studio projects beyond pressing an F-key to build them, so this might well be a stupid question)

timja commented 10 years ago

leedega:

From what I understand this is a problem with the compiler, which I think is the same compiler used under the hood by both msbuild and devenv, however I have not confirmed first hand the same problems arise in both situations. I'd be surprised if they didn't.

As for building our projects without Visual Studio, with just MSBuild / Windows SDK, we have as of yet been unable to do so. We have heavy dependencies on MFC which hasn't, until recently, been available outside of Visual Studio. Plus we have had numerous technical issues migrating to the newer versions of the SDK / MSBuild that do include them. Regardless, again I'd be surprised if any of this made any difference unless the compiler that ships with the SDK is fundamentally architecturally different than the one that ships with VS.

If I can spare some time to confirm a few of these details I'll let you know, even if just for curiosity's sake.

timja commented 10 years ago

sweavo:

Solution 5: Don't run MsBuild projects in parallel.

Before I built the Python workaround, that's what I did, using a throttling plugin. It works fine: pdbsrv gets killed at the end of each build and is started afresh by the Microsoft toolchain on the next job. But if you are trying to do continuous builds on development branches, this won't have the capacity to keep up.

Solution 6: Set BUILD_ID to hide pdbsrv from the processtreekiller. Live with the chance that once in a while pdbsrv might time out mid-build.

timja commented 10 years ago

leedega:

> Solution 5: Don't run MsBuild projects in parallel.

That may be fine for small projects but not for larger ones. For example, our main codebase is configured with about 40 jobs per configuration to build each "tier" or "module" in our codebase more efficiently - running jobs in parallel whenever possible. Doing so reduced our "clean" build times from 14 hours to 3. Numbers like that are hard to argue against.

Are there other ways we could achieve similar results? Possibly, but they all require time and effort (aka: money) which we do not have.

> Solution 6: Set BUILD_ID to hide pdbsrv from the processtreekiller. Live with the chance that once in a while pdbsrv might time out mid-build.

Could you clarify what you are referring to here? I assume you mean something other than using your python script since that was the very first "potential fix" I had mentioned above.

It has been my experience that so long as you leave Visual Studio to its own internal details to manage pdbsrv, it works reliably for extended periods, keeping the service alive when needed and terminating it safely when it isn't, even if you run multiple builds in parallel via Jenkins. In fact that is what we do now and it never causes problems with our builds. This is saying something considering the size and scale of our build farm, with hundreds of jobs spread across nearly a dozen servers, all running 24/7!

timja commented 10 years ago

danielbeck:

Maybe the following workaround would work: If mspdbsrv.exe runs as the user launching devenv, you could create a whole bunch of slaves all running on the same machine, but as different users, each having a single executor.

timja commented 10 years ago

leedega:

Seems a bit heavy. The extra overhead of running multiple agents alone seems like it would be significant, let alone the complexities involved with having multiple user profiles being used, all of which would need to have a consistent configuration to ensure the agents all behave the same, not to mention managing security and permissions and whatnot. Given that each of our agents currently runs with between 4 and 6 executors, that would increase our agent count by the same factor.

Also, this would make managing overall load on a given system more complex. Consider, as an example, jobs that are configured to use 100% of the agent's resources to prevent parallel build problems. These would need to be configured to work across agents somehow. I'm not even sure that is possible....

timja commented 10 years ago

zorbathut:

I looked into the difficulty of adding a "process whitelist" for processes that must not be killed. It would require some changes to winp but it's the only workable solution, besides "disable process killing for this entire task", which can, itself, cause build failures.

Unfortunately, because the necessary changes have to span two projects, it'll be a bit of a large task without cooperation from everyone involved.

> It has been my experience that so long as you leave Visual Studio to it's own internal details to manage pdbsrv it works reliably for extended periods, keeping the service alive when needed and terminating it safely when it isn't, even if you run multiple builds in parallel via Jenkins. In fact that is what we do now and it never causes problems with our builds. This is saying something considering the size and scale of our build farm, with hundreds of jobs spread across nearly a dozen servers, all running 24/7!

Unfortunately I've found this isn't the case - there seem to be situations where mspdbsrv times out mid-build and is restarted cleanly, and if that doesn't happen within a BUILD_ID replacement block, then when the restarting build finishes, Jenkins will happily kill mspdbsrv and break other builds.

I suspect "running 24/7" is why you're not seeing this - it's happening somewhat frequently on a much smaller farm of mine with much fewer jobs.

timja commented 10 years ago

leedega:

> I suspect "running 24/7" is why you're not seeing this - it's happening somewhat frequently on a much smaller farm of mine with much fewer jobs.

That is totally possible. Running so many jobs in parallel so often it is probably a rare condition that no jobs are running at all on any given server on our farm, and this may be preventing the service from timing out.

Thanks for pointing that out.

timja commented 10 years ago

ajomaa:

I am very new to the Jenkins world, and I am running into this issue a lot. This would be a show stopper for us when it comes to adopting Jenkins for our build processes. Our builds get manually triggered by many users at random times. We could have 20 or more builds running at the same time, all in parallel. I tried the Python script given by Steve Carter in an Execute Shell command box, but I get an error saying that some "sh" -ex was not found. What gives? I thought I was running a Python script, not Linux. Or do both need to run on Linux?

In short, if I do not get this resolved, we will have to go back to our previous way of building.
Has anyone solved this issue yet?

Thank you,

timja commented 10 years ago

danielbeck:

Tony: Please address requests for assistance to the jenkinsci-users mailing list, or #jenkins IRC channel on Freenode.

timja commented 10 years ago

kerrhome:

I just ran into this one for the first time as far as I can tell. I did a quick look back and see no other instances and I don't recall seeing this before. For now, I'll take no action. danielbeck or anyone else, please let me know if I can provide you with any information that could help in resolving this. Build env where we saw this error: MS Win 7 x64, VS2010

timja commented 10 years ago

kerrhome:

I hit three more instances of this. Two yesterday and one other a week ago.

timja commented 10 years ago

danielbeck:

How are you starting these builds? Batch? MSBuild plugin? What exact commands? If batch, did you try setting BUILD_ID as described on https://wiki.jenkins-ci.org/display/JENKINS/ProcessTreeKiller ?

timja commented 10 years ago

kerrhome:

Batch. In the Jenkins project, we use "Execute Windows Batch Command" to call a batch script that automates a bunch of pre build work and ends up calling the builds via devenv.

I did not try the BUILD_ID suggestion, as I saw that there were still issues mentioned in this ticket with this workaround. I was trying to hang in there until the final solution was provided, but the failures seem to be picking up for us lately. I guess we'll use this workaround for now.

timja commented 10 years ago

kerrhome:

I am trying the BUILD_ID suggestion now, but this is a hack (right?) and not a final solution? The final solution is to have Jenkins not kill specified processes like mspdbsrv.exe. Whether that is a whitelist managed by the user or hardcoded by Jenkins for now doesn't matter to me. Hopefully there will be a long-term fix to Jenkins for this.

timja commented 10 years ago

delboyjay:

Does anyone know if there is an option to stop Jenkins from killing processes completely as a global option instead of having to add the BUILD_ID to every single job? I have tried adding this as an env variable at the node level but it doesn't appear to give the desired results (same PDB errors were still occurring), maybe I'm doing something wrong or misunderstanding how this is working under the hood?

We were running CruiseControl for years and never had this problem, but we did have issues where processes were not terminating properly and builds would run forever until someone intervened. Sometimes we still get this with Jenkins, so from my point of view one problem is better than two, and I'd rather just have an option to tell Jenkins not to force-terminate anything, ever. If this cannot be done with the current version (ours is 1.566), can we at least add a check box that says "Do not auto-terminate processes" as an option in a future release and let the user decide?