microsoft / azure-pipelines-tasks

Tasks for Azure Pipelines
https://aka.ms/tfbuild
MIT License
3.47k stars 2.6k forks source link

Visual Studio Tests just stop working (build still running) #7706

Closed LokiMidgard closed 5 years ago

LokiMidgard commented 6 years ago

Environment

TFS 2018 with private agent version 2.134.2 on Windows 7

Issue Description

Test execution of Visual Studio Test task just stops after some time. I run multiple test which also log on the local machine including the start and end of the test. When I cancled the test the last entry was around half an hour old. The test agent was still running printing 2018-07-08T13:55:53.5722684Z Test run '22639' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 0..

I havn't seen an pettern when the tests stops, but I haven't seen anyone running to the end.

Error logs

TestRun.log

LokiMidgard commented 6 years ago

I think I at least found a workaround with batching of tests as sugestet in this issue.
I will most likly be able to tell if that worked after the weekend.

jayaranigarg commented 6 years ago

@LokiMidgard : Any updates on this?

LokiMidgard commented 6 years ago

Over the Weekend some tests run. One build running only on one machine completed, which never did before. But one test running on two different machines faild.

In the logs for the failing build, nothing new happens after this statement:

2018-07-13T13:30:29.9277477Z Test run '22647' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 23.

only new statements with 'InProgress'. This goes on untill following lines that ends the tests:

2018-07-13T18:53:06.4120131Z Test run '22647' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 23.
2018-07-13T18:53:37.5178340Z Test run '22647' is in 'Aborted' state with 'Total Tests' : 228 and 'Passed Tests' : 2
2018-07-13T18:54:43.4889856Z ##[error]Test run is aborted. Logging details of the run logs.
2018-07-13T18:54:43.7832536Z ##[error]New test run created.
2018-07-13T18:54:43.7833044Z Test Run queued for Project Collection Build Service (DefaultCollection).
2018-07-13T18:54:43.7833629Z Test Run queued for Project Collection Build Service (DefaultCollection).
2018-07-13T18:54:43.7833960Z ##[error]Test execution started. Test run id : 22647
2018-07-13T18:54:43.7834574Z ##[error]Test run timed out. Time out : 21600000 ms. Test run id : 22647
2018-07-13T18:54:43.7835437Z ##[error]Test run aborted. Test run id: 22647
2018-07-13T18:54:43.7836347Z ##[error]System.Exception: The test run was aborted, failing the task.

The same goes for the other agent on the seccond machine:

2018-07-13T14:36:27.7563044Z Test run '22646' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 54.

to

2018-07-13T18:52:06.7037917Z Test run '22646' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 54.
2018-07-13T18:52:36.7190738Z Test run '22646' is in 'InProgress' state with 'Total Tests' : 228 and 'Passed Tests' : 54.
2018-07-13T18:53:06.7497472Z Test run '22646' is in 'Aborted' state with 'Total Tests' : 228 and 'Passed Tests' : 54.
2018-07-13T18:53:37.4211162Z ##[error]Test run is aborted. Logging details of the run logs.
2018-07-13T18:53:37.4679906Z ##[error]New test run created.
2018-07-13T18:53:37.4679906Z Test Run queued for Project Collection Build Service (DefaultCollection).
2018-07-13T18:53:37.4679906Z Test Run queued for Project Collection Build Service (DefaultCollection).
2018-07-13T18:53:37.4679906Z ##[error]Test execution started. Test run id : 22646
2018-07-13T18:53:37.4679906Z ##[error]Test run timed out. Time out : 21600000 ms. Test run id : 22646
2018-07-13T18:53:37.4679906Z ##[error]Test run aborted. Test run id: 22646
2018-07-13T18:53:37.4836154Z ##[error]System.Exception: The test run was aborted, failing the task.

The build runns in 3 phases. The first Phase running on a general Testmachine restoring the vm's that will be testet to a snapshop. The seccond phase then run as multi-configuration phase on the agents deployed on the test vm's. In this phase the tests will stop Working.

I try to find out, if it fails always with the same test (when it fails). But unfortunatly the execution order is not deterministic and I only see which test did run after it run.

I'll try to find out more.

abhishkk commented 6 years ago

@LokiMidgard It seems like you are running codedui tests and one of the test is stuck. As the default build timeout is 6 hours, the task fails after 6 hours. Test which is getting stuck can be because of incorrect code or because of some UI popup blocking the test execution (or various other possible reasons.)

we can set test timeout so that the test which is getting stuck will fail once timeout reaches and thus allowing execution for remaining tests.

From logs, it seems like following is your test run environment:

  1. CodedUI tests are being ran.
  2. Tools installer package is used.
  3. Runsettings file is used. Within this runsettings there is an embedded testsettings file in MSTest node.

So you can set testTimeout in embedded testsettings. Example:

<?xml version="1.0" encoding="UTF-8"?>
<TestSettings name="TestSettings1" id="f6f04996-9fb1-440c-8263-6cd897c4de4a" xmlns="http://microsoft.com/schemas/VisualStudio/TeamTest/2010">
  <Description>These are default test settings for a local test run.</Description>
  <Deployment enabled="false" />
  <Execution>
    <Timeouts testTimeout="6000" />
  </Execution>
  <Properties />
</TestSettings>

On a separate note, we are deprecating direct and embedded testsettings soon. You can directly use all of settings of testsettings in runsettings under LegacySettings node. We added this support recently and is available in latest tools installer package. Please check this documentation for more details.

LokiMidgard commented 6 years ago

Thank you. I somehow assumed there was a default test timeout.

We are currently no longer using testsettings, because it will be deprecated. Through our build still created the file.

I tried to use the legacySettings but didn't get far, before finding this solution and adding following under the runsettings:

    <MSTestV2>
        <TestTimeout>270000</TestTimeout>
    </MSTestV2>

Now the tests should timeout after 4:30 min.

abhishkk commented 6 years ago

@LokiMidgard You are running CodedUI tests. CodedUI tests doesnt go via MSTestV2 adapter. They run via MSTest v1 adapter. Thus setting TestTimeout in MSTestV2 node should not work. You can try the solutions which I have mentioned above.

We are currently no longer using testsettings, because it will be deprecated. Through our build still created the file.

Logs shared by you suggests that yu are using embedded testsettings in runsettings (check runsettings > MSTest > SettingsFile node)

LokiMidgard commented 6 years ago

Well It did work.
Maybe because we are not using UITest Framework. But doing everything with AutomatisationElement. The <IsCodedUITest> element in the .csproj file is also specified with false.

Logs shared by you suggests that yu are using embedded testsettings in runsettings (check runsettings > MSTest > SettingsFile node)

The Output in the log is missleading. the start of the runsettings is omitted. So you do not see that this was commented out.

abhishkk commented 6 years ago

@LokiMidgard Got it. Glad to hear that setting test level timeout is working.

LokiMidgard commented 6 years ago

Unfortunatly this problem seems only to happen less often when batching the tests :( From the runsettings a test should timeout after 4:30 minutes, but it only prints the statusline of how many tests passed for houres untill the 4.5 houres timeout of the complete test are reached, if this problem occures.

I haven't yet found out, if actually a test is still executing or if this happens between tests and that's why testtimeout does not apply.

I also have tried using legecy settings, but this faild because the test assemblys are not found.

My runsettings look like this:

 <?xml version='1.0' encoding='utf-8'?>
 <!-- File name extension must be .runsettings -->
 <RunSettings>
     <!-- Configurations that affect the Test Framework -->
     <RunConfiguration>
         <!-- Path relative to solution directory -->
         <ResultsDirectory>.\TestResults</ResultsDirectory>

         <!-- [x86] | x64  
       - You can also change it from menu Test, Test Settings, Default Processor Architecture -->
         <TargetPlatform>x86</TargetPlatform>

         <!-- Framework35 | [Framework40] | Framework45 -->
         <TargetFrameworkVersion>Framework40</TargetFrameworkVersion>

         <TreatTestAdapterErrorsAsWarnings>false</TreatTestAdapterErrorsAsWarnings>
         <CollectSourceInformation>true</CollectSourceInformation>
         <TestAdaptersPaths>C:\drop\CustomDataCollectors</TestAdaptersPaths> 
         <!-- 0 = As many processes as possible, limited by number of cores on machine, 1 = Sequential (1 process), 2-> Given number of processes up to limit by number of cores on machine-->
        <MaxCpuCount>1</MaxCpuCount>
        <TestSessionTimeout>16200000</TestSessionTimeout>

    </RunConfiguration>

      <DataCollectionRunSettings>
    <DataCollectors>
<DataCollector uri='datacollector://microsoft/VideoRecorder/1.0' assemblyQualifiedName='Microsoft.VisualStudio.TestTools.DataCollection.VideoRecorder.VideoRecorderDataCollector, Microsoft.VisualStudio.TestTools.DataCollection.VideoRecorder, Version=15.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' friendlyName='Screen and Voice Recorder'>
          <Configuration>
            <MediaRecorder sendRecordedMediaForPassedTestCase='false' 
              xmlns='' />
          </Configuration>
        </DataCollector>
      <DataCollector friendlyName='LogFileCollector' uri='datacollector://SEEBURGER/LogFileCollector' />
    </DataCollectors>
  </DataCollectionRunSettings>

  <LegacySettings>
    <Execution>
      <Timeouts testTimeout='270000' />
    </Execution>
  </LegacySettings>

    <!-- MSTest adapter -->
    <MSTest>
        <ForcedLegacyMode>false</ForcedLegacyMode>
        <IgnoreTestImpact>true</IgnoreTestImpact>
        <MapInconclusiveToFailed>True</MapInconclusiveToFailed>
        <CaptureTraceOutput>false</CaptureTraceOutput>
        <DeleteDeploymentDirectoryAfterTestRunIsComplete>False</DeleteDeploymentDirectoryAfterTestRunIsComplete>
        <DeploymentEnabled>False</DeploymentEnabled>
    </MSTest>
    <MSTestV2>
        <TestTimeout>270000</TestTimeout>
    </MSTestV2>
</RunSettings>
prawalagarwal commented 6 years ago

@LokiMidgard Can you set system.debug as true in the build variables and provide us the entire logs zip? You can send it to devops_tools@microsoft.com

LokiMidgard commented 6 years ago

@prawalagarwal I've tried to send you the zip file via mail. But I recived an Delivery has failed to these recipients or groups:. I hope you still recived the files.

I also was able to force legacy mode but the result were the same.

prawalagarwal commented 6 years ago

@LokiMidgard yes got the logs. Will follow up there. Thanks.

ShreyasRmsft commented 6 years ago

@LokiMidgard can you please change the execution batch size to one and see which test is causing the hang? Also see if it gets stuck at the same tests each time? And also try logging into the agent machine when the test run hangs.

ShreyasRmsft commented 6 years ago

@LokiMidgard any updates on this?

LokiMidgard commented 6 years ago

I have the test currently running. I hope I have some data at the end of the day. The last time the test machines were restored before I could save the data from the machines. Sorry it takes so long.

LokiMidgard commented 6 years ago

I've tried to analize the log files of my tests.

All but two tests failing tests had the same log entry when they run into the problem. It was the last log entry in TestCleanup

The only statments after this log are:

        Thread.Sleep(5000); // screencam should record end of test

            if (outlookCrashed && TestContext.CurrentTestOutcome == UnitTestOutcome.Passed)
                Assert.Fail("Outlook crashed");

In one other Log the last test did log from 07.09.2018 12:13:37.4518to 07.09.2018 16:13:22.4748. I'm pretty sure thats an problem in our test that run infinite. But I wished the test timeout would work.

LokiMidgard commented 6 years ago

I'm not sure if the last information is helpfull. Running multiple builds with logging did blow the Content database untill no more space was available. So it could be possible, that the problem in this case was actually no more space.

I deleted the builds, if I understand it correctly, the database will purge all deleted files once per day. If this has freed the hard disk tomorrow I will start some new builds and look if they show the same result.

ShreyasRmsft commented 6 years ago

@LokiMidgard like you said seems like an issue with the test hanging but from our side we should be honoring the test case timeout. I'm working to see if i can reproduce the issue locally. Also if possible can you create a simple app where a test case hang happens and timeout does not occur with the same runsettings? Want to see if it is specific to your code or whether it is a generic issue.

LokiMidgard commented 6 years ago

I'm still tiding up the database, and shrinking takes some time.

I'm not sure if I can provide you with a Sample this week.

ShreyasRmsft commented 6 years ago

@LokiMidgard that's alright we'll try everything from our end to see if we can repro and identify the problem. If not we'll wait round till you are free enough to try out a repro. Thank you for your co-operation in trying to identify and fix the issue.

jayaranigarg commented 5 years ago

@LokiMidgard : Any updates here?

LokiMidgard commented 5 years ago

I have run multiple test over the weekend, with an minimal test that only opend Outlook and sends an mail. And Repeting that test around 200 times.

Up till now 24 test runs were started (the last ones should finish in the next hours). I have ony seen one time where the test did hang.

The next I wan't to try is to create tests that just have while(true) Thread.Sleep(30000); And check if that will prevent the test from abort. I also try to force the ui to freeze and see if the blocking FindFirst will result in the problem.

LokiMidgard commented 5 years ago

The build I started yesterday, seems to actually timeout regulary on all test machines. (still running, but I have multiple timeouts) So I assume that a "normal" Thread.Sleep is not able to reproduce this behaviour.

I will look at some other ways to block the code.

LokiMidgard commented 5 years ago

I noticed that there are a huge number of conhost and vstest.consle processes running. In one case thate where 197 and 189 processes. Is that expected?

grafik

LokiMidgard commented 5 years ago

When removing code I misconfigured the log file so I didn't notice directly, but on at least one Image where the tests are still running the last log entry is from 05:01:54.1532 (currently 9:17 AM on the image)

I need to reconfigure the the logger, so it will create better logs (again). Hope I will have more information later or in two days.

LokiMidgard commented 5 years ago

Still working on this, but didn't get more information.

LokiMidgard commented 5 years ago

Ok, I was able to reproduce it the problem on all my test images.

At least on problem I have that it seems I can get a timeout in ClassInitilize

[ClassInitialize()]
public static void MyClassInitialize(TestContext testContext)
{
    while (true)
        Thread.Sleep(Timeout.Infinite);
}
ShreyasRmsft commented 5 years ago

Hmm, this is very interesting. Lemme try it out.

LokiMidgard commented 5 years ago

I will try to wrap all code in Test/Class - Initilize/Cleanup in following snippet:

var cancel = new CancellationTokenSource();
cancel.CancelAfter(TimeSpan.FromMinutes(10));
var t = System.Threading.Tasks.Task.Run(() =>
{
    // Actual code
}, cancel.Token);
t.Wait(cancel.Token);
if (t.IsCanceled)
    throw new TaskCanceledException(t);

Hope that actually helps.

LokiMidgard commented 5 years ago

@ShreyasRmsft Any luck reproducing it?

ShreyasRmsft commented 5 years ago

@LokiMidgard got pulled in by some other work, didn't get time to try this out.

Will ask someone else to take a look.

vagisha-nidhi commented 5 years ago

Hi @LokiMidgard We are able to reproduce this on our end. We are tracking this as a separate issue regarding timeout in ClassInitialize here.

LokiMidgard commented 5 years ago

Ok

I close this issue then and subscripe to the other one.