microsoft / testfx

MSTest framework and adapter

Validate the state of test nodes to help adapters always give the expected information to extensions/data consumers #3759

Open MarcoRossignoli opened 1 week ago

MarcoRossignoli commented 1 week ago

Ref https://github.com/microsoft/testfx/issues/3478#issuecomment-2329659440

At the moment "the platform" is nothing more than a IPlatformOutputDevice that collects an print in console or in server mode sent states outside. We don't have validation on what adapters are sending. We could think to add a "strict" verification of the state of every test node at least from the perspective of the lifecycle expected by the platform "at minimum"(you can push what you want if other extensions understand your shared model, like trx), like "discovered(non mandatory) -> in-progress(not mandatory) -> executionstate(mandatory)" rejecting with an exception if you push out of this expected state. cc: @Evangelink
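A minimal sketch of what such a strict check could look like, modeled as a per-node state machine (all names here are illustrative, not the actual platform API):

```csharp
using System;
using System.Collections.Generic;

// Illustrative lifecycle: Discovered (optional) -> InProgress (optional) -> final state (mandatory).
public enum NodeState { None, Discovered, InProgress, Passed, Failed, Skipped }

public sealed class NodeLifecycleValidator
{
    private readonly Dictionary<string, NodeState> _states = new();

    public void OnStateReported(string nodeUid, NodeState next)
    {
        var current = _states.TryGetValue(nodeUid, out var s) ? s : NodeState.None;
        var valid = next switch
        {
            NodeState.Discovered => current is NodeState.None,
            NodeState.InProgress => current is NodeState.None or NodeState.Discovered,
            NodeState.Passed or NodeState.Failed or NodeState.Skipped =>
                current is NodeState.None or NodeState.Discovered or NodeState.InProgress,
            _ => false,
        };

        if (!valid)
            throw new InvalidOperationException(
                $"Test node '{nodeUid}' moved from '{current}' to '{next}', outside the expected lifecycle.");

        _states[nodeUid] = next;
    }
}
```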

bradwilson commented 1 week ago

Just so that it's documented here: I raised the question of whether it's "illegal" (or, more specifically, troublesome) for the platform if xUnit.net sends an "in-progress" message and then never gets around to sending a result message, which in part inspired this issue.

Our generalized use case is: test framework (which is extensible) says "I'm starting to run this test", and then says "I've decided not to run this test" (aka, its result state is "NotRun"). This is contractually one of the result states that we support, in addition to the more traditional "Passed", "Failed", and "Skipped".

As the runner (which is separate from the test framework), we receive these messages at independent times: an ITestStarting, followed later by one of our result messages, and lastly by ITestFinished. We map ITestStarting to the in-progress test node, and then we map the more traditional results (ITestPassed, ITestFailed, ITestSkipped) into the appropriate "execution state" nodes. There is nothing to map ITestNotRun onto in the Microsoft Testing Platform model. (The finished message is not used in this scenario.)
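A condensed sketch of that mapping (the xUnit.net message names referenced above are real; the record types and string states below are simplified stand-ins for the actual xUnit.net and Microsoft Testing Platform types):

```csharp
// Minimal stand-ins for the xUnit.net v3 runner messages (the real interfaces
// are ITestStarting, ITestPassed, ITestFailed, ITestSkipped, ITestNotRun, ITestFinished).
sealed record TestStarting(string TestUniqueId);
sealed record TestPassed(string TestUniqueId);
sealed record TestFailed(string TestUniqueId);
sealed record TestSkipped(string TestUniqueId, string Reason);
sealed record TestNotRun(string TestUniqueId);
sealed record TestFinished(string TestUniqueId);

static class NodeStateMapper
{
    // Which MTP "state" node (if any) a runner message maps to.
    public static string? MapToMtpState(object message) => message switch
    {
        TestStarting => "in-progress",
        TestPassed   => "passed",
        TestFailed   => "failed",
        TestSkipped  => "skipped",
        TestNotRun   => null, // no MTP equivalent: publish nothing, so Test Explorer keeps "not run"
        TestFinished => null, // not used for state mapping in this scenario
        _            => null,
    };
}
```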

In practical terms, the "in-the-box test framework" (XunitTestFramework) designates when this situation will happen: if a test is marked as Explicit = true and we're running in ExplicitMode = off (the default), or if a test is marked as Explicit = false (the default) and we're running in ExplicitMode = only. But the runner cannot make these assumptions, because the test framework is a user extensibility point, and in fact it's only the test framework that knows anything about what defines a test at all. Our runner is, by design, as much in the dark about the inner workings of the test framework as Microsoft Testing Platform is in the dark about the inner workings of us.

So my original concern was: Is it an error condition to send Microsoft Testing Platform an "in progress" node for a test, and then never send the "execution state" node for that test?

The additional data point that made me ask that is this: Microsoft Testing Platform does not have a "not run" execution state, but the Test Explorer UI most definitely has a "not run" execution state (it's the state all tests are in before they've been run once). Using TPv2 today, we preserve this "not run" execution state successfully; using Microsoft Testing Platform, the only way for us to preserve the "not run" execution state is to simply not send Microsoft Testing Platform an "execution state" node. Sending any of the three legal "execution state" nodes would mark the test visually as either passed, failed, or skipped; none of those is technically correct. So we just don't send anything, in an attempt to preserve the proper Test Explorer UX.

Anyway, that's the background from the xUnit.net perspective.


Here is our command line experience vs. the MTP command line experience with respect to not-run tests:

```
$ .\src\xunit.v3.assert.tests\bin\Debug\net472\xunit.v3.assert.tests.exe -explicit only
xUnit.net v3 In-Process Runner v0.3.0-pre.25-dev+759dee5cf8 (64-bit .NET Framework 4.8.9261.0)
  Discovering: xunit.v3.assert.tests (method display = ClassAndMethod, method display options = None)
  Discovered:  xunit.v3.assert.tests (912 test cases to be run)
  Starting:    xunit.v3.assert.tests (parallel test collections = on [24 threads], stop on fail = off, explicit = only, seed = 1091710143, culture = invariant)
  Finished:    xunit.v3.assert.tests
=== TEST EXECUTION SUMMARY ===
   xunit.v3.assert.tests  Total: 1056, Errors: 0, Failed: 0, Skipped: 0, Not Run: 1056, Time: 0.135s

$ .\src\xunit.v3.assert.tests\bin\Debug\net472\xunit.v3.assert.tests.exe --explicit only
xUnit.net v3 Microsoft.Testing.Platform Runner v0.3.0-pre.25-dev+759dee5cf8 (64-bit .NET Framework 4.8.9261.0)

Test run summary: Zero tests ran - src\xunit.v3.assert.tests\bin\Debug\net472\xunit.v3.assert.tests.exe (net472|x64)
  total: 0
  failed: 0
  succeeded: 0
  skipped: 0
  duration: 371ms
```

MarcoRossignoli commented 1 week ago

Thanks @bradwilson for the sample

nohwnd commented 1 week ago

I can come up with many arguments for and against this approach. So I am undecided. :)

Having a "strict" mode where we can account for the amount of tests is nice, but as you describe we don't know about how tests are being executed. We can only assume the most basic things like: a failed test means a failed run, but even that falls apart when we introduce test re-runs.

Similarly, for tests bound to dynamic data (like a query from a db), the count can differ per run. So this only works if we send the list of tests to run and expect to get all of them back, but that still has downsides (as we can see in VS Test Explorer, which adopted this approach some time ago): you have to serialize and send all the test cases, account for dynamic-data tests that might not be the same as before, and account for tests that were added in the meantime.
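As a sketch of that instability (a hypothetical test, not one from this thread), a theory whose rows come from an external file can legitimately produce a different test count on every run:

```csharp
using System.Collections.Generic;
using System.IO;
using Xunit;

public class CsvDrivenTests
{
    // The row count depends on the file's current contents, so two runs
    // (or discovery vs. execution) can disagree about how many tests exist.
    public static IEnumerable<object[]> Rows()
    {
        foreach (var line in File.ReadLines("cases.csv"))
            yield return new object[] { line };
    }

    [Theory]
    [MemberData(nameof(Rows))]
    public void ParsesLine(string line) => Assert.False(string.IsNullOrEmpty(line));
}
```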

bradwilson commented 1 week ago

> Similarly, for tests bound to dynamic data (like a query from a db), the count can differ per run.

Unstable data sets are especially problematic for pre-enumerated data, so frequently commented on that I wrote a FAQ page about it. πŸ˜‚ https://xunit.net/faq/theory-data-stability-in-vs
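For context, the kind of pattern that FAQ warns about, plus the usual mitigation, sketched as a hypothetical test (DisableDiscoveryEnumeration is the real xUnit.net switch; the rest is illustrative):

```csharp
using System;
using System.Collections.Generic;
using Xunit;

public class UnstableDataTests
{
    // Each enumeration produces different values, so the test cases discovered
    // up front would never match the ones that actually execute.
    public static IEnumerable<object[]> RandomRows()
    {
        var rng = new Random();
        for (var i = 0; i < 3; i++)
            yield return new object[] { rng.Next() };
    }

    // DisableDiscoveryEnumeration folds the theory into a single test case at
    // discovery time, sidestepping the unstable pre-enumeration problem.
    [Theory]
    [MemberData(nameof(RandomRows), DisableDiscoveryEnumeration = true)]
    public void HandlesValue(int value) => Assert.True(value >= 0);
}
```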

bradwilson commented 1 week ago

> (as we can see in VS Test Explorer, which adopted this approach some time ago)

Does this mean my page is no longer necessary? I haven't had anybody complain about unstable data sets in quite a while, so maybe the problem disappeared and I never noticed. πŸ˜‚ Until TestDriven.NET decided to stop shipping versions, that was my preferred test runner, so my personal use of Test Explorer is fairly recent.

nohwnd commented 1 week ago

> We believe there are two potential fixes for this. In our mind, the better choice would be for Visual Studio to use the "run just these tests" API when the user asks to "Run All", passing the list of previously discovered tests back to the test framework. This API already exists, and frameworks already have to implement it, so there should be no realistic downside to make this fix.

Yes, that is how it currently works in VS with vstest: you send a full list of previously discovered tests to the runner to run them. In your case you serialize the parameters and receive them back so tests run "as before", and re-running the tests will be deterministic, even though it will be outdated if the underlying source of data is dynamic.

Testing Platform, imho, does not do this, and will re-run everything with a filter, preferring the data being up to date over the tests that TE already knows about being re-runnable.

bradwilson commented 6 days ago

The new filter will end up skipping new data as well as data that no longer exists, because having only the test case ID and not the serialized test case means we must perform a rediscovery. So unfortunately we will end up in the same scenario as "run all": new data ignored (though at least not run), and old data that no longer exists will not be run despite ostensibly being in the requested filter.
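A sketch of that consequence, assuming a run request that carries only previously discovered UIDs (simplified stand-in types, not the actual MTP request shapes):

```csharp
using System.Collections.Generic;
using System.Linq;

sealed record TestCase(string Uid, string DisplayName);

static class UidFilterRun
{
    // Rediscover everything, then keep only the tests whose freshly computed UID
    // was in the request. A row that no longer exists matches nothing (so it cannot
    // run), and a brand-new row is excluded because its UID was never requested.
    public static IEnumerable<TestCase> Select(
        IEnumerable<TestCase> rediscovered, IReadOnlySet<string> requestedUids) =>
        rediscovered.Where(test => requestedUids.Contains(test.Uid));
}
```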

MarcoRossignoli commented 6 days ago

> Yes, that is how it currently works in VS with vstest: you send a full list of previously discovered tests to the runner to run them. In your case you serialize the parameters and receive them back so tests run "as before", and re-running the tests will be deterministic, even though it will be outdated if the underlying source of data is dynamic.

At the moment we don't do that: we decided to rely only on the concept of a test id and avoid all the round-tripping of serialized data, in favor of a "hash id" handled by the adapter, for performance (serialization, deserialization, network, and storage) and simplicity reasons.

The adapter should provide a unique id for whatever it decides is "a test". I also think that a "deterministic" run (in the sense of freezing the parameters) is not ideal: if I change my tests and I run all, I'd like to see all my tests run, including the new ones, so "run all" should really run all and update the UX accordingly. A rebuild should re-discover the new ids. It's up to the adapter to decide whether to fold unstable tests into a common id or give them different ids (a hash based on the args), and if a no-longer-available id is run, the result will be 0 tests run, which the UX should interpret as "there's no test with this id anymore".
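A minimal sketch of such a hash id, assuming the test case identity and its arguments can be rendered to a stable string form (illustrative names, not the actual platform code):

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static class TestIds
{
    // Stable id = hash of the test case identity plus its serialized arguments.
    // Unchanged inputs yield the same id across runs; a changed or removed row
    // yields an id that simply no longer matches anything.
    public static string Compute(string fullyQualifiedTestCase, string serializedArgs)
    {
        var bytes = Encoding.UTF8.GetBytes($"{fullyQualifiedTestCase}|{serializedArgs}");
        return Convert.ToHexString(SHA256.HashData(bytes));
    }
}
```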

In our internal prototype we codegen the [ids, test] dictionary to make it quick to find a single test (although the balance of how much codegen you have to do + the build time + jitting is not always better than plain reflection, where the metadata is already loaded for runtime needs anyway). Let's see if this model will be enough or whether we can think of something different.
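Roughly the shape such generated code could take (entirely hypothetical names; the prototype's actual output is not shown in this thread):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Placeholder test class standing in for user code.
public class CalculatorTests
{
    public Task AddsTwoNumbers() => Task.CompletedTask;
    public Task DividesByNonZero() => Task.CompletedTask;
}

// Hypothetical source-generated lookup: one entry per test id, so a single test
// can be located and invoked without reflection-based discovery.
public static class GeneratedTestMap
{
    public static readonly IReadOnlyDictionary<string, Func<Task>> TestsById =
        new Dictionary<string, Func<Task>>
        {
            ["a1b2c3d4"] = static () => new CalculatorTests().AddsTwoNumbers(),
            ["e5f6a7b8"] = static () => new CalculatorTests().DividesByNonZero(),
        };
}
```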

bradwilson commented 5 days ago

> I also think that a "deterministic" run (in the sense of freezing the parameters) is not ideal

That's an opinion, and not one that's shared by everybody. There are people who believe (and I am one of them) that if we discover 5 tests and then say "run all" with no changes to the source, then those 5 tests should run. In the past, that was not true, because Test Explorer would say "Run All" without context, and if the data set isn't stable, it's impossible to get those 5 tests again.

The only way to rationalize this was to tell people to turn off pre-enumeration of unstable datasets.

> if I change my tests and I run all, I'd like to see all my tests run, including the new ones, so "run all" should really run all and update the UX accordingly

There is a UX difference between "I wrote a new test" or "I changed a test" vs. "I just built my source, I see all the tests, now run them". What you're describing is the former, and what my post was written about was the latter.

For your scenario, this is really up to Test Explorer to do the right thing here with "run all" by making tests that no longer exist disappear, and showing the results for new tests that it wasn't informed about ahead of time. As previously noted, simply doing a build + discover + run all isn't going to be sufficient when datasets aren't stable (not to mention the re-discover pass, aside from being ineffective, would also be a waste of time).

I will say that even now with TPv2, using TestExplorer.RunAllTestsInContext to run a test I just wrote, or a theory where I changed/added/removed data, is very hit or miss: sometimes it runs correctly, sometimes it runs partially, and sometimes it doesn't run at all.

This is behavior we can't control. Test Explorer is the one that needs to behave better here.

> A rebuild should re-discover the new ids.

It does. But with unstable datasets, discovery will hand you back something ephemeral: you can't run it unless you give me back the serialized data set. This part does at least work with TPv2, and I'm concerned that with MTP this is going to be a regression for people: they won't be able to look at a single data row test in Test Explorer and know for certain that they'll be able to run it unless they know the dataset is stable.

> It's up to the adapter to decide whether to fold unstable tests into a common id or give them different ids (a hash based on the args)

xUnit.net does not (and cannot) know that data is unstable. Only the developer who writes the data source can know this. That's why we tell people to turn off pre-enumeration with unstable data because of the way Test Explorer has been broken with this in the past. It sounds like we will continue to have to make this recommendation because the design of MTP makes it impossible for us to present an unstable data row to Test Explorer and allow the user to run it.

MarcoRossignoli commented 5 days ago

> That's an opinion, and not one that's shared by everybody. There are people who believe (and I am one of them) that if we discover 5 tests and then say "run all" with no changes to the source, then those 5 tests should run. In the past, that was not true, because Test Explorer would say "Run All" without context, and if the data set isn't stable, it's impossible to get those 5 tests again.

This is true, but is it correct? Suppose I have a database/csv: I discover and I get 5 tests, then I go to the data source, remove 2, and I run all, and 5 will run. Is it correct to say that I'm running the correct set of expected tests? Why do you think the "set of inputs" is not part of the tests? I see the dataset and the source at the same level here: a change in one or the other could break my test assertions. I could remove one test and a bug could manifest because with the old set I did some mess with "statics" (suppose a code bug) and I'm affecting the other tests.

> For your scenario, this is really up to Test Explorer to do the right thing here with "run all" by making tests that no longer exist disappear, and showing the results for new tests that it wasn't informed about ahead of time. As previously noted, simply doing a build + discover + run all isn't going to be sufficient when datasets aren't stable (not to mention the re-discover pass, aside from being ineffective, would also be a waste of time). ... This is behavior we can't control. Test Explorer is the one that needs to behave better here.

I think the UX should behave better in this case, with different icons to notify that something changed. But running a "temporary" snapshot, to me, is not running what will run when I open the PR or in CI.

> It does. But with unstable datasets, discovery will hand you back something ephemeral: you can't run it unless you give me back the serialized data set. This part does at least work with TPv2, and I'm concerned that with MTP this is going to be a regression for people: they won't be able to look at a single data row test in Test Explorer and know for certain that they'll be able to run it unless they know the dataset is stable.

I need to understand the "real gestures" that cause unstable datasets. Because if we're talking about a case where I have a non-static source, like a csv or a database, that I need to expand on the fly and I cannot know simply by using reflection (or source elements) how many cases I have, my feeling is that we're "unstable" only during the "development" of that specific test: until I've added all my tests I iterate on discovery and run (and here, as above, the UX should be better). Or if the input is not stable when "committed", it's not clear to me what the user expects to assert; should we assert only the deterministic cases, or?

> xUnit.net does not (and cannot) know that data is unstable. Only the developer who writes the data source can know this. That's why we tell people to turn off pre-enumeration with unstable data because of the way Test Explorer has been broken with this in the past. It sounds like we will continue to have to make this recommendation because the design of MTP makes it impossible for us to present an unstable data row to Test Explorer and allow the user to run it.

Why can't the adapter expand and generate the id from the hash (testcase+input), mimicking the same behavior as sending back the serialization of the input and deserializing it? In this case the id will always be different, or constant if the "input" hasn't changed. And in the same way, during execution, the id will be used to filter after expansion, or am I wrong? Beyond that, it's fine: it's the "adapter's" decision how to do it, whether to fold everything into 1 test or not, and both choices are legit imo, but I don't see a limit related to identifying testcase+input with an id.

bradwilson commented 5 days ago

> my feeling is that we're "unstable" only during the "development" of that specific test: until I've added all my tests I iterate on discovery and run

I think you're conflating two things:

  1. You manipulate the data source (in source code or the database) in order to introduce or modify test data during the development process.

  2. You acquire data from a source which is inherently unpredictable (randomized, based on current date & time, based on the current state of the file system, based on a database which is changing outside the scope of the development process, etc.).

I have no argument that the first is something that should be expected and embraced. What I want is for Test Explorer to be improved to handle it better so that it doesn't require a manual build & discovery stage just to ensure that all the data rows will be correctly reported.

For the 2nd case, we already tell developers to disable theory pre-enumeration when they know the data is not reliably stable. This will be even more important with MTP than TPv2, because at least with TPv2 users could reliably run individual rows that were discovered; without test case serialization, that capability is no longer available.

I can verify that with MTP and 17.12 Preview 2, Test Explorer still behaves poorly in the face of unstable data sets. If the source code changes, it appears to re-run discovery before "Run All", but if the source does not change, it does not re-perform the discovery, and therefore leaves the outdated data row in place and does not report the new data row at all. That includes data that is random/unpredictable, and I verified the same behavior when reading the data from a data file (marked as "copy if newer", and thus part of the compilation process).

So this goes back to why we tell people to stop pre-enumeration when they know the dataset may be unstable. If Test Explorer can be improved to handle this situation better, then it would certainly be better for users with unstable datasets (and we would be able to soften the language of the documentation page).

> Why can't the adapter expand and generate the id from the hash (testcase+input), mimicking the same behavior as sending back the serialization of the input and deserializing it?

We do that. I'm not sure why you think we don't.

> In this case the id will always be different, or constant if the "input" hasn't changed.

Returning a stable ID in the face of an unchanging data row is already how it works today (with both TPv2 and MTP).

> And in the same way, during execution, the id will be used to filter after expansion, or am I wrong?

This filtering does work when the data has not changed, because the ID is stable. The problem is fundamentally one that has to be fixed in Test Explorer; we can't fix how it handles unexpected situations (like a test being removed and/or a test being added between discovery and execution).

> it's the "adapter's" decision how to do it, whether to fold everything into 1 test or not, and both choices are legit imo

We let the user make that decision, because they're the only ones who can. Otherwise, the only thing we could do would be to disable pre-enumeration entirely, and our users would be very unhappy with this decision. 99%+ of users have stable datasets and we'd be punishing the vast majority of users for the sake of a few.

MarcoRossignoli commented 5 days ago

> We do that. I'm not sure why you think we don't.

I need to do another check. I recall that when I was running the xunit test suite, for one dataset test I was running one test and the results were more than one with the same id; I thought that was this case of folding, but maybe I misread.

> You acquire data from a source which is inherently unpredictable (randomized, based on current date & time, based on the current state of the file system, based on a database which is changing outside the scope of the development process, etc.).

Ok, got it, now I see what you mean. I never saw a case like that in our codebases or in my past; I see it as a bit of an antipattern and I've usually discouraged it. I would expect deterministic dataset values, otherwise it's hard to avoid flakiness or make assertions. But for now I think it's fine: I don't see these cases as being so frequent, and if it becomes a problem we can add a specific "metadata" property that an adapter can add and we send back in an opaque way, like today. Today we're paying that serialization/deserialization + networking + space cost in every case.

> What I want is for Test Explorer to be improved

I agree on this, I'd like too.

bradwilson commented 4 days ago

> I need to do another check. I recall that when I was running the xunit test suite, for one dataset test I was running one test and the results were more than one with the same id; I thought that was this case of folding, but maybe I misread.

Ah, you're talking about the case where we have multiple test results for a single test case. Yes, this most commonly happens when users disable theory pre-enumeration (or the data is not serializable). Because of our need to support TPv2, the serializability requirement still exists in v3, and as previously mentioned, we recommend users disable pre-enumeration for unstable datasets.

Even if we threw away TPv2 compatibility and removed the serializability requirement, we're still stuck (worse now, IMO) with the unstable dataset problem that mandates that we return a single test case that ends up with multiple test results.
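For illustration (a hypothetical test, not from this thread), a theory over non-serializable data is discovered as a single test case but yields one result per row at execution:

```csharp
using System.Collections.Generic;
using Xunit;

public class NonSerializableDataTests
{
    // Arbitrary object instances are not serializable as test-case data, so
    // this theory surfaces as a single test case at discovery time...
    public static IEnumerable<object[]> Services()
    {
        yield return new object[] { new ServiceUnderTest("a") };
        yield return new object[] { new ServiceUnderTest("b") };
    }

    // ...but executing it produces two results, both attached to that one case.
    [Theory]
    [MemberData(nameof(Services))]
    public void ServiceWorks(ServiceUnderTest service) => Assert.NotNull(service);
}

public sealed class ServiceUnderTest
{
    public ServiceUnderTest(string name) => Name = name;
    public string Name { get; }
}
```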

> I see it as a bit of an antipattern and I've usually discouraged it

Personally, I agree, but we don't prevent our users from doing it. πŸ˜‚