Issues with the representation of test hierarchies, result semantics and runtime errors

kernelci / kernelci-core

Core KernelCI tools

https://kernelci.org

GNU Lesser General Public License v2.1

108 stars 98 forks source link

Issues with the representation of test hierarchies, result semantics and runtime errors #2455

Closed r-c-n closed 5 months ago

r-c-n commented 8 months ago

Introduction

One of the most urgent problems to address before we move on with the next features, is the standardization and stabilization of the current API models, definitions and behaviors. Since there are a few tests running now we could get the impression that things are working and that we can move forward and continue developing additional features, but upon deeper inspection of the results, and from the perspective of a real user analyzing and building tools on top of them, it's clear that some things need to be refined.

After some analysis I managed to sum the problems in two parts: modeling of test results and test organization as hierarchies, and modeling of runtime and test fixture errors.

Modeling of test results and test organization as hierarchies

In the current models, there's only one type of Node (Test) to represent a test result. This Test type is used throughout a test hierarchy, that is, the same model is used for "test suite" nodes, "test case" nodes and everything in between.

However, we're finding that we need to model certain things differently depending on the type of node. Specifically, the "result" of a Test node might mean different things depending on whether the node is a "test case" or a "test suite" (and we haven't decided anything about intermediate nodes).

Test cases

This is the most basic case. These will be the leaf nodes of a test node hierarchy, the individual test parts of a suite that can pass or fail (or allow for other result values: skip, etc).

From a user, developer perspective, and for regression detection purposes, these are the results we're interested in. Everything above this level is only a way to group tests together. In other words, for a developer, it's meaningful that a particular test case failed, but it may not be very meaningful that a test suite failed (what does it mean that a test suite failed?).

So the "result" field in these nodes is pretty clear right now. It translates directly to the result of the test case.

There's an exception to this: certain test suites may define test cases that work as sanity checks or fixtures. If these test cases fail it could mean that the test suite failed to run properly or that some requirement wasn't met. For our purposes, we can consider these as test cases to keep the same test suite semantics and translate them to KernelCI directly without transformations.

Test suites

Test suites are collections of test cases that share the same general purpose, testing a single functionality, subsystem or feature.

In KernelCI they're currently being modeled as separate test nodes.

Problems and questions

Test suite nodes and test case nodes are modeled using the same Node type

However, we don't use them for the same purpose, and the meaning of the "result" field could be interpreted differently:

in a test case: the result of the test case
in a test suite: whether the test suite could run completely or not. For instance, if there was a runtime problem that stopped the suite and left some test cases pending to run.

Besides, we're currently storing certain artifacts in the test suite node (such as the test logs) but not in the test cases. Which means that when a user wants to extract information from a test case, he/she also needs to fetch the parent node to get this additional test run data. This isn't well defined.

Solution A

Don't encode test suites as nodes, ie. flatten the hierarchy. All nodes contain a "group" field that can be used to encode the name of the test suite or group, so we could have only test cases in a single level, instead of in a tree structure. All test nodes would then have the full test artifacts. Example:

Instead of having a test suite node like this:

  {
    "id": "65f165aa2330610bb234a70c",
    "kind": "test",
    "name": "kselftest-dt",
    "path": [
      "checkout",
      "kbuild-gcc-10-arm64",
      "kselftest-dt"
    ],
    "group": "kselftest-dt",
    "parent": "65f15f8a9dad8ca7ceb3e2a2",
    "state": "done",
    "result": "pass",
    "artifacts": {
      "lava_log": "https://kciapistagingstorage1.file.core.windows.net/staging/kselftest-dt-65f165aa2330610bb234a70c/log.txt?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"
    },
    "data": {
      "error_code": null,
       ...

and a test case node like this:

  {
    "id": "65f166c92330610bb234aa40",
    "kind": "test",
    "name": "kselftest-dt",
    "path": [
      "checkout",
      "kbuild-gcc-10-arm64",
      "kselftest-dt",
      "kselftest-dt"
    ],
    "group": "kselftest-dt",
    "parent": "65f165aa2330610bb234a70c",
    "state": "done",
    "result": "pass",
    "artifacts": null,
    "data": {
      "error_code": null,
      ...

We could drop the test suite node and have only test cases that contain the full information:

  {
      "id": "65f166c92330610bb234aa40",
      "kind": "test",
      "name": "kselftest-dt",
      "path": [
        "checkout",
        "kbuild-gcc-10-arm64",
        "kselftest-dt"
      ],
      "group": "kselftest-dt",
      "parent": "65f165aa2330610bb234a70c",
      "state": "done",
      "result": "pass",
      "artifacts": {
        "lava_log": "https://kciapistagingstorage1.file.core.windows.net/staging/kselftest-dt-65f165aa2330610bb234a70c/log.txt?sv=2022-11-02&ss=f&srt=sco&sp=r&se=2024-10-17T19:19:12Z&st=2023-10-17T11:19:12Z&spr=https&sig=sLmFlvZHXRrZsSGubsDUIvTiv%2BtzgDq6vALfkrtWnv8%3D"
      },
      "data": {
        "error_code": null,
        ...

Note that we'd have to also encode the possibility of the test case being skipped or not-run because of a missing requirement or a runtime error external to the test. See "Modeling of runtime and fixture errors".

Pros:

Simplified node structure
More straightforward encoding of the same information
Multi-level node hierarchies can be flattened to one single level by encoding the nesting in the "group" field using separators, eg. "suite_a.group_b".

Cons: (none that I can think of)

Unknowns:

I'm not sure how hard it'd be to fit these changes in the current runtimes

The expressiveness of this representation, from the perspective of a user querying the api is the same as the current implementation.

Solution B

Introduce an additional Node type for test suites. This would solve the current problem of searching for test results without getting the results of test suite nodes, and would also give us more freedom to define specific fields for suites and test cases, with different meanings if necesary.

Pros:

Should be easy to adapt the current implementation to this
Cleaner and better defined models

Cons:

Not a general solution (what if a test suite has subgroups?)

Question: can we have all test case nodes created when a test suite runs regardless of whether they got to run or not?

If a test suite failed and some tests didn't get to run, or if some tests were skipped because some precondition wasn't met, having them in the KernelCI DB with the appropriate result value (SKIPPED, NOT-RUN) is valuable information. We can use that for analysis, checking for patterns and finding other problems.

Modeling of runtime and fixture errors

For tests that run on lava, test suites include an additional test case "login" that's not part of the test suite but a representation of the "login" stage of the lava job. If this test case fails, it means that the kernel failed to boot and reach a command line prompt and the tests didn't get to run.

So, for test suites that ran in lava and where the tests didn't get to run, there'll be only one child node in the end ("login"), with state=done and result=fail. In order to check for runtime errors in a suite like this (for instance, "tast-basic-x86-pineview") we can search for nodes where:

name=tast-basic-x86-pineview and group=tast-basic-x86-pineview: these two together mean that the test node represents a test suite (But this isn't consistent for all tests! see for instance kselftest-dt tests: https://staging.kernelci.org:9000/viewer?node_id=65f166c92330610bb234aa40)

and

state=done

and

result=fail and/or there's a failed "login" child node (this is an additional separate query). result=fail in a test suite means that the suite didn't get to run completely, so there will be missing test cases.

For certain tests, there'll be additional test case nodes representing parts of the test pre-requirements, or fixtures (test setup steps) that aren't really test cases. If these fail, again, that means that the test suite couldn't run correctly and there'll be test cases that didn't get to run. Example: os-release and tast-tarball in tast tests: https://staging.kernelci.org:9000/viewer?search=parent%3D65f0f3879dad8ca7ceb3dc54

All of this is currently being modeled by adding test nodes to a test suite hierarchy, which brings two problems:

it mixes the meaning and interpretation of the results: when searching for test results, it's much harder to do any kind of automatic filtering and reasoning, since for every test result we need to check which kind of test it is and, depending on that, do additional searches to check the results of neighbor nodes. This also makes it almost impossible, or at least inviable, to collect certain type of statistics (for instance, how frequently did a certain test fail because of a test suite setup error?) and patterns.
It makes it both harder and less efficient to check and collect results (see the example above), and the process can potentially get more complicated as more test suites with specific semantics are added to the system.

Proposed solution

Encode the information about runtime errors or test suite errors (preconditions, failed fixtures, etc) as Test fields. We're currently using two fields: "error_code" and "error_msg" to do some of this, but the way it's used isn't well defined. We could split this into multiple fields and fill them in in a well specified way:

kernelci_error & kernelci_msg: for errors caused or related to KernelCI infrastructure
runtime_error & runtime_msg: for errors detected at the runtime stage. For instance, lava errors (login, boot, rootfs, network, etc)
test_suite_error & test_suite_msg: for errors related to the test suite. For instance, failed precondition.

r-c-n commented 8 months ago

@nuclearcat , @a-wai , @JenySadadia , @pawiecz I'd like to know your opinion on this. Any comments and corrections are welcome. Sorry about the lenght, I couldn't synthesize it more. I think the "Solution A" listed above and the "Proposed solution" at the bottom could solve the problems if they're viable, but let me know if I'm leaving out important details.

Meanwhile I can try to implement them.

nfraprado commented 8 months ago

Hey, thanks for working on this. Some thoughts from me:

"In other words, for a developer, it's meaningful that a particular test case failed, but it may not be very meaningful that a test suite failed": IMO the value of having the test suite result correspond to the collection of results from its test cases comes down to an easy way to see if any test case failed for the platform, in particular for example if a newly added test case in a suite fails, none of the test cases would be regressing, but the test suite would.
KTAP allows nesting tests so as you mentioned even if the structure is flattened the groups still need to encode the hierarchy
I feel like removing the test suite node would make it impossible to detect some test suite / infrastructure failures. What if the login test from LAVA succeeded, but at some point before printing the first KTAP lines in the output (so possibly before even deciding how many and which tests will be run) an error happens? How would you identify the error and where would you store it if there aren't any incomplete tests besides the test suite?
"can we have all test case nodes created when a test suite runs regardless of whether they got to run or not?" Similarly, I'm not sure how this could be achieved given many times the tests that will be run are decided at run time.

r-c-n commented 8 months ago

Thanks @nfraprado those are very insightful comments and you make some valid points:

IMO the value of having the test suite result correspond to the collection of results from its test cases comes down to an easy way to see if any test case failed for the platform, in particular for example if a newly added test case in a suite fails, none of the test cases would be regressing, but the test suite would.

Good point, I hadn't thought of that use case, and I don't know if there'd be a different way to encode it.

About changing the meaning of "result" in a test suite or group, if we made test suites = pass if all the test cases pass and fail if any of the test cases fail we could generate a coarser-grained test suite regression, but then we'd also have the individual regressions generated in the failed test suites, and we'd have to make a distinction between the two types so that users won't have both types of results mixed (I think it'd be confusing). As long as we can differentiate between "leaf" nodes and "intermediate" nodes easily I don't think this would be a problem. But other people in the team may have a different opinion and reasons.

KTAP allows nesting tests so as you mentioned even if the structure is flattened the groups still need to encode the hierarchy

This shouldn't be a problem, a "path" string for every node (we already have that more or less) can encode the same thing just as well. It's not like we have "children" pointers anyway, only parent ones. So even if we store the full tree of nodes, browsing it top to bottom is a costly operation. In other words, what I want to say is that for some uses you don't really need an explicit tree-based data structure to encode a tree-shaped object. I'm not saying we should ditch the node hierarchies, but for some areas encoding everything as an explicit tree might be overkill, considering the type of operations we run on them (kbuild stages, probably). Nevermind, don't pay too much attention to my ramblings.

I feel like removing the test suite node would make it impossible to detect some test suite / infrastructure failures. What if the login test from LAVA succeeded, but at some point before printing the first KTAP lines in the output (so possibly before even deciding how many and which tests will be run) an error happens? How would you identify the error and where would you store it if there aren't any incomplete tests besides the test suite?

"can we have all test case nodes created when a test suite runs regardless of whether they got to run or not?" Similarly, I'm not sure how this could be achieved given many times the tests that will be run are decided at run time.

Aha, these are the key points. If the set of test cases run isn't known ahead of time, then that's the end of this issue. However, OPEN QUESTION: does that mean that we won't be able to keep track of test cases that went from "pass" or "fail" to "not run" or "skipped" because the test suite won't run them and, therefore, KernelCI won't create a node for them?

nfraprado commented 8 months ago

@hardboprobot

does that mean that we won't be able to keep track of test cases that went from "pass" or "fail" to "not run" or "skipped" because the test suite won't run them and, therefore, KernelCI won't create a node for them?

So, at least in the case of the kselftests using the KTAP format (newer tests should be using it), as part of the initial output, the number of tests expected to run is printed:

KTAP version 1
1..5

So you could potentially populate nodes for each one as soon as this line is printed, though using just a number as the test's name since the actual names aren't known at this point. This would allow you to detect crashes in the middle of a run. (maybe the kselftest runner or the TAP parser in LAVA already detects this and uses it for the test suite result, not sure) Still this isn't completely failproof, if some dependency of the test is missing it might output instead

KTAP version 1
1..0

and early exit. In that case you don't know how many tests were supposed to run if the dependencies were met. And this at least assumes the test suite is failing gracefully. If the test suite crashes before even printing those lines, then you might not even noticed it crashed. Still this failure should be tracked somewhere, and the logic place for me would be in a test suite node.

In my view, we'll probably need something to track results from test suites, notice when tests are missing in a newer run compared to a previous one, and notify that to the interested parties as a warning. I wouldn't call it an error, because it could be that the test suite was changed to no longer have that test. IIRC Mark Brown has something like in his lab. Sorry for the ramble, hopefully this makes sense and helps somehow.

laura-nao commented 8 months ago

About changing the meaning of "result" in a test suite or group, if we made test suites = pass if all the test cases pass and fail if any of the test cases fail we could generate a coarser-grained test suite regression, but then we'd also have the individual regressions generated in the failed test suites, and we'd have to make a distinction between the two types so that users won't have both types of results mixed (I think it'd be confusing).

My 2c here - I feel like part of the issues described above may be caused by the fact that the test suite result is tied to the LAVA job status, and not to the actual test suite result. By separating the concepts of job status (complete/incomplete/canceled) and test suite result (pass/fail), we could potentially avoid reporting regressions on tests that were not even run because the job finished even before generating the test cases (e.g. due to infra errors or test definition errors). However, this relies on the assumption that LAVA test suites fail when at least one test case failed, which is not the case for every test (see e.g. this job, where the 1_kselftest-dt suite is marked as pass even though its shardfile-dt case failed). I think this could be something to potentially be fixed on the LAVA side (i.e. test definitions/scripts run by the test), instead of doing that in KernelCI.

Regardless of whether the test suite result is determined straight from the LAVA results or by KernelCI, I agree there should be a distinction between test suite and test case regressions. Regressions on individual test suites should be reported along with its test cases regressions, but there could be individual test cases regressions without a matching test suite regression (if the suite was already failing when the test case regressed)..so I believe the user should be able to distinguish between them.

However, OPEN QUESTION: does that mean that we won't be able to keep track of test cases that went from "pass" or "fail" to "not run" or "skipped" because the test suite won't run them and, therefore, KernelCI won't create a node for them?

Not all tests have a plan in advance, so I also think the only way to track missing test suites/cases is to build a reference from previous jobs.

JenySadadia commented 8 months ago

Don't encode test suites as nodes, ie. flatten the hierarchy. All nodes contain a "group" field that can be used to encode the name of the test suite or group, so we could have only test cases in a single level, instead of in a tree structure. All test nodes would then have the full test artifacts.

I agree with what @nfraprado has pointed out and I also think we should keep node hierarchy as only single-level nodes would add more complications. We already had a discussion in the open hours about solution B. And I think adding a node kind e.g. job for nodes having sub-nodes will be useful to interpret test results for test suite and test case differently. With this design, we will interpret test suite completed with "pass" result and incompleted otherwise. The issue I see with this design is that regression tracker will need to traverse upto the leaf node as test suite result wouldn't be useful for detecting test case regressions.

Another option is to not introduce any new node kind and propagate test results to the parent node i.e. test suite. So, if one of the test cases is failed, a test suite will have the result "fail". We can not use LAVA test suite result as we already have observed and discussed issues with demsg and tast test suite results. @laura-nao has also pointed out the same thing. And then we would be able to rely on the result of the test suite for test case regressions. But then how to detect if the test suite job completed or not?

However, this relies on the assumption that LAVA test suites fail when at least one test case failed, which is not the case for every test (see e.g. this job, where the 1_kselftest-dt suite is marked as pass even though its shardfile-dt case failed). I think this could be something to potentially be fixed on the LAVA side (i.e. test definitions/scripts run by the test), instead of doing that in KernelCI.

Maybe LAVA also set the test suite result as "pass" for completed jobs and isn't based on child results? If so, we can use this result value to detect regressions in test suites.

what if a test suite has subgroups? I think that won't create a problem until we propagate the result to the parent suite node for every group.

Encode the information about runtime errors or test suite errors (preconditions, failed fixtures, etc) as Test fields. We're currently using two fields: "error_code" and "error_msg" to do some of this, but the way it's used isn't well defined. We could split this into multiple fields and fill them in in a well specified way:

Instead of introducing sub-fields, we can add different error codes to cover all the kinds of errors including infra and runtime errors.

r-c-n commented 8 months ago

Thanks for pitching in @JenySadadia

Instead of introducing sub-fields, we can add different error codes to cover all the kinds of errors including infra and runtime errors.

That's fine as long as the error codes can be easily differentiated or grouped by type, otherwise we'd be losing information. When searching for test results, it'd be useful to differentiate errors that happened in the runtime (lava) from actual test suite errors or other kernelci-specific errors. A simple error code prefix scheme could work for this but we'll have to put some thought into it in order to make it future-proof.

Another option is to not introduce any new node kind and propagate test results to the parent node i.e. test suite. [...] But then how to detect if the test suite job completed or not?

The error codes scheme proposed above could be used to encode this scenario if done correctly.

JenySadadia commented 8 months ago

Another option is to not introduce any new node kind and propagate test results to the parent node i.e. test suite. [...] But then how to detect if the test suite job completed or not?

The error codes scheme proposed above could be used to encode this scenario if done correctly.

Another issue with this approach is test suite regression will be detected based on child node failures instead of job status. I am not sure if that's the correct way as we won't be able to take job status into account for regression tracking.

r-c-n commented 8 months ago

Another issue with this approach is test suite regression will be detected based on child node failures instead of job status. I am not sure if that's the correct way as we won't be able to take job status into account for regression tracking.

I think the correct encoding of runtime and other errors external to the tests in the test suite nodes is also a solution to this. If done properly it should be able to encode that information, so even if the test suite result is based on the results of its children nodes, you can then check the test suite node for runtime errors. If there are none, then the test suite failed because a test case failed. If the test suite node contains additional information in the form of an error code and error message, then you can tell the suite failed because of some cause external to the tests.

r-c-n commented 8 months ago

Here's a concrete use case where it was deemed necessary to change the semantics of "result" for test suites to "fail if any test case failed": https://github.com/kernelci/kernelci-core/pull/2479 (WIP)

a-wai commented 8 months ago

I thought I had already commented here, but apparently not, so here are my thoughts:

A test suite (i.e. a node with child test nodes) should report "fail" if any of its children fails
Parent test nodes should reflect the test suite results (so also "fail" if any of its children fails), but only if the test suite actually ran, otherwise they should have state "incomplete"

Regarding the latter, one way of detecting those could be the following: for test suites containing multiple sub-tests, we can agree on a standardized prepare node (for Tast tests, it would group the current tast-tarball and os-release tests, which would then be children nodes of prepare). The logic would be the following:

if a test node contains a child node named prepare, then it's a test suite
if it has no child node apart from prepare, then that means the tests didn't actually run, and we can classify the problem as an infrastructure error (and therefore mark the higher-level test node as incomplete)
if a test node has no children, then it's a leaf test node and its result should be used to compute the parent node's state

In the case of LAVA jobs, which is our main use-case for now:

jobs with LAVA status "Incomplete" should be "incomplete" in KCI, as this implies the job couldn't execute all tests; that can be considered an "infrastructure error" IMHO
jobs which show as "Complete" in LAVA could be "pass" or "fail" depending on child nodes

Maybe LAVA also set the test suite result as "pass" for completed jobs and isn't based on child results? If so, we can use this result value to detect regressions in test suites.

At least for Tast tests, the "test suite" result is the exit code of the script: if all commands exit normally (exit code 0) then the "suite" is pass, so it's the responsibility of the script to error out if we want to ensure the test suite is marked as failed (which is basically the purpose of #2479 ).

a-wai commented 8 months ago

One thing I just noticed: in LAVA, tests can be organized in "test sets", that is, a single test providing several sub-tests; in such cases, the "test set" appears as a node in KCI, with each of the sub-tests being a child node of it.

However, test sets don't have a "result" of their own, so those nodes appear in KCI with null as their result. Calculating this field based on the results of children nodes would definitely make sense here and would be more meaningful.

Example node

r-c-n commented 8 months ago

@a-wai thanks for the ideas, some comments:

One thing I just noticed: in LAVA, tests can be organized in "test sets", that is, a single test providing several sub-tests; in such cases, the "test set" appears as a node in KCI, with each of the sub-tests being a child node of it.

However, test sets don't have a "result" of their own, so those nodes appear in KCI with null as their result. Calculating this field based on the results of children nodes would definitely make sense here and would be more meaningful.

Yes, that was one of the sources of inconsistencies. That solution would be better than what we have today, although I really wonder if we really need those intermediate nodes for test sets at all. Maybe the node structure of a LAVA test result doesn't need to be translated one-to-one to KernelCI nodes, although I might be missing some important details. In any case, any solution that is consistent across all cases works for me.

Consider this filthy hack I proposed, where the "path" of a test node would look something like this: checkout/kbuild/test_suite_x/test_set_y/test_case_z. Right now, every element of the path would have its own node. What if test_set_y wasn't a separate node?

Regarding the latter, one way of detecting those could be the following: for test suites containing multiple sub-tests, we can agree on a standardized prepare node (for Tast tests, it would group the current tast-tarball and os-release tests, which would then be children nodes of prepare). The logic would be the following:

If we finally decide to keep everything encoded into nodes, then having a dedicated node to keep the results of test fixtures sounds good.

if a test node contains a child node named prepare, then it's a test suite

FWIW, the problem with this approach is that, while the information encoded in the node tree can express that (which node is a leaf and which one is a test suite), we can't encode that in a query because our query format is quite limited and won't let us run compound statements. We can't ask the API for "test nodes which contains a child node named in a certain way`.

With the PR I linked above, in an intermediate node (suite or test set), the path would look like this: /checkout/kbuild_xyz/test_suite_X/, while a leaf node wouldn't have the trailing slash. This could fit the bill because we can use this on queries: "give me the test nodes whose path doesn't end on a trailing backlash" -> this gives you leaf nodes (test cases) exclusively.

Question: How hard would it be to implement this test grouping for test setup tasks and all the related logic?

a-wai commented 8 months ago

Yes, that was one of the sources of inconsistencies. That solution would be better than what we have today, although I really wonder if we really need those intermediate nodes for test sets at all. Maybe the node structure of a LAVA test result doesn't need to be translated one-to-one to KernelCI nodes, although I might be missing some important details. In any case, any solution that is consistent across all cases works for me.

I think we should keep the hierarchy, including test sets, because they represent not only LAVA test results, but also actual tests/sub-tests relationships. As an example, performance tests under ChromeOS are generally made up of multiple sub-tests, and I'd rather keep such information.

Consider this filthy hack I proposed, where the "path" of a test node would look something like this: checkout/kbuild/test_suite_x/test_set_y/test_case_z. Right now, every element of the path would have its own node. What if test_set_y wasn't a separate node?

Ah, I think I get the point. As long as it doesn't involve huge code changes and/or weird special-casing, and we can still easily check the results for all sub-tests of a test set, I wouldn't object having test sets being only represented in the children path. There is indeed little value in having separate node for those

Question: How hard would it be to implement this test grouping for test setup tasks and all the related logic?

For Tast tests, it would be fairly easy I think. Other cases may or may not involve more hairpulling.

FWIW, the problem with this approach is that, while the information encoded in the node tree can express that (which node is a leaf and which one is a test suite), we can't encode that in a query because our query format is quite limited and won't let us run compound statements. We can't ask the API for "test nodes which contains a child node named in a certain way`.

This brings the question of the query format, and its (lack of?) abstraction of database queries: is there a way we can improve this situation, or would it be too complex?

a-wai commented 8 months ago

Another thing is that this discussion mixes up 2 very different problems:

stored data and their meaning
data presented to users, through a viewer/dashboard etc...

I know at least I keep confusing those and come up with ideas that are important for the latter, but could be generated from data stored with other assumptions.

For example, I want to view test sets/suites as a single item, which is marked as failed if any of the subtests fails, and have the ability to then look at each subtests as child items; but as shown by the above discussion about encoding test sets in the children path field, this doesn't mean we must store test sets as separate nodes, only that whatever viewer we create/use should show them that way.

With the above in mind, I'd tend to back off from the concept of altering the parent node's result based on the children's ones for stored data. However, I still believe the viewer/dashboard should do so; but we can then add more possibilities:

if only the setup tasks are executed, then show the parent node as "Incomplete"
if actual tests are executed and all pass, then show the parent node's result as "Pass"
if all tests (except setup tasks) fail, then show the parent node's result as "Fail"
if at least one test fails but at least one passes, then show the parent node's result as "Partial fail"

The interpretation (pass/fail/partial) could be left to the viewer/dashboard, as long as the stored data contains enough information.

nuclearcat commented 8 months ago

Possible scenario (almost happened on kselftest when tappy dependency was missing) Let's say we have testset-networking, which execute: 1)Test-eth0 for existence of eth0 2)Test-wifi0 for existence of wifi0

Particular device have wifi0 missing. Let's say If we inherit and modify status of testset-networking if children fail - and set as fail. (which i think bad idea) So result will look like: testset-networking: fail testset-networking/test-eth0: pass testset-networking/test-wifi0: fail

We updated the testset and added to testset-networking third test: 3)Test-traffcapture that use libpcap and capture random networking traffic.

But by some reason lipcap missing in rootfs, so testset-networking failed to execute new test, but it is not particular test failure (so it wont trigger kernel regression), so it wont be reported as test failure. We still have: So result will look like... same! testset-networking: fail testset-networking/test-eth0: pass testset-networking/test-wifi0: fail

While if we dont inherit fail from children, we can identify problems of testset.

So on my opinion testset "testset-networking" should have it's own pass/fail indicators, which indicate if whole testset executed properly, or have problems (missing dependencies, failed non-critical device initialization etc).

P.S. Lava have incomplete status ONLY for jobs, unfortunately tests, and test sets can have either test of fail.

r-c-n commented 8 months ago

While if we dont inherit fail from children, we can identify problems of testset.

So on my opinion testset "testset-networking" should have it's own pass/fail indicators, which indicate if whole testset executed properly, or have problems (missing dependencies, failed non-critical device initialization etc).

This concern has been raised and then clarified many times during the discussion, but I'll make the same clarification again: no one is asking to discard the information about whether the test ran properly, what's being considered is to encode that information in a different way than as a "pass/fail" value in the result field.

A few alternatives have been proposed. What do you think of @a-wai 's idea of having a "pass/fail/incomplete" result? Wouldn't that work for the scenario you described?

nuclearcat commented 7 months ago

@a-wai , ping :)

a-wai commented 7 months ago

Nothing to add to what I wrote above:

With the above in mind, I'd tend to back off from the concept of altering the parent node's result based on the children's ones for stored data. However, I still believe the viewer/dashboard should do so; but we can then add more possibilities:

if only the setup tasks are executed, then show the parent node as "Incomplete"

if actual tests are executed and all pass, then show the parent node's result as "Pass"

if all tests (except setup tasks) fail, then show the parent node's result as "Fail"

if at least one test fails but at least one passes, then show the parent node's result as "Partial fail"

The interpretation (pass/fail/partial) could be left to the viewer/dashboard, as long as the stored data contains enough information.

nfraprado commented 7 months ago

To add one example to hopefully guide the changes, in this baseline test run the test failed due to infrastructure error, so an error_code was added to its node, however the child nodes, which represent the login and kernelmsg test cases, didn't get such properties. When looking over the results I would like to be able to filter out those, but currently that'd require traversing the node tree to the top to find out if any parent node had an infra error (which seems too costly).

r-c-n commented 6 months ago

To add one example to hopefully guide the changes, in this baseline test run the test failed due to infrastructure error, so an error_code was added to its node, however the child nodes, which represent the login and kernelmsg test cases, didn't get such properties. When looking over the results I would like to be able to filter out those, but currently that'd require traversing the node tree to the top to find out if any parent node had an infra error (which seems too costly).

r-c-n commented 6 months ago

A few more comments from other cases and discussions where this kind of problems (specifically the representation of node hierarchies and their result semantics) are sprouting as well:

From https://github.com/kernelci/kernelci-pipeline/pull/550:

According to Nikolai, generally, only leaf node results (test_suite.test_group.test_case) are sent to KCIDB. Intermediate test nodes in the hierarchy (in this case, "test_suite" and "test_group") are "virtual" tests in the sense that they don't necessarily translate to a real result. Instead, their results are generated as a summary of the results of its child nodes: they hold the "worst" result in their subtree. This means that if "test_case" failed, then both "test_group" and "test_suite" are marked as failed too. This is different than what we currently do in KernelCI, where intermediate nodes may have their own results independent from their children.
An example of how different this scheme is compared to what we currently have in KernelCI is this tast test, where an intermediate tast node failed but:
- 1: The cause of the failure is a runtime error reported by Lava but the error_code and error_msg are only stored in the parent node, not in this one.
- 2: Both children nodes are marked as "pass"
- 3: Only the children nodes (passed) are submitted to KCIDB but not this one because it isn't a leaf node
- 4: In KCIDB, this node would have result = pass because all its children nodes passed

The scheme followed by KCIDB is similar to the flattened-hierarchy solution proposed at the top of the discussion, with the same advantages (simplicity) and limitations (loss of information in some scenarios) as discussed above. I'm bringing @spbnick into the thread if only so that he's aware of the discussion and the different points that were brought up, particularly about the potential pitfalls of having the result of a parent node depending on the results of their children.

In any case, if we'r ever set to address these issues, here's a summary of the problems:

Defining how test hierarchies are structured and stored
Defining what a result means in a test suite or group
Decoupling the structure of Lava results from KernelCI results
Defining how to translate these semantics to KCIDB so that the results remain consistent and we don't lose critical information in the process

nuclearcat commented 6 months ago

As an idea, maybe we can submit some of the data we dont want to lose in misc field? https://github.com/kernelci/kcidb/blob/main/doc/submitter_guide.md#extra-data

If you have some data you'd like to provide developers with, but the schema doesn't accommodate it, put it as arbitrary JSON under the misc field, which is supported for every object. Then contact KCIDB developers with your requirements.

r-c-n commented 6 months ago

As an idea, maybe we can submit some of the data we dont want to lose in misc field? https://github.com/kernelci/kcidb/blob/main/doc/submitter_guide.md#extra-data
If you have some data you'd like to provide developers with, but the schema doesn't accommodate it, put it as arbitrary JSON under the misc field, which is supported for every object. Then contact KCIDB developers with your requirements.

Yes, but that is out of the scope of this discussion and more related to https://github.com/kernelci/kernelci-pipeline/pull/550. I mentioned KCIDB above as an example but the topic we're discussing has nothing to do with KCIDB, so I suggest we focus on KernelCI first.

JenySadadia commented 6 months ago

Hello, I went through the whole discussion again and here is what I'd like to suggest based on the ideas proposed above:

Keep the test hierarchy as is to preserve test suite execution failure information. It means to drop the idea of flattening the test hierarchy.
Introduce a test suite to hold nodes for preconditions such as login(for baseline) or os-release (for test tests) and name it setup/pre-checks/prepare (suggestions are welcome). On a separate note, this would also fix the issue I am facing related to KCIDB integration. We would be able to drop the submission of such test nodes to KCIDB if required.

if only the setup tasks are executed, then show the parent node as "Incomplete" if actual tests are executed and all pass, then show the parent node's result as "Pass" if all tests (except setup tasks) fail, then show the parent node's result as "Fail" if at least one test fails but at least one passes, then show the parent node's result as "Partial fail"
I liked @a-wai's idea here for updating test suite state/result based on the below scenarios. We can also consider introducing kind=job for top-level parent test suites i.e. the ones introduced by KernelCI in the test definition such as dmesg, tast, kselftest-dt, and setup test suite to distinguish it from any intermedia test sets. Atm we are storing test suite results from LAVA's callback data. We can drop it and use our own parsing as above proposed. As @hardboprobot proposed, we'd check for error code/message from LAVA and store it in the test suite node fields.

r-c-n commented 6 months ago

We would be able to drop the submission of such test nodes to KCIDB if required.

In that case, I understand we would submit only the test suite node as "INCOMPLETE"?

@JenySadadia this still leaves some open questions unsolved:

What about the results of intermediate nodes between the test suite and test cases? Will their results also depend on the results of their child nodes? If so, how?
Are these intermediate nodes even necessary? See this comment above:

Consider https://github.com/kernelci/kernelci-core/pull/2466, where the "path" of a test node would look something like this: checkout/kbuild/test_suite_x/test_set_y/test_case_z. Right now, every element of the path would have its own node. What if test_set_y wasn't a separate node?

Wouldn't these result meanings also need some kind of translation when submitted to KCIDB? AFAIK there's no concept of "partial fail" in KCIDB, I think "SKIP" is the closest result type.
Maybe not important long-term if we end up running all searches in KCIDB but, when searching for results in KernelCI, how would we distinguish test suites from test groups and test cases? I think this question is also valid for KCIDB if intermediate nodes are created and submitted.

JenySadadia commented 6 months ago

@hardboprobot One way to address the issues you've pointed out is:

For the job https://staging.kernelci.org:9000/viewer?node_id=6603f8c5f3d47fcde0b3cf19, we can create the below hierarchy:

20240521_114857

I have assigned kind=job for the test suites created by KernelCI. Intermediate test sets and leaf test nodes will have kind=test.

What about the results of intermediate nodes between the test suite and test cases? Will their results also depend on the results of their child nodes? If so, how?

We will manually calculate test suite results explained by @a-wai (we may need to drop partial-fail result and set fail if one of the child fails) i.e. result of kind=job nodes. We are storing intermediate test set/case result from LAVA callback data. In principle, we should traverse the whole hierarchy and set the result of an intermediate test set or test suite to fail if one of its child nodes fails. However, this won't make much of a difference from KCIDB's PoV as it will calculate the worst results of child nodes and display the test set result accordingly even if we send intermediate tests to KCIDB.

Are these intermediate nodes even necessary? See this comment above:

I think yes. I'd prefer to keep the test hierarchy with parent-child relationship.

Wouldn't these result meanings also need some kind of translation when submitted to KCIDB? AFAIK there's no concept of "partial fail" in KCIDB, I think "SKIP" is the closest result type.

Yes, it would need mapping with KCIDB schema.

A test suite (node with kind=job) will be assigned to state=done, result=pass if all its tests passed. The test suite node won't be submitted to KCIDB.
A test suite (node with kind=job) will be assigned to state=done, result=fail if one of its tests failed. The test suite node will be submitted to KCIDB.
- A test suite (node with kind=job) will be assigned to state=done, result=incomplete if setup tests failed. The test suite node will be submitted to KCIDB with result=miss.
- A test suite (node with kind=job) will be assigned to state=done, result=incomplete if setup test results are unknown. (In case of node timeout?) The test suite node will be submitted to KCIDB with result=miss.

Maybe not important long-term if we end up running all searches in KCIDB but, when searching for results in KernelCI, how would we distinguish test suites from test groups and test cases? I think this question is also valid for KCIDB if intermediate nodes are created and submitted.

Storing test suites with kind=job would be helpful to distinguish top-level test jobs.

r-c-n commented 6 months ago

@JenySadadia thanks for the explanation. I think you have a solid understanding of the problems and I'm sure you'll find a way to make all this work, so I'd say go ahead and good luck!

There are some things I still feel a bit confusing but I'm not saying they're wrong, they just feel inconsistent. But maybe they won't be a problem, we'll see:

In the example node you linked, for instance, tast-perf-x86-pineview will be a job, tast-perf-x86-pineview.tast will be a job, but tast-perf-x86-pineview.tast.filemanager.UIPerf.directory_list and tast-perf-x86-pineview.tast.filemanager.UIPerf.list_apps (which are test groups) will be a test. I understand the practical difference is that job nodes will have setup children.

Then, tast-perf-x86-pineview.tast.filemanager, tast-perf-x86-pineview.tast.filemanager.UIPerf and tast-perf-x86-pineview.tast.filemanager.UIPerf.directory_list.100_size aren't represented as individual nodes in KernelCI but they will be "test nodes" in KCIDB with their own aggregate result, as explained by Nick here.

As I say, a bit hard to grasp and also a bit counter-intuitive at first glance, but as long as this ends up being consistent and well-documented, I'm happy. Thanks for the ideas and the design, let me know if I can help you in any way.

JenySadadia commented 6 months ago

In the example node you linked, for instance, tast-perf-x86-pineview will be a job, tast-perf-x86-pineview.tast will be a job, but tast-perf-x86-pineview.tast.filemanager.UIPerf.directory_list and tast-perf-x86-pineview.tast.filemanager.UIPerf.list_apps (which are test groups) will be a test. I understand the practical difference is that job nodes will have setup children.

Yes, that's one of the suggestions.

Then, tast-perf-x86-pineview.tast.filemanager, tast-perf-x86-pineview.tast.filemanager.UIPerf and tast-perf-x86-pineview.tast.filemanager.UIPerf.directory_list.100_size aren't represented as individual nodes in KernelCI but they will be "test nodes" in KCIDB with their own aggregate result, as explained by Nick https://github.com/kernelci/kernelci-pipeline/pull/550#issuecomment-2095535500.

Looking at the nodes created from callback data https://staging.kernelci.org:9000/viewer?search=parent%3D6603fce9f3d47fcde0b3dea6), yes. Do you think it will cause issues?

As I say, a bit hard to grasp and also a bit counter-intuitive at first glance, but as long as this ends up being consistent and well-documented, I'm happy. Thanks for the ideas and the design, let me know if I can help you in any way.

Honestly, I don't want to rush to implement this if the proposal above doesn't look like full proof covering all the scenarios we want to address. I would rather make sure that we reach a consistent solution.

I'd appreciate it if we discuss this further and hear all the ideas that you and others might have to make this design better as I may be missing something important. Please pitch in @a-wai @nuclearcat @pawiecz. I am okay to drop this whole idea if we can come up with something better.

pawiecz commented 6 months ago

This thread raised several important topics - let me try to unpack them one by one (some might even require moving to a separate issues/discussion):

Adjusting test hierarchy (obviously) including restructuring nodes
Making clear distinction between data structure and presentation
Propagating state/errors through the tree structure

r-c-n commented 6 months ago

Honestly, I don't want to rush to implement this if the proposal above doesn't look like full proof covering all the scenarios we want to address. I would rather make sure that we reach a consistent solution.

@JenySadadia my 2c: maybe there isn't a single perfect solution. This is a quirky problem and no one seems to have a definitive answer. Maybe a good approach could be to start with a reasonable solution, sketch it and see if it's viable and if it fixes the known problems well enough. The worse that can happen is that there'll still be some open issues, and then you can refine the solution or try an alternative. But if we wait until we have the solution before we start to work on it we may end up stuck forever.

pawiecz commented 6 months ago

Adjusting test hierarchy (obviously) including restructuring nodes

I believe the solution suggested by @JenySadadia is a step in the right direction. I'd be even more confident about choosing this approach if there was a mock state machine available for it (mapping possible test results to data stored in nodes).

The suites that are currently run in the system (baseline, selected kselftest, tast) should provide a good coverage of what will be stored in the internal database throughout its lifetime - I'd suggest starting with those to locate potential edge-cases and specifying requirements for the adjusted structure to support. As @hardboprobot noted - there might be a need for some changes in the future but that's a place to start (and build confidence in the updated design).

spbnick commented 6 months ago

Sorry, I only just noticed I was mentioned here. Too many GitLab/GitHub notifications :see_no_evil:

I'll try to retell how KCIDB handles test status and what I think about intermediate statuses.

As mentioned above, the result for a test node is the maximum of all test results at this node, and all nodes below. So the following reported results:

ltp ERROR
ltp.atof01 PASS
ltp.fptest01 PASS
ltp.abs01 FAIL

Would mean that the ltp.abs01 test failed (meaning kernel is broken), and the ltp test suite has errored out / aborted for internal reasons (e.g. it failed to write a log file while wrapping up the abs01 execution).

The summarized result of ltp would be FAIL, because that's worse than ERROR. You could still access the unsummarized, raw ltp results and get your ERROR, though.

If we however get these results reported:

ltp FAIL
ltp.atof01 PASS
ltp.fptest01 PASS
ltp.abs01 FAIL

(Note the intermediate ltp node is being sent here with status FAIL) Then we would get summarized result of ltp as FAIL, as that would be the worst of all of them.

This, though, would be equivalent to just sending:

ltp.atof01 PASS
ltp.fptest01 PASS
ltp.abs01 FAIL

The only "loss" of data case would be when the parent node produces better result than its children. Something like this:

ltp PASS
ltp.atof01 PASS
ltp.fptest01 PASS
ltp.abs01 FAIL

Where KCIDB would summarize ltp as FAIL.

OTOH, is it really a good idea to do that? And to send such results, expecting the suite result to override the particular test's results?

JenySadadia commented 6 months ago

OTOH, is it really a good idea to do that? And to send such results, expecting the suite result to override the particular test's results?

I think the better approach would be to preserve a test suite result if a CI system has explicitly sent it. If not, KCIDB can manually calculate its result from the sub-nodes and set the worst result of all.

spbnick commented 6 months ago

Yes, there's no problem with sending them. You would still be able to access them independently, but the overall status of e.g. a build, or a revision would ignore the case of a suite showing better result than its tests.

JenySadadia commented 5 months ago

@hardboprobot Do we have any pending issue that needs to be addressed as part of this? Otherwise, we should close this issue.

r-c-n commented 5 months ago

We're now on a different situation after the latest changes and everyone seems happy with them, so I think we can close it. If we find we need to make further adjustments in the future due to new feature requests we can start another discussion.

Thanks!