teemtee / tmt

Test Management Tool
MIT License
86 stars 129 forks source link

Create Data Router reporter plugin #2481

Open thrix opened 1 year ago

thrix commented 1 year ago

For RH usage teams would benefit from having a reporter plugin for Data Router. Data router is a service to report your results to all places needed. It is RH only service.

https://url.corp.redhat.com/data-router-confluence

The requirement of the plugin is to accept authentication details via environment variables, so the plugin is easily usable in Testing Farm.

kkaarreell commented 1 year ago

Please, be aware that there are couple of issues associated with data router usage.

  1. It uses RP sync API and the import relies on importing Zip archive with junit. That means that the test framework needs to prepare junit format in a form that includes <failure> elements with respective tracebacks so that log processing by AA/TFA is possible.
  2. In order to import test logs to RP through DR one has to prepare specially crafted directory structure (included in the Zip archive AFAIK) that contains respective logs for each test where tests name matches the respective directory name. This is a problem for tmt tests are they typically contain several / characters in the test name.
  3. Zip import itself is currently not 100% reliable due to a bug in RP. It should be hopefully addressed in a future RP version but currently the bug is there and it is possible that it cannot be fully avoided for large testsuites. According to RP devs the right/future way of importing data is to use async API, which is what tmt reportportal plugin uses (AFAIK) and DR does not.
thrix commented 1 year ago

Let me bring data router folks to this conversation, so we can talk about this. (emailed them to participate in this issue)

yanan-fu commented 1 year ago

@olivergondza @waynesun09 ^^

olivergondza commented 1 year ago

Please, be aware that there are couple of issues associated with data router usage.

  1. It uses RP sync API and the import relies on importing Zip archive with junit. That means that the test framework needs to prepare junit format in a form that includes <failure> elements with respective tracebacks so that log processing by AA/TFA is possible.

This is correct. The XUnit format is the (only) way for your testing framework to communicate with Data Router. It is not that DR would perform no other interaction with the target systems except for XUnit upload, but it is the main and mandatory data source for DR (when it comes to publishing to RP or Polarion).

  1. In order to import test logs to RP through DR one has to prepare specially crafted directory structure (included in the Zip archive AFAIK) that contains respective logs for each test where tests name matches the respective directory name. This is a problem for tmt tests are they typically contain several / characters in the test name.

This is also correct. Although, we have run into use-cases that are pushing us to improve this as TMT is not the only source of test data with class/test names containing characters that are cumbersome to represent in filesystem hierarchy.

  1. Zip import itself is currently not 100% reliable due to a bug in RP. It should be hopefully addressed in a future RP version but currently the bug is there and it is possible that it cannot be fully avoided for large testsuites. According to RP devs the right/future way of importing data is to use async API, which is what tmt reportportal plugin uses (AFAIK) and DR does not.

Can you point us to the bug you are mentioning, and the unreliability observed? Thanks!

kkaarreell commented 1 year ago

Can you point us to the bug you are mentioning, and the unreliability observed? Thanks!

I believe it is https://github.com/reportportal/reportportal/issues/1515 which is supposed to be fixed in v5.8. I know DR has implemented some workaround for this issue by searching for the most recently created launch of a same name. However, this approach is not sufficient when there are multiple threads running in parallel an importing launches with the same name.

waynesun09 commented 1 year ago

Please, be aware that there are couple of issues associated with data router usage.

  1. It uses RP sync API and the import relies on importing Zip archive with junit. That means that the test framework needs to prepare junit format in a form that includes <failure> elements with respective tracebacks so that log processing by AA/TFA is possible.

This is correct. The XUnit format is the (only) way for your testing framework to communicate with Data Router. It is not that DR would perform no other interaction with the target systems except for XUnit upload, but it is the main and mandatory data source for DR (when it comes to publishing to RP or Polarion).

Yes, and currently TMT already helps with generating xUnit XML, though the error part for TFA is an issue for restraint beaker job results.

  1. In order to import test logs to RP through DR one has to prepare specially crafted directory structure (included in the Zip archive AFAIK) that contains respective logs for each test where tests name matches the respective directory name. This is a problem for tmt tests are they typically contain several / characters in the test name.

This is also correct. Although, we have run into use-cases that are pushing us to improve this as TMT is not the only source of test data with class/test names containing characters that are cumbersome to represent in filesystem hierarchy.

If you have extra supported artifacts you want to attach to the test items, you need to generate the specific structure. If not and you have already put the log context (the failed log especially) into the xUnit file, which TMT helped with, you don't need to create the extra attachment dir but just point to the xUnit xml file.

  1. Zip import itself is currently not 100% reliable due to a bug in RP. It should be hopefully addressed in a future RP version but currently the bug is there and it is possible that it cannot be fully avoided for large testsuites. According to RP devs the right/future way of importing data is to use async API, which is what tmt reportportal plugin uses (AFAIK) and DR does not.

Can you point us to the bug you are mentioning, and the unreliability observed? Thanks!

Yes, issue 1515 is merged into the new release. It's an RP bug that any client could only accommodate at the moment as the current deployment is with an old version internally.

Currently, internal deployment is still in the process of upgrading, we have issues in the Data Router tracking the RP API upgrades, until we have implemented and tested it out against the new RP release deployment we can't ship it out.

And NO, TMT also not using the ASYNC API: https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L82C5-L82C24

as in the code it's defined to use the v1 and in line: https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L147

it uses the v1 launch API to start the launch. We are waiting for RP to offer better ASYNC API support also. Please correct me if I have checked the wrong code source.

Current TMT implementation is the same with rp_preproc (the project has been put in deprecated status), we have switched to the launch/import API for better performance as we have all client requests that need to be passed through, we haven't got the mix-matching launch not found the issue though it's the risk we have chosen to take. And meanwhile, we've been waiting for RP to roll out the fix.

kkaarreell commented 1 year ago

If you have extra supported artifacts you want to attach to the test items, you need to generate the specific structure. If not and you have already put the log context (the failed log especially) into the xUnit file, which TMT helped with, you don't need to create the extra attachment dir but just point to the xUnit xml file.

In theory yes, but in practice in order to troubleshoot the problem one usually needs additional logs provided by the test, not just the basic log or traceback/error.

And NO, TMT also not using the ASYNC API: https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L82C5-L82C24

as in the code it's defined to use the v1 and in line:

https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L147

it uses the v1 launch API to start the launch. We are waiting for RP to offer better ASYNC API support also. Please correct me if I have checked the wrong code source.

This is correct, however there is one important difference. tmt is not using launch import through the zip archive but builds test items iteratively. Therefore it uses different API calls and these calls are in fact present also in v2. So you can replace v1 with v2 in the tmt code and the data import itself will work. I will file an issue for tmt to switch to v2 by default.

waynesun09 commented 1 year ago

If you have extra supported artifacts you want to attach to the test items, you need to generate the specific structure. If not and you have already put the log context (the failed log especially) into the xUnit file, which TMT helped with, you don't need to create the extra attachment dir but just point to the xUnit xml file.

In theory yes, but in practice in order to troubleshoot the problem one usually needs additional logs provided by the test, not just the basic log or traceback/error.

Yes, that's why we need pre-formatted dir structure with matching attachments. This is a common issue regarding using the RP Rest API, both Data Router and TMT or other clients need to face this. Currently, Data Router relies on users with testing framework or pipeline steps to do the pre-formatting, it's should not be part of Data Router scope as not knowing all the details without giving all the metadata with matching, which will be another pre-formatted JSON will be required from the user. That means no matter what tool users are using, there will be a pre-formatting step required to feed the tool, we don't see the dir structure as more complicated like generating a JSON structure file point to all file locations.

And NO, TMT also not using the ASYNC API: https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L82C5-L82C24 as in the code it's defined to use the v1 and in line: https://github.com/teemtee/tmt/blob/4ba07981f2313c5c30ec157b27e3964692711ecd/tmt/steps/report/reportportal.py#L147

it uses the v1 launch API to start the launch. We are waiting for RP to offer better ASYNC API support also. Please correct me if I have checked the wrong code source.

This is correct, however there is one important difference. tmt is not using launch import through the zip archive but builds test items iteratively. Therefore it uses different API calls and these calls are in fact present also in v2. So you can replace v1 with v2 in the tmt code and the data import itself will work. I will file an issue for tmt to switch to v2 by default.

With not using launch import, TMT uses 5 APIs by repeatedly calling the test item API depending on how many test cases are included to create one launch, this is costly from the client side with not all payloads requiring each test item update. With our current user payload not heavily exceeding 3000 test items per request, the best scenario is one launch/import API request could finish the import, the worst scenario is we have to have an extra test item API update call based on test case numbers, the middle and often is part of test item need update. The intention of reducing API calls is for better performance with throughput, also RP instances could face degrading service with handling too many requests, with more test cases in one launch you will face this more.

Then with test cases exceeding 3000, the launch import API will get 504 which should be fixed in the new RP release, we are waiting to see what's the improvement.

My team is the owner of rp_preproc and RP internal Operation, we have years of accumulative knowledge of RP relative development and management and it's shared with the internal community, and though we have been losing resources on maintainers, it's still our team's focus area with supporting better usage. I'm not actively tracking RP API improvements, but as I mentioned we are also waiting for better ASYNC support, from what I know the v2 is not ready proof so if you find it's ready and stable please let us know.

waynesun09 commented 1 year ago

As we are discussing the TMT Data Router plugin, as TMT already has formatted the results structure, I think it could be directly consumed by the Data Router with some transforming within the plugin.

kkaarreell commented 1 year ago

I am sorry, it was not my intention to compete with DR implementation. I was just trying to point out some issues that tmt will eventually have to deal with. I am also expecting that some changes will need to be made on DR side.

Yes, that's why we need pre-formatted dir structure with matching attachments. This is a common issue regarding using the RP Rest API, both Data Router and TMT or other clients need to face this. Currently, Data Router relies on users with testing framework or pipeline steps to do the pre-formatting, it's should not be part of Data Router scope as not knowing all the details without giving all the metadata with matching, which will be another pre-formatted JSON will be required from the user. That means no matter what tool users are using, there will be a pre-formatting step required to feed the tool, we don't see the dir structure as more complicated like generating a JSON structure file point to all file locations.

Well, my point was about the specific implementation in DR that uses test names as directory names. Clearly, this approach is troublesome when tests are using '/' character in name, which are basically all tmt tests. Moreover, it does not handle a situation where one test is executed multiple times in a test plan. DR would probably have to support unique test run identifiers (like tmt does) to deal with this problem and to be able to upload attachments to the correct test case run instance.

With not using launch import, TMT uses 5 APIs by repeatedly calling the test item API depending on how many test cases are included to create one launch, this is costly from the client side with not all payloads requiring each test item update. With our current user payload not heavily exceeding 3000 test items per request, the best scenario is one launch/import API request could finish the import, the worst scenario is we have to have an extra test item API update call based on test case numbers, the middle and often is part of test item need update.

OK, but is that compatible with RP plans? I know from RP PO that the scenario they focus on is a situation where test results are continuously reported during testsuite execution, not only on a level of test cases but even with higher granularity. So I assume that from their perspective high number of connections is not an issue. Since zip import is not available in API v2, do we know from RP whether they plan to support v1 also in future versions (e.g. v8)?

The intention of reducing API calls is for better performance with throughput, also RP instances could face degrading service with handling too many requests, with more test cases in one launch you will face this more.

I think this is questionable. The current import/timeout issue happens even on minimal load. We don't know how RP will behave with the fix since we don't have it. Doing many smaller requests may mean that RP won't be able to process them, but at the same time it may not be able to process small number of large requests. We would need to see real numbers.

Then with test cases exceeding 3000, the launch import API will get 504 which should be fixed in the new RP release, we are waiting to see what's the improvement.

Btw, I am seeing the issue even with 20+ test cases.

waynesun09 commented 1 year ago

I am sorry, it was my intention to compete with DR implementation. I was just trying to point out some issues that tmt will eventually have to deal with. I am also expecting that some changes will need to be made on DR side.

Thanks for the explanation. Yes, I agree, both tmt and DR need address certain issues, let's try to itemize it for better tracking.

Yes, that's why we need pre-formatted dir structure with matching attachments. This is a common issue regarding using the RP Rest API, both Data Router and TMT or other clients need to face this. Currently, Data Router relies on users with testing framework or pipeline steps to do the pre-formatting, it's should not be part of Data Router scope as not knowing all the details without giving all the metadata with matching, which will be another pre-formatted JSON will be required from the user. That means no matter what tool users are using, there will be a pre-formatting step required to feed the tool, we don't see the dir structure as more complicated like generating a JSON structure file point to all file locations.

Well, my point was about the specific implementation in DR that uses test names as directory names. Clearly, this approach is troublesome when tests are using '/' character in name, which are basically all tmt tests. Moreover, it does not handle a situation where one test is executed multiple times in a test plan. DR would probably have to support unique test run identifiers (like tmt does) to deal with this problem and to be able to upload attachments to the correct test case run instance.

Yes, I forgot to address the specific character including '/' here. First, special characters should not be set as test case name, testing frameworks commonly replace it with ., which could be translated back to a module or class dir path. So with TMT, either used as testing framework or scheduler/executor to run other frameworks, it should be formatted without '/'. I know this will cause some issue with current users test case mapping in certain management systems, but with some transforming code it should be manageable. I just filed issue in TMT to track this:

https://github.com/teemtee/tmt/issues/2509

And for Data Router side, we have issues tracking this and I have just cloned a specific one with set higher priority on fixing it.

With not using launch import, TMT uses 5 APIs by repeatedly calling the test item API depending on how many test cases are included to create one launch, this is costly from the client side with not all payloads requiring each test item update. With our current user payload not heavily exceeding 3000 test items per request, the best scenario is one launch/import API request could finish the import, the worst scenario is we have to have an extra test item API update call based on test case numbers, the middle and often is part of test item need update.

OK, but is that compatible with RP plans? I know from RP PO that the scenario they focus on is a situation where test results are continuously reported during testsuite execution, not only on a level of test cases but even with higher granularity. So I assume that from their perspective high number of connections is not an issue. Since zip import is not available in API v2, do we know from RP whether they plan to support v1 also in future versions (e.g. v8)?

The live streaming test case update it's not Data Router and TMT support (correct me if I'm wrong), both facilitate test result import after all testing have finished. Streaming will require direct testing framework support with embed the RP client code into it, which means developing on each specific testing framework with specific coding language binding, not every framework will support or accept that change. Then with regard to DR and internal use, the requirement have been raised in initial plan but development did not pick it up as implementation cost is way higher than possible benefit.

The intention of reducing API calls is for better performance with throughput, also RP instances could face degrading service with handling too many requests, with more test cases in one launch you will face this more.

I think this is questionable. The current import/timeout issue happens even on minimal load. We don't know how RP will behave with the fix since we don't have it. Doing many smaller requests may mean that RP won't be able to process them, but at the same time it may not be able to process small number of large requests. We would need to see real numbers.

We do try to benchmark, initial result is not satisfying as what I expect, further benchmark been planned but resource have been reduced which put benchmark under less priority backlog.

Another reason we try to reduce API call is from Operation side which pressure on RP not responding to any if RP could not handing work load itself.

What initial gain is at least we will have a launch, we have dispatched following test item update to another service, so the processing could have at least a better state with all test items been created. Rather than storming retry to have create launch and break on failed test item update and never finish the launch and meanwhile Operation complain excessive API calls and RP not responding.

Current DR implementation does saw the test item updater service spending 10+ minutes with a heavy payload (3000+ test cases with attribute update), but eventually could finish though have causing DR internal queue piling. I know current TMT implementation could work also, but it will also fail when the RP is not responding when under heavy load.

And with all this said, the fundamental issue is RP the project we are consuming have issues all clients need adjust to better consume.

Then with test cases exceeding 3000, the launch import API will get 504 which should be fixed in the new RP release, we are waiting to see what's the improvement.

Btw, I am seeing the issue even with 20+ test cases.

Yeah, so that depends on how much payload within a test case, we have a sample set and a sample RP instance with certain given compute resource, which is used to reproduce the issue. So, yes, 20+ test cases break the API is quite possible.

kkaarreell commented 1 year ago

Yes, I forgot to address the specific character including '/' here. First, special characters should not be set as test case name, testing frameworks commonly replace it with ., which could be translated back to a module or class dir path. So with TMT, either used as testing framework or scheduler/executor to run other frameworks, it should be formatted without '/'.

Do you have any links to support that claim? Because on the path from tmt to RP it is only DR who has issues with that. For example, looking at junit XML format description on IBM site the test case name seems to be loosely formatted, unlike the test case id. IMHO you should be rather using test case id when it is present, that's what ids are designed for at the first place.

OK, but is that compatible with RP plans? I know from RP PO that the scenario they focus on is a situation where test results are continuously reported during testsuite execution, not only on a level of test cases but even with higher granularity. So I assume that from their perspective high number of connections is not an issue. Since zip import is not available in API v2, do we know from RP whether they plan to support v1 also in future versions (e.g. v8)?

The live streaming test case update it's not Data Router and TMT support (correct me if I'm wrong), both facilitate test result import after all testing have finished.

Again, I am just stating what is the focus of RP developers according to their PO.

Streaming will require direct testing framework support with embed the RP client code into it, which means developing on each specific testing framework with specific coding language binding, not every framework will support or accept that change. Then with regard to DR and internal use, the requirement have been raised in initial plan but development did not pick it up as implementation cost is way higher than possible benefit.

In fact, the most common frameworks already support it. You can check the documentation list of client repos But again, I am not saying whether this approach is good or bad, I am just saying what is the focus of RP development team. If we choose a different direction it would be at least worth double-checking with RP team that it is compliant long-term.

olivergondza commented 1 year ago

In fact, the most common frameworks already support it. You can check the documentation list of client repos But again, I am not saying whether this approach is good or bad, I am just saying what is the focus of RP development team. If we choose a different direction it would be at least worth double-checking with RP team that it is compliant long-term.

I would argue that this is not a right fit for the use-case we face for DR, and not even at Red Hat at large. This tightly couples RP with a particular test suite or a library used in it. In the situation where we have multitude of testsuites utilizing a number of different testing frameworks and programming languages across the company, and the results required to be reported to more than one target system, we would need M*N integrations. By putting DR in between requiring a common "language" to report test results, we are at much saner M+N integration points.