giis-uniovi / retorch-st-eShopContainers

RETORCH eShopContainers End-to-End Test Suite
Apache License 2.0
1 stars 0 forks source link

Test Flakiness due to not syncronization order-api with other services #80

Closed augustocristian closed 2 months ago

augustocristian commented 3 months ago

In a previous pull request (PR), we experienced intermittent failures (approximately 1 out of every 15 executions) due to a 502 Bad Gateway error. I investigated the root cause of this issue, i the URLPR, I increased the timeout for the migration process, which partially addresses the problem. However, to enhance the stability of the System Under Test (SUT), we need a more maintainable solution. The normal execution flow includes logging the following information in the order-api: Now listening on: http://0.0.0.0:80 This indicates that the Microsoft.eShopOnContainers.Service.OrdersDB database has been created and populated with data. However, depending on the system load, there can be service desynchronization, causing the Backend for Frontends (BFF) to experience failures and report incorrect redirections. The failures are reported as migration timeout exceptions, as shown in the logs below:

fail: Microsoft.EntityFrameworkCore.Database.Command[20102]
      Failed executing DbCommand (150,167ms) [Parameters=[], CommandType='Text', CommandTimeout='150']
      IF SERVERPROPERTY('EngineEdition') <> 5
      BEGIN
          ALTER DATABASE [Microsoft.eShopOnContainers.Services.OrderingDb] SET READ_COMMITTED_SNAPSHOT ON;
      END;
Unhandled exception. Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.CheckThrowSNIException()
   at Microsoft.Data.SqlClient.SqlCommand.CheckThrowSNIException()
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.CreateAsync(CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.CreateAsync(CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.Migrator.MigrateAsync(String targetMigration, CancellationToken cancellationToken)
   at Program.<Main>$(String[] args) in /src/Services/Ordering/Ordering.API/Program.cs:line 67
   at Program.<Main>(String[] args)
ClientConnectionId:edf88c5d-4e04-4255-91ac-92210fee7c84
Error Number:-2,State:0,Class:11

The database error log also shows that it is not ready:

2024-05-22 12:57:32.46 Logon       Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'Microsoft.eShopOnContainers.Services.OrderingDb'. [CLIENT: 172.22.0.68]
2024-05-22 12:57:32.49 Logon       Error: 18456, Severity: 14, State: 38.

For more details, please refer to the following logs:

augustocristian commented 3 months ago

I attempted to fix the issue yesterday in this branch. I tried the following approaches:

augustocristian commented 3 months ago

Do you have some idea about how I can deal it in a proper way @javiertuya ?

javiertuya commented 3 months ago

@augustocristian I don't clearly understand the problem, but if you have a bad gatewy error, the problem seems to be located at some proxy, or at the backend, not at the database. What is the architecture to connect from a frontend to the backend?

augustocristian commented 3 months ago

The system architecture is illustrated in the following diagram: image

Our test suite validates the entire system starting from the webmvc frontend, which means using the Backend of Frontends webshoppingagg. In the official diagram (above), looks that the microservices follow the Single Responsibility Principle, each with a separate database instance. However, in reality, all the microservices are coupled and share the same container (sqldata_nametjob) where the different databases are housed.

Initially, the system's flakiness was due to the database not being ready to accept connections when the various APIs (identity, catalog, etc.) start querying and creating tables. This issue was resolved with a database health check, ensuring that the APIs now wait until the database is ready. The current problem is with the database migration (operations performed by different APIs) and the frontend. Under heavy load, the migration and creation of different databases take a significant amount of time. Consequently, the frontend and, by extension, the backend for the frontend (webshoppingagg) start querying with no results. To address this, I was customizing the health check to monitor the readiness of the service, ensuring that the frontend and the rest of the services are not deployed or run until everything is ready. I have debugged and found evidences (as described above) that this is the root of the problem, but there may be other underlying issues that I have not yet identified. Some more info: The stacktrace of the frontend starts reporting:

info: Yarp.ReverseProxy.Forwarder.HttpForwarder[48]
      UpgradeResponseCanceled: Copying the upgraded response body was canceled.
      System.Threading.Tasks.TaskCanceledException: The operation was canceled.
       ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
       ---> System.Net.Sockets.SocketException (125): Operation canceled
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.ReadAsync(Memory`1 destination)
         at System.Net.Http.HttpConnection.RawConnectionStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.RawConnectionStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
         at Yarp.ReverseProxy.Forwarder.StreamCopier.CopyAsync(Stream input, Stream output, Int64 promisedContentLength, StreamCopierTelemetry telemetry, ActivityCancellationTokenSource activityToken, Boolean autoFlush, CancellationToken cancellation)

In webshopingagg is propagated:

info: Yarp.ReverseProxy.Forwarder.HttpForwarder[9]
      Proxying to http://basket_api_tjobfdebug3/api/v1/basket/22f5de56-b3e1-4da2-84bd-8518dfe3a9fb HTTP/2 RequestVersionOrLower no-streaming
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[56]
      Received HTTP/1.1 response 200.
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[48]
      RequestCanceled: The request was canceled before receiving a response.
      System.Threading.Tasks.TaskCanceledException: The operation was canceled.
       ---> System.IO.IOException: The response ended prematurely.
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at Yarp.ReverseProxy.Forwarder.HttpForwarder.SendAsync(HttpContext context, String destinationPrefix, HttpMessageInvoker httpClient, ForwarderRequestConfig requestConfig, HttpTransformer transformer, CancellationToken cancellationToken)
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[9]

On first moments I thought that the problem was with the basket, but everything worked_

 Microsoft.Hosting.Lifetime[14]
      Now listening on: http://0.0.0.0:81
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://0.0.0.0:80
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /app
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.IntegrationEvents.EventHandling.OrderStartedIntegrationEventHandler[0]
      Handling integration event: 45e9a666-99d0-4bb6-b861-1437f30a3210 - (OrderStartedIntegrationEvent { Id = 45e9a666-99d0-4bb6-b861-1437f30a3210, CreationDate = 05/22/2024 13:08:51, UserId = 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb })
info: GrpcBasket.BasketService[0]
augustocristian commented 3 months ago

Here are the complete logs comparing the Ordering API (HEALTHY) service with the ones that are causing failures. Additionally, the log from the proxy, which continuously reports the Ordering API as unhealthy, is also attached. webshoppingagg_tjobfdebug3.log webshoppingagg_tjobfdebug3.log ordering_api_tjobehealthy.log

javiertuya commented 3 months ago

@augustocristian If the current problem is with the database migration, you can take two different approaches:

augustocristian commented 3 months ago

@javiertuya I've implemented a version of the first solution you suggested. The ASPNET services have an internal health check available at the hostname/hc address. This health check allows us to halt execution if the order-api container is not ready, effectively moving the failure from execution stage to the setup stage. I've also modified the RETORCH tool scripts to enable debugging in this case. Previously, if the compose failed, the logs were not stored.

augustocristian commented 3 months ago

Two types of flakiness remain:

  1. The order API flakiness is reduced with the health check. At least the failure is now shown during the setup of the SUT rather than in the tests themselves. (If the system is not ready, it doesn't make sense to execute the tests)

  2. Another failure appears during longer executions of the order tests and is due to the behavior of the payment system itself. The orders automatically change state over time from awaiting validation → stock confirmed → pending → paid. This transition doesn't depend on any external factors. Should I modify or add a delay to the SUT to slow down this process and ensure the tests pass even when execution is slow? @javiertuya The changes performed are available in the PR #83