Test Flakiness due to not syncronization order-api with other services

In a previous pull request (PR), we experienced intermittent failures (approximately 1 out of every 15 executions) due to a 502 Bad Gateway error. I investigated the root cause of this issue, i the URLPR, I increased the timeout for the migration process, which partially addresses the problem. However, to enhance the stability of the System Under Test (SUT), we need a more maintainable solution. The normal execution flow includes logging the following information in the order-api: Now listening on: http://0.0.0.0:80 This indicates that the Microsoft.eShopOnContainers.Service.OrdersDB database has been created and populated with data. However, depending on the system load, there can be service desynchronization, causing the Backend for Frontends (BFF) to experience failures and report incorrect redirections. The failures are reported as migration timeout exceptions, as shown in the logs below:

fail: Microsoft.EntityFrameworkCore.Database.Command[20102]
      Failed executing DbCommand (150,167ms) [Parameters=[], CommandType='Text', CommandTimeout='150']
      IF SERVERPROPERTY('EngineEdition') <> 5
      BEGIN
          ALTER DATABASE [Microsoft.eShopOnContainers.Services.OrderingDb] SET READ_COMMITTED_SNAPSHOT ON;
      END;
Unhandled exception. Microsoft.Data.SqlClient.SqlException (0x80131904): Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
 ---> System.ComponentModel.Win32Exception (258): Unknown error 258
   at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
   at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.ThrowExceptionAndWarning(Boolean callerHasConnectionLock, Boolean asyncClose)
   at Microsoft.Data.SqlClient.TdsParserStateObject.CheckThrowSNIException()
   at Microsoft.Data.SqlClient.SqlCommand.CheckThrowSNIException()
   at Microsoft.Data.SqlClient.SqlCommand.InternalEndExecuteNonQuery(IAsyncResult asyncResult, Boolean isInternal, String endMethod)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryInternal(IAsyncResult asyncResult)
   at Microsoft.Data.SqlClient.SqlCommand.EndExecuteNonQueryAsync(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location ---
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Storage.RelationalCommand.ExecuteNonQueryAsync(RelationalCommandParameterObject parameterObject, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.MigrationCommandExecutor.ExecuteNonQueryAsync(IEnumerable`1 migrationCommands, IRelationalConnection connection, CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.CreateAsync(CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.SqlServer.Storage.Internal.SqlServerDatabaseCreator.CreateAsync(CancellationToken cancellationToken)
   at Microsoft.EntityFrameworkCore.Migrations.Internal.Migrator.MigrateAsync(String targetMigration, CancellationToken cancellationToken)
   at Program.<Main>$(String[] args) in /src/Services/Ordering/Ordering.API/Program.cs:line 67
   at Program.<Main>(String[] args)
ClientConnectionId:edf88c5d-4e04-4255-91ac-92210fee7c84
Error Number:-2,State:0,Class:11

The database error log also shows that it is not ready:

2024-05-22 12:57:32.46 Logon       Login failed for user 'sa'. Reason: Failed to open the explicitly specified database 'Microsoft.eShopOnContainers.Services.OrderingDb'. [CLIENT: 172.22.0.68]
2024-05-22 12:57:32.49 Logon       Error: 18456, Severity: 14, State: 38.

For more details, please refer to the following logs:

I attempted to fix the issue yesterday in this branch. I tried the following approaches:

Modify the SQL database health check: I attempted to query the database for the existence of the five tables. However, this approach was not effective because the various APIs, which are waiting for the database to be healthy, are responsible for creating the data and model. This led to a deadlock in the SUT, as the first service was waiting for a state that was never going to occur.
Check the status prior to executing the test cases: This was not useful because the system failure occurs beforehand. Although the test cases would run (with the database having the necessary tables), the involved services would not respond due the failure
Create a custom health check in the order-api service: I spent most of my time on this approach. My idea was to modify the order-api Dockerfile as follows:
```
# Custom health check for RETORCH parallelization
# Copy the health check script
COPY "Services/Ordering/Ordering.API/checklogs.sh" "/usr/local/bin/checklogs.sh"

# Make the script executable
RUN chmod +x "/usr/local/bin/checklogs.sh"
```
In the checklogs script, I tried to grep the message:
```
#!/bin/bash
if docker logs $HOSTNAME 2>&1 | grep -q "Now listening on: http://0.0.0.0:80"; then
  exit 0
else
  exit 1
fi
```
However, I encountered the following problems:
1. The message is printed to the container's stdout and can be accessed with the docker logs command. I considered allowing the container to access the Docker socket and execute Docker commands, but I discarded this idea due to security reasons.
2. I looked for a file (e.g., in /var/logs) containing the stdout content without success.
3. After discarding the first two options, I tried modifying the service to configure its logging system to also log information messages to a file by changing the appsettings.json as follows:
```
{
"Logging": {
  "LogLevel": {
    "Default": "Information",
    "Microsoft": "Warning",
    "Microsoft.Hosting.Lifetime": "Information"
  },
  "File": {
    "Path": "/logs/myapp.log",
    "LogLevel": {
      "Default": "Information"
    }
  }
}
}
```
However, this approach was also unsuccessful.

Do you have some idea about how I can deal it in a proper way @javiertuya ?

@augustocristian I don't clearly understand the problem, but if you have a bad gatewy error, the problem seems to be located at some proxy, or at the backend, not at the database. What is the architecture to connect from a frontend to the backend?

The system architecture is illustrated in the following diagram:

Our test suite validates the entire system starting from the webmvc frontend, which means using the Backend of Frontends webshoppingagg. In the official diagram (above), looks that the microservices follow the Single Responsibility Principle, each with a separate database instance. However, in reality, all the microservices are coupled and share the same container (sqldata_nametjob) where the different databases are housed.

Initially, the system's flakiness was due to the database not being ready to accept connections when the various APIs (identity, catalog, etc.) start querying and creating tables. This issue was resolved with a database health check, ensuring that the APIs now wait until the database is ready. The current problem is with the database migration (operations performed by different APIs) and the frontend. Under heavy load, the migration and creation of different databases take a significant amount of time. Consequently, the frontend and, by extension, the backend for the frontend (webshoppingagg) start querying with no results. To address this, I was customizing the health check to monitor the readiness of the service, ensuring that the frontend and the rest of the services are not deployed or run until everything is ready. I have debugged and found evidences (as described above) that this is the root of the problem, but there may be other underlying issues that I have not yet identified. Some more info: The stacktrace of the frontend starts reporting:

info: Yarp.ReverseProxy.Forwarder.HttpForwarder[48]
      UpgradeResponseCanceled: Copying the upgraded response body was canceled.
      System.Threading.Tasks.TaskCanceledException: The operation was canceled.
       ---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
       ---> System.Net.Sockets.SocketException (125): Operation canceled
         --- End of inner exception stack trace ---
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
         at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
         at System.Net.Http.HttpConnection.ReadAsync(Memory`1 destination)
         at System.Net.Http.HttpConnection.RawConnectionStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.RawConnectionStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
         at Yarp.ReverseProxy.Forwarder.StreamCopier.CopyAsync(Stream input, Stream output, Int64 promisedContentLength, StreamCopierTelemetry telemetry, ActivityCancellationTokenSource activityToken, Boolean autoFlush, CancellationToken cancellation)

In webshopingagg is propagated:

info: Yarp.ReverseProxy.Forwarder.HttpForwarder[9]
      Proxying to http://basket_api_tjobfdebug3/api/v1/basket/22f5de56-b3e1-4da2-84bd-8518dfe3a9fb HTTP/2 RequestVersionOrLower no-streaming
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[56]
      Received HTTP/1.1 response 200.
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[48]
      RequestCanceled: The request was canceled before receiving a response.
      System.Threading.Tasks.TaskCanceledException: The operation was canceled.
       ---> System.IO.IOException: The response ended prematurely.
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         --- End of inner exception stack trace ---
         at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
         at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
         at Yarp.ReverseProxy.Forwarder.HttpForwarder.SendAsync(HttpContext context, String destinationPrefix, HttpMessageInvoker httpClient, ForwarderRequestConfig requestConfig, HttpTransformer transformer, CancellationToken cancellationToken)
info: Yarp.ReverseProxy.Forwarder.HttpForwarder[9]

On first moments I thought that the problem was with the basket, but everything worked_

 Microsoft.Hosting.Lifetime[14]
      Now listening on: http://0.0.0.0:81
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://0.0.0.0:80
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Development
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /app
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: GrpcBasket.BasketService[0]
      Begin grpc call BasketService.UpdateBasketAsync for buyer id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.Repositories.RedisBasketRepository[0]
      Basket item persisted successfully.
info: GrpcBasket.BasketService[0]
      Begin grpc call from method /BasketApi.Basket/GetBasketById for basket id 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb
info: Basket.API.IntegrationEvents.EventHandling.OrderStartedIntegrationEventHandler[0]
      Handling integration event: 45e9a666-99d0-4bb6-b861-1437f30a3210 - (OrderStartedIntegrationEvent { Id = 45e9a666-99d0-4bb6-b861-1437f30a3210, CreationDate = 05/22/2024 13:08:51, UserId = 22f5de56-b3e1-4da2-84bd-8518dfe3a9fb })
info: GrpcBasket.BasketService[0]

Here are the complete logs comparing the Ordering API (HEALTHY) service with the ones that are causing failures. Additionally, the log from the proxy, which continuously reports the Ordering API as unhealthy, is also attached. webshoppingagg_tjobfdebug3.log webshoppingagg_tjobfdebug3.log ordering_api_tjobehealthy.log

@augustocristian If the current problem is with the database migration, you can take two different approaches:

Wait for the migrations at the container: The container entrypoint should start the database and run the migrations before reporting health status ok. I've seen other solutions that use a dedicted container to perform the migrations, this container is placed in the middle of the docker compose startup dependencies between backend and database.
The second approach is to move this to the tests. Not ideal, but as you are checking the existence tables before running the test, this check would be replaced by a check and run for pending migrations. This is a few lines using EF. Of course, in this case, the backend must be configured to do not execute migrations on startup.

@javiertuya I've implemented a version of the first solution you suggested. The ASPNET services have an internal health check available at the hostname/hc address. This health check allows us to halt execution if the order-api container is not ready, effectively moving the failure from execution stage to the setup stage. I've also modified the RETORCH tool scripts to enable debugging in this case. Previously, if the compose failed, the logs were not stored.

Two types of flakiness remain:

The order API flakiness is reduced with the health check. At least the failure is now shown during the setup of the SUT rather than in the tests themselves. (If the system is not ready, it doesn't make sense to execute the tests)
Another failure appears during longer executions of the order tests and is due to the behavior of the payment system itself. The orders automatically change state over time from awaiting validation → stock confirmed → pending → paid. This transition doesn't depend on any external factors. Should I modify or add a delay to the SUT to slow down this process and ensure the tests pass even when execution is slow? @javiertuya The changes performed are available in the PR #83

giis-uniovi / retorch-st-eShopContainers

Test Flakiness due to not syncronization order-api with other services #80