giis-uniovi / retorch-st-eShopContainers

RETORCH eShopContainers End-to-End Test Suite
Apache License 2.0

Failed or Flaky TJobC after an update #56

Closed javiertuya closed 5 months ago

javiertuya commented 6 months ago

@augustocristian @ClaudiodelaRiva The build of the combined update #55 was all green, but the build failed after merging into master. I ran two more executions on master and they failed too. All failures are located in the same TJob.

augustocristian commented 6 months ago

I've been looking into this issue through the logs of the various containers. The root of the problem is the database that supplies information to the different services. When other containers start querying it, it isn't ready yet, so the services that query before the tables exist fail and become inaccessible. [image] [image]

How can we address this? An explicit wait is neither clean nor effective: as parallelism increases, the wait has to be extended, which is not practical. While all containers in eShopContainers have a health check at URL/hc, it gives no insight into the state of the database; it only confirms the status of the service itself, always reporting 'Healthy' when the service is operational. @javiertuya

javiertuya commented 6 months ago

@augustocristian

giis-qabot commented 6 months ago

@augustocristian This is a reminder about this issue because it has not been updated for 10 days

augustocristian commented 6 months ago

This issue is related to problems with the proxies again. I need to do further research to determine what is causing the issue, whether it's a lack of resources, configuration files, or something else entirely.

augustocristian commented 5 months ago

Root cause located:

Unhandled exception. System.IO.IOException: The configured user limit (128) on the number of inotify instances has been reached, or the per-process limit on the number of open file descriptors has been reached.
   at System.IO.FileSystemWatcher.StartRaisingEvents()
   at Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher.TryEnableFileSystemWatcher()
   at Microsoft.Extensions.FileProviders.Physical.PhysicalFilesWatcher.CreateFileChangeToken(String filter)
   at Microsoft.Extensions.Primitives.ChangeToken.OnChange(Func`1 changeTokenProducer, Action changeTokenConsumer)
   at Microsoft.Extensions.Configuration.FileConfigurationProvider..ctor(FileConfigurationSource source)
   at Microsoft.Extensions.Configuration.Json.JsonConfigurationSource.Build(IConfigurationBuilder builder)
   at Microsoft.Extensions.Configuration.ConfigurationManager.AddSource(IConfigurationSource source)
   at Microsoft.Extensions.Configuration.ConfigurationManager.Microsoft.Extensions.Configuration.IConfigurationBuilder.Add(IConfigurationSource source)
   at Microsoft.Extensions.Hosting.HostingHostBuilderExtensions.ApplyDefaultAppConfiguration(HostBuilderContext hostingContext, IConfigurationBuilder appConfigBuilder, String[] args)
   at Microsoft.Extensions.Hosting.HostApplicationBuilder..ctor(HostApplicationBuilderSettings settings)
   at Microsoft.AspNetCore.Builder.WebApplicationBuilder..ctor(WebApplicationOptions options, Action`1 configureDefaults)
   at Microsoft.AspNetCore.Builder.WebApplication.CreateBuilder(String[] args)
   at Program.<Main>$(String[] args) in /src/ApiGateways/Mobile.Bff.Shopping/aggregator/Program.cs:line 1
   at Program.<Main>(String[] args)

Trying to address it: https://github.com/dotnet/AspNetCore.Docs/issues/19814
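For reference, a minimal sketch for checking whether the Docker host is actually close to the inotify limit named in the exception. It assumes a Linux host with GNU find. The function names are made up for illustration, the count is only a rough figure because it covers every process visible to the caller, and the mitigations in the trailing comments are common options rather than a record of what was applied in this repository.

```shell
# Sketch only: inspect inotify usage on the Docker host (Linux).

# Print the per-user limit on inotify instances. The optional path argument
# exists only to make the function testable; it defaults to the kernel file.
inotify_limit() {
  cat "${1:-/proc/sys/fs/inotify/max_user_instances}"
}

# Count inotify instances open in all processes visible to the caller.
# Each instance shows up as an fd symlink pointing at "anon_inode:inotify".
inotify_in_use() {
  find /proc/[0-9]*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l
}

# Succeed when the number in use has reached the given limit.
inotify_exhausted() {
  local in_use="$1"
  local limit="$2"
  [ "$in_use" -ge "$limit" ]
}

# Example:
#   if inotify_exhausted "$(inotify_in_use)" "$(inotify_limit)"; then
#     echo "inotify instances exhausted" >&2
#   fi
#
# Possible mitigations (assumptions, not taken from this thread):
#   sudo sysctl -w fs.inotify.max_user_instances=512   # raise the host limit; 512 is arbitrary
#   DOTNET_USE_POLLING_FILE_WATCHER=true               # container env var: .NET polls instead of using inotify
```

Containers share the host kernel, so unless user namespaces are in use, many .NET services started in parallel draw from the same per-user pool of inotify instances.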

augustocristian commented 5 months ago

The system is now significantly more stable. Running the problematic test case as five parallel instances now produces far fewer flaky failures: [image]

When a failure does occur, the system is in a healthier state, and the Selenium screenshots and recordings look normal, e.g. ex 26t TJobC: [image] The current solution is in PR #65.

augustocristian commented 5 months ago

Some improvements: https://stackoverflow.com/questions/60539114/how-to-wait-for-mssql-in-docker-compose

augustocristian commented 5 months ago

I may not have found the best solution, but I managed to solve the problem. Initially, I attempted to use the health check feature in the docker-compose file, but this would only work if the data schema were loaded by the database container itself, whereas it is loaded by another service. In the end, I implemented a waiting directive in the waitingforSUT.sh script that dynamically waits both for the database to have products in it and for the frontend to be ready.

Running it several times to check that everything works as expected...

javiertuya commented 5 months ago

@augustocristian @ClaudiodelaRiva Are you referring to this?

    if curl --insecure -s "$WEB_SERVICE_URL" | grep -q "<div class=\"esh-catalog-item col-md-4\">"; then
      # Check if database service is up
      if docker exec "$DB_SERVICE_NAME" /opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P '**********' -Q "$QUERY" | grep -q "14"; then
        break
      fi
    fi

IMO this is not a good option, because it seems that:

Remember what I mentioned here https://github.com/giis-uniovi/retorch-st-eShopContainers/issues/56#issuecomment-2006745863: execute this kind of stuff from the test setup.

augustocristian commented 5 months ago

I've resolved the issues with the database. There is now a custom method that queries the problematic table every 5 seconds, so the test cases start only once the data is available. However, we're now running into a different kind of trouble (I discussed this with @ClaudiodelaRiva a couple of days ago). This problem used to arise in about 1 out of every 20 executions, but it is now causing the test suite to fail continuously.
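As an illustration of the polling idea, here is a minimal shell sketch of a bounded wait. It is not the custom method in the repository: `wait_until` is a hypothetical helper, and the 300-second timeout and the `SA_PASSWORD` variable in the commented usage are assumptions. The difference from a fixed sleep is that it returns as soon as the check passes and fails loudly when the deadline is reached.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds, or returns 1 once the timeout elapses.
wait_until() {
  local timeout="$1"
  local interval="$2"
  shift 2
  local deadline
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "Timed out after ${timeout}s waiting for: $*" >&2
      return 1
    fi
    sleep "$interval"
  done
}

# Hypothetical readiness checks built from the snippet quoted earlier in the thread.
# WEB_SERVICE_URL, DB_SERVICE_NAME and QUERY come from that snippet; SA_PASSWORD is assumed.
#   frontend_ready() { curl --insecure -s "$WEB_SERVICE_URL" | grep -q 'esh-catalog-item'; }
#   catalog_loaded() {
#     docker exec "$DB_SERVICE_NAME" /opt/mssql-tools/bin/sqlcmd \
#       -S localhost -U SA -P "$SA_PASSWORD" -Q "$QUERY" | grep -q "14"
#   }
#   wait_until 300 5 frontend_ready && wait_until 300 5 catalog_loaded
```

The same loop could equally live in the test setup, as suggested above. The shell form is used here only because the quoted snippet is shell.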

Within our system, we utilize two proxies: one for the mobile front end and another for both the MVC and SPA fronts. These proxies require a custom file containing the different routes of the services.

Previously, when I began to parallelize the system, I created a separate directory containing the customized envoy.yml configuration file. This file is then mounted to each container at $WORKINGDIR/sut/src/tmp/$TJobname/mobile or /web.

The application is failing because although the file is correctly copied and present when the compose starts, the envoy proxy reports that the file doesn't exist.

[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:238] initializing epoch 0 (hot restart version=11.104)
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:240] statically linked extensions:
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:242]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:245]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:248]   filters.listener: envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:251]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:253]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:255]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.zipkin
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:258]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:261]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.tap,raw_buffer,tls
[2024-04-21 20:35:44.724][1][info][main] [source/server/server.cc:267] buffer implementation: old (libevent)
[2024-04-21 20:35:44.739][1][critical][main] [source/server/server.cc:93] error initializing configuration '/etc/envoy/envoy.yaml': Invalid path: /etc/envoy/envoy.yaml
[2024-04-21 20:35:44.739][1][info][main] [source/server/server.cc:560] exiting
Invalid path: /etc/envoy/envoy.yaml

Following this error, everything begins to malfunction. I've attempted to relocate the configuration files to a different directory and started exploring how to create each image without using these bind volumes (instead using a custom volume onto which we copy the envoy.yml file), but I'm uncertain if this will resolve anything.

Additionally, I've thoroughly reviewed all the documentation and some closed issues in the Docker repository, but none provide any clues about what might be happening. I am unable to reproduce this error in "local" environments (both Linux and Windows). The content of the build-and-deploy.sh script is similar to that of the TJobs setup script. I cannot connect to the proxy containers because they are in the exited state from the start. @javiertuya
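A possible diagnostic sketch, offered as an assumption rather than a confirmed cause. One known way for a bind-mounted file to look missing inside a container is that the source path did not exist on the Docker host when the container was created, in which case the short volume syntax creates it as an empty directory. Whether that applies here is not established. The helper below (hypothetical name `check_config_file`) fails fast if the source file is missing, empty, or a directory. The commented commands show how an exited container can still be inspected without connecting to it. The host file name and the `<proxy-container>` placeholder are assumptions.

```shell
# Fail fast if the file to be bind-mounted is a directory, missing, or empty.
check_config_file() {
  local path="$1"
  if [ -d "$path" ]; then
    echo "ERROR: $path is a directory, not a file" >&2
    return 1
  fi
  if [ ! -s "$path" ]; then
    echo "ERROR: $path is missing or empty" >&2
    return 1
  fi
}

# Example call before starting the compose, assuming the path described above
# is the host side of the bind mount and the file is named envoy.yml:
#   check_config_file "$WORKINGDIR/sut/src/tmp/$TJobname/web/envoy.yml" || exit 1
#
# Inspecting an exited proxy container (replace <proxy-container> with its name):
#   docker inspect --format '{{json .Mounts}}' <proxy-container>       # actual source and destination paths
#   docker cp <proxy-container>:/etc/envoy/envoy.yaml - | tar -tvf -   # what the container really has at that path
```

Both `docker inspect` and `docker cp` work on stopped containers, so they can show which host path was really mounted even when the proxy exits immediately.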

giis-qabot commented 5 months ago

@augustocristian This is a reminder about this issue because it has not been updated for 10 days

augustocristian commented 5 months ago

The master and update branches now look much more stable with:

Docker version 26.1.2, build 211e74b
Docker Compose version v2.27.0
Jenkins agent jdk-17 (already fixed version)

augustocristian commented 5 months ago

(removed comment, commit is not available anymore)