teslamate-org / teslamate

A self-hosted data logger for your Tesla 🚘
https://docs.teslamate.org
MIT License
5.96k stars 741 forks source link

random test failures #3736

Open brianmay opened 8 months ago

brianmay commented 8 months ago

Is there an existing issue for this?

What happened?

Noticed lots of test failures when running tests locally. The found I have to start tests with clean database. Oops.

But even then, was getting errors - up to 7. Then I found if I ran test files individually, was more like to work.

At a hunch I made the following change:

diff --git a/test/teslamate/vehicles/vehicle/charging_test.exs b/test/teslamate/vehicles/vehicle/charging_test.exs
index 106d64df..26fb4f65 100644
--- a/test/teslamate/vehicles/vehicle/charging_test.exs
+++ b/test/teslamate/vehicles/vehicle/charging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.ChargingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.ChargingProcess
diff --git a/test/teslamate/vehicles/vehicle/streaming_test.exs b/test/teslamate/vehicles/vehicle/streaming_test.exs
index 20f54723..c43db381 100644
--- a/test/teslamate/vehicles/vehicle/streaming_test.exs
+++ b/test/teslamate/vehicles/vehicle/streaming_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.StreamingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   import ExUnit.CaptureLog

diff --git a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
index ae3d6677..f282debe 100644
--- a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendLoggingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Vehicles.Vehicle
diff --git a/test/teslamate/vehicles/vehicle/suspend_test.exs b/test/teslamate/vehicles/vehicle/suspend_test.exs
index b89ee958..d8561f03 100644
--- a/test/teslamate/vehicles/vehicle/suspend_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary

Now all the tests pass for me. I suspect we have race conditions with multiple tests interfering with the database at the same time. Yuck.

Not sure I like this, but I think as our tests access the database they are more like integration tests not unit tests, and as a result they really do need to be run sequentially.

Expected Behavior

Tests pass without failures.

Steps To Reproduce

MIX_ENV=test mix test

Relevant log output

(after fix applied)
$ MIX_ENV=test mix test
....................................................................................................................................................................................................................................................................................................................................
Finished in 23.3 seconds (4.4s async, 18.8s sync)
324 tests, 0 failures

Randomized with seed 392195

Screenshots

No response

Additional data

No response

Type of installation

Docker

Version

master

brianmay commented 8 months ago

This still doesn't fix all the intermittent failures, but I now only see 1, not up to 7:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 55174, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)
brianmay commented 8 months ago

Just saw 2 more test failures.

I think the async flag needs to be turned off for all the tests in this directory.

diff --git a/test/teslamate/vehicles/vehicle/charging_test.exs b/test/teslamate/vehicles/vehicle/charging_test.exs
index 106d64df..26fb4f65 100644
--- a/test/teslamate/vehicles/vehicle/charging_test.exs
+++ b/test/teslamate/vehicles/vehicle/charging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.ChargingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.ChargingProcess
diff --git a/test/teslamate/vehicles/vehicle/driving_test.exs b/test/teslamate/vehicles/vehicle/driving_test.exs
index 11ac512e..38078ccc 100644
--- a/test/teslamate/vehicles/vehicle/driving_test.exs
+++ b/test/teslamate/vehicles/vehicle/driving_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.DrivingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Log.Drive
diff --git a/test/teslamate/vehicles/vehicle/streaming_test.exs b/test/teslamate/vehicles/vehicle/streaming_test.exs
index 20f54723..c43db381 100644
--- a/test/teslamate/vehicles/vehicle/streaming_test.exs
+++ b/test/teslamate/vehicles/vehicle/streaming_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.StreamingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   import ExUnit.CaptureLog

diff --git a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
index ae3d6677..f282debe 100644
--- a/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_logging_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendLoggingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
   alias TeslaMate.Vehicles.Vehicle
diff --git a/test/teslamate/vehicles/vehicle/suspend_test.exs b/test/teslamate/vehicles/vehicle/suspend_test.exs
index b89ee958..d8561f03 100644
--- a/test/teslamate/vehicles/vehicle/suspend_test.exs
+++ b/test/teslamate/vehicles/vehicle/suspend_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.SuspendTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary

diff --git a/test/teslamate/vehicles/vehicle/updating_test.exs b/test/teslamate/vehicles/vehicle/updating_test.exs
index d26cefb5..fc1a8fd6 100644
--- a/test/teslamate/vehicles/vehicle/updating_test.exs
+++ b/test/teslamate/vehicles/vehicle/updating_test.exs
@@ -1,5 +1,5 @@
 defmodule TeslaMate.Vehicles.Vehicle.UpdatingTest do
-  use TeslaMate.VehicleCase, async: true
+  use TeslaMate.VehicleCase, async: false

   alias TeslaMate.Vehicles.Vehicle.Summary
brianmay commented 8 months ago

Still getting the above error, but only rarely, but appears to be much better now.

Sometimes tests come up with the following message:

14:27:33.383 [error] Postgrex.Protocol (#PID<0.411.0>) disconnected: ** (DBConnection.ConnectionError) client #PID<0.19679.0> exited

But this doesn't seem to matter.

JakobLichterfeld commented 8 months ago

Nice catch. Unreliable tests are a nightmare. I do see the asynchronous run approach to match reality therefore vote to turn asnyc off.

JakobLichterfeld commented 8 months ago

I pinned our CI to elixir-1.16.1-otp-26 as well and the tests run perfectly again, see #3733

JakobLichterfeld commented 8 months ago

But came back when running on master, so unreliable with async test run.

brianmay commented 8 months ago

Yes, I upgraded my box to elixir 1.16, and I still get lots of test failures.

For some reason CI is more likely to have passing tests. Tests on my box never work without turning async off. Wonder if this is because the multi-tasking performance of my box is superior.

brianmay commented 8 months ago

Except this one test, I sometimes get failures here still:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 11376, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)

But it is always this one test, so wondering if that test is somehow broken or racey.

brianmay commented 8 months ago

Wonder if that timeout is just too short. Would hope that 300ms is plenty. But will try to reproduce with a timeout of 1000ms.

Might be hard to verify, a successful run doesn't mean anything.

brianmay commented 8 months ago

New theory: maybe tests are actually failing due to assert_receive timing out. This might explain "Assertion failed, no matching message after 300ms" message, but not the one "Unexpectedly received message" message.

But don't particularly want to update each and every assert_receive and refute_receive line by hand. If only we could set this globally. Oh, wait, we can.

ELIXIR_ASSERT_TIMEOUT=2000 mix test

But failures are persisting. Got 3-4 failures. And messages are showing new timeouts.

Just to be sure, I made the following change:

diff --git a/test/test_helper.exs b/test/test_helper.exs
index 58b9a493..3e6227c3 100644
--- a/test/test_helper.exs
+++ b/test/test_helper.exs
@@ -13,4 +13,4 @@ TeslaMate.Repo.start_link()
 Phoenix.PubSub.Supervisor.start_link(name: TeslaMate.PubSub)

 assert_timeout = String.to_integer(System.get_env("ELIXIR_ASSERT_TIMEOUT") || "300")
-ExUnit.start(assert_receive_timeout: assert_timeout)
+ExUnit.start(assert_receive_timeout: assert_timeout, refute_receive_timeout: assert_timeout)

As it bothered me that we set one, but not the other.

ELIXIR_ASSERT_TIMEOUT=2000 mix test

Now got 51+ test failures. Oops. I guess that was a bad change to make. But I think it does show that our tests are very timing sensitive, and perhaps this is not a good thing.

I don't think any of this helps, but documenting in case it gives anyone ideas.

brianmay commented 8 months ago

I rewrote the above text when I realised that increasing refute_receive_timeout increases the number of test failures.

JakobLichterfeld commented 7 months ago

But I think it does show that our tests are very timing sensitive, and perhaps this is not a good thing.

Total agree. Before looking deeper into it, I always thought our tests were great.

brianmay commented 7 months ago

To be fair, writing effective and efficient tests when you have lots of processes interacting like this is... hard.

Not really what is considered good practise in fact. I just know how to criticise bad practise :-)

brianmay commented 7 months ago

Still getting some errors. I think a lot less though then before.

This one I just saw is weird: https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485

** (UndefinedFunctionError) function System.convert_time_unit/3 is undefined (module System is not available)
    (elixir 1.16.1) System.convert_time_unit(-576460[71](https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485#step:10:72)9104225362, :native, :millisecond)
    (db_connection 2.6.0) lib/db_connection/connection_pool.ex:357: DBConnection.ConnectionPool.start_idle/2
** (EXIT from #PID<0.99.0>) shutdown

    (db_connection 2.6.0) lib/db_connection/connection_pool.ex:205: DBConnection.ConnectionPool.drop_idle/6
    (stdlib 5.2.1) gen_server.erl:1095: :gen_server.try_handle_info/3
    (stdlib 5.2.1) gen_server.erl:1183: :gen_server.handle_msg/6
    (stdlib 5.2.1) proc_lib.erl:241: :proc_lib.init_p_do_apply/3
Last message: {:timeout, #Reference<0.821106270.461373444.45001>, -576460[72](https://github.com/teslamate-org/teslamate/actions/runs/8288206183/job/22682287560?pr=3485#step:10:73)0104900933}

Either the function exists or it doesn't exist. It can't just not exist occasionally...

brianmay commented 7 months ago

As expected, rebuild solved that.

brianmay commented 7 months ago

In case it is not clear, I am still seeing one test failure. Occasionally. Not easy to reproduce on demand however.

JakobLichterfeld commented 7 months ago

On CI I do not see any failures

brianmay commented 7 months ago

I think I saw this failure recently on CI. But can't find it right now.

brianmay commented 7 months ago

This is the random failure:

  1) test resumes logging when starting a drive (TeslaMate.Vehicles.Vehicle.StreamingTest)
     test/teslamate/vehicles/vehicle/streaming_test.exs:452
     Assertion failed, no matching message after 300ms
     The following variables were pinned:
       car = %TeslaMate.Log.Car{__meta__: #Ecto.Schema.Metadata<:built, "cars">, id: 49656, name: nil, efficiency: nil, model: "3", trim_badging: nil, marketing_name: nil, exterior_color: nil, wheel_type: nil, spoiler_type: nil, eid: 0, vid: 1000, vin: "1000", settings_id: nil, settings: %TeslaMate.Settings.CarSettings{__meta__: #Ecto.Schema.Metadata<:built, "car_settings">, id: nil, suspend_min: 999999999, suspend_after_idle_min: 15, req_not_unlocked: true, free_supercharging: false, use_streaming_api: true, car: #Ecto.Association.NotLoaded<association :car is not loaded>}, charging_processes: #Ecto.Association.NotLoaded<association :charging_processes is not loaded>, positions: #Ecto.Association.NotLoaded<association :positions is not loaded>, drives: #Ecto.Association.NotLoaded<association :drives is not loaded>, inserted_at: nil, updated_at: nil}
     The process mailbox is empty.
     code: assert_receive {:start_drive, ^car}
     stacktrace:
       test/teslamate/vehicles/vehicle/streaming_test.exs:478: (test)
JakobLichterfeld commented 7 months ago

CI got test failure as well here: https://github.com/teslamate-org/teslamate/actions/runs/8508853689/job/23303206117?pr=3800