Queries are still made in a failed transaction, leading to errors being thrown

rkachowski commented 1 year ago

Description

When an sql transaction fails, fly_postgres_elixir will still try to call select CAST(pg_current_wal_insert_lsn() AS TEXT) in the context of the transaction, resulting in the error

** (Postgrex.Error) ERROR 25P02 (in_failed_sql_transaction) current transaction is aborted, commands ignored until end of transaction block

This results in the process crashing and prevents the error response result of an Ecto.Multi from being made available to the application code and being handled appropriately.

Expected result

Transactions / Ecto.Multi works as with default App.Repo.Local on the primary
- Ecto.Multi returns with a standard {:error, stage, error_response, changes_so_far} tuple
The library doesn't try to wait for lsn on a failed transaction

Actual result

Postgrex error

Notes

Failed test output

  2) test register_owner/1 returns error if owner already exists (app.AdminTest)
     test/app/admin_test.exs:36
     ** (Postgrex.Error) ERROR 25P02 (in_failed_sql_transaction) current transaction is aborted, commands ignored until end of transaction block

         query: select CAST(pg_current_wal_insert_lsn() AS TEXT)
     code: assert {:error, changeset} = Admin.register_owner(@valid_registration_params)
     stacktrace:
       (ecto_sql 3.7.0) lib/ecto/adapters/sql.ex:756: Ecto.Adapters.SQL.raise_sql_call_error/1
       (fly_postgres 0.2.6) lib/lsn/lsn.ex:67: Fly.Postgres.LSN.current_wal_insert/1
       (fly_postgres 0.2.6) lib/fly_postgres.ex:192: Fly.Postgres.__rpc_lsn__/4
       (fly_postgres 0.2.6) lib/fly_postgres.ex:169: Fly.Postgres.rpc_and_wait/4
       (ecto 3.7.1) lib/ecto/multi.ex:716: Ecto.Multi.apply_operation/5
       (elixir 1.12.3) lib/enum.ex:2385: Enum."-reduce/3-lists^foldl/2-0-"/3
       (ecto 3.7.1) lib/ecto/multi.ex:690: anonymous fn/5 in Ecto.Multi.apply_operations/5
       (ecto_sql 3.7.0) lib/ecto/adapters/sql.ex:1013: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
       (db_connection 2.4.0) lib/db_connection.ex:1512: DBConnection.run_transaction/4
       (ecto 3.7.1) lib/ecto/repo/transaction.ex:20: Ecto.Repo.Transaction.transaction/4
       (explorer 0.0.1) lib/app/admin.ex:69: app.Admin.register_owner/1
       test/app/admin_test.exs:38: (test)

brainlid commented 1 year ago

Hi @rkachowski,

Version 0.3.0 was released and it changes how the library works internally. I believe the change will help this situation but I'm uncertain of the full impact.

I would love your feedback!

Please note that there are breaking changes specifically around configuration. None of your existing application code is impacted.

brainlid commented 1 year ago

After thinking further, this will still be a problem but I believe it can be handled. Thanks!

brainlid commented 1 year ago

@rkachowski,

I'm unable to reproduce the error. Can you provide a bit more of an example for your Multi? When the Multi process fails, is it failing because something else first raised an exception inside the Multi?

In general, a Multi should only be called inside an Fly.Postgres.rpc_and_wait/3 call. The reason is, the Repo.transaction call (which should be Repo.Local.transaction because it doesn't makes sense to proxy a transaction call in a stand-alone process) needs to be executed on the primary database.

If the whole Multi is executed on the primary, then any transaction rollbacks happen there and it's all fine.

I would like to reproduce this so I can catch the error and give a more instructive error message about what to do.

I'd also like to update the README docs with an example using a Multi and explicit transaction control. Basically, we can't start a transaction on the local replica but proxy write operations to the primary, execute queries, and expect it to be inside a single transaction.

I'd love an example of the types of operations you are doing, the Multi structure, and how it is being executed (RPC proxied or not).

rkachowski commented 1 year ago

Hey @brainlid let me update the library and try again.

In the above situation something does indeed fail first, the application code then intends to handle a failed transaction but does not get access to this due to the above in_failed_sql_transaction error triggered by the select CAST(pg_current_wal_insert_lsn() AS TEXT) query causing the process to die.

In a related issue, Fly.Repo.transaction will actually execute on the local repository instead of on the remote (as described in the documentation) - I have created a pull request to address this. https://github.com/superfly/fly_postgres_elixir/pull/32

rkachowski commented 1 year ago

fwiw I have patched this locally to forward the error to the application upon execution. I'm not particularly happy about the implementation but it works for me https://github.com/superfly/fly_postgres_elixir/commit/7af3777f73dbad3ae492a418bb7811fc8f295c11

brainlid commented 1 year ago

Hi @rkachowski,

Thanks for the PR, I'll check it out. I'd like to understand more about the situation that's demonstrates the problem. Even a comment/pseudo example would be helpful.

brainlid commented 1 year ago

fwiw I have patched this locally to forward the error to the application upon execution. I'm not particularly happy about the implementation but it works for me 7af3777

I don't recommend this solution. It changes the result of the proxied function that we (as the library) know nothing about. The executed function may correctly and intentionally perform it's action and return an {:error, "something else you should know"}. The forked code will break that.

I really want to understand the situation that's causing the problem. With that situation and a way to reproduce it, I would use a try..rescue around the call to get the WAL value and handle it that way.

rkachowski commented 1 year ago

Under the current implementation, when the executed function returns an {:error, reason} tuple this value is passed directly into wait_for_lsn resulting in the error outlined above.

I really want to understand the situation that's causing the problem. With that situation and a way to reproduce it, I would use a try..rescue around the call to get the WAL value and handle it that way.

The wait_for_lsn function does not expect a failed result, which is in the set of possible outcomes for a database call. rpc_and_wait has nothing to wait for if the rpc has failed, which is a valid state for the previous operation and currently not a handled case. There is no WAL value to wait for when there is no change to replicate.

I am not happy with the solution for the reason you mentioned as changing the signature has significant consequences, yet the current api is based around a false assumption that there is always something to wait on. There is no consideration of transaction failure.

brainlid commented 1 year ago

Hi @rkachowski,

I'm using version 0.3.0 and try as I might, I can't create a situation where it fails. The wait_for_lsn function was refactored in v0.3.0 and doesn't exist anymore.

I'm creating an explicit transaction and rolling it back but not getting errors. I tried doing it on the primary through a Fly.Postgres.rpc_and_wait and without that. I've tried doing it through a failed Ecto.Multi.

Can you please upgrade to 0.3.0 and provide some code samples with problematic usage? It doesn't have to be compiling code even.

Here's my usage. In a LiveView from a replica region:

  def add_template_locations_transaction_fail(location_ids, account_id) do
    # NOTE: This returns an `{:error, :posting_not_allowed}` result.
    Repo.transaction(fn ->
      location_ids
      |> TemplateLocations.locations()
      |> Enum.each(fn location ->
        {:ok, _location} = Locations.create_location(account_id, Location.to_params(location))
      end)

      Logger.warn("INTENTIONALLY FAILED TRANSACTION")

      Repo.rollback(:posting_not_allowed)

      # `rollback/1` stops execution, so code here won't be run
    end)
  end

  # Function trigged by the LiveView that performs the inserts that fails.
  def handle_info({:add_template_locations, location_ids, "multi"}, socket) do
    # NOTE: I tried this version as well where it executes it directly 
    #       on the local replica without going to the primary.
    # result = add_template_locations_transaction_fail(location_ids, socket.assigns.current_account.id)

    result =
      Fly.Postgres.rpc_and_wait(__MODULE__, :add_template_locations_transaction_fail, [
        location_ids,
        socket.assigns.current_account.id
      ])

    IO.inspect(result, label: "FUNCTION RESULT?")

    {:noreply, socket}
  end

It's possible that the updated version solved the issue but it's also possible that I'm just missing something that you're doing. I'd love to help resolve the issue but I need some help.

rkachowski commented 1 year ago

Hey @brainlid, thanks for your attention. I will try to update and create a reduced test case when I get the chance, the actual issue I'm encountering is in quite a complicated transaction / multi and I want to get it down to size.

rkachowski commented 1 year ago

The error still seems to exist on 0.3.0

  1) test register_owner/1 returns error if owner already exists (Explorer.AdminTest)
     test/explorer/admin_test.exs:36
     ** (Postgrex.Error) ERROR 25P02 (in_failed_sql_transaction) current transaction is aborted, commands ignored until end of transaction block

         query: select CAST(pg_current_wal_insert_lsn() AS TEXT)
     code: assert {:error, changeset} = Admin.register_owner(@valid_registration_params)
     stacktrace:
       (ecto_sql 3.7.1) lib/ecto/adapters/sql.ex:760: Ecto.Adapters.SQL.raise_sql_call_error/1
       (fly_postgres 0.3.0) lib/lsn/lsn.ex:78: Fly.Postgres.LSN.current_wal_insert/1
       (fly_postgres 0.3.0) lib/fly_postgres.ex:200: Fly.Postgres.__rpc_lsn__/4
       (fly_postgres 0.3.0) lib/fly_postgres.ex:170: Fly.Postgres.rpc_and_wait/4
       (explorer 0.0.1) lib/explorer/admin.ex:58: anonymous fn/1 in Explorer.Admin.register_owner/1
       (ecto_sql 3.7.1) lib/ecto/adapters/sql.ex:1021: anonymous fn/3 in Ecto.Adapters.SQL.checkout_or_transaction/4
       (db_connection 2.4.2) lib/db_connection.ex:1562: DBConnection.run_transaction/4
       test/explorer/admin_test.exs:38: (test)

Here is a clunky test case that causes the same issue

  describe "fly postgres" do
    test "shouldn't allow duplicate names" do
      create_table = "CREATE TABLE awesome_names (name TEXT)"
      unique_index = "CREATE UNIQUE INDEX unique_names_constraint ON awesome_names (name)"

      Ecto.Adapters.SQL.query!(Repo.Local, create_table, [])
      Ecto.Adapters.SQL.query!(Repo.Local, unique_index, [])

      defmodule Name do
        use Ecto.Schema
        import Ecto.Changeset

        @primary_key false
        schema "awesome_names" do
          field :name, :string
        end

        def changeset(name, params \\ %{}) do
          name
          |> cast(params, [:name])
          |> unique_constraint(:name, name: :unique_names_constraint)
        end

        def insert(name) do
          %Name{}
          |> Name.changeset(%{name: name})
          |> Repo.insert()
        end
      end

      Name.insert("admin")

      # process will die here due to `(Postgrex.Error) ERROR 25P02 (in_failed_sql_transaction)`
      # and error value will not be made available to the application
      assert {:error, changeset} = Repo.transaction(fn ->
        with {:ok, inserted} <- Name.insert("admin") do
          inserted
        else
          {:error, error} ->
            Repo.rollback(error)
        end
      end)
    end
  end

brainlid commented 1 year ago

@rkachowski,

Thanks for the test case! I'll dig into it. :+1:

brainlid commented 1 year ago

@rkachowski,

Thank you for the error replication! It was very helpful.

I believe I have a fix ready and hope you can test it out before I make a new release.

To test it out, in your mix.exs change it like this...

      # {:fly_postgres, "~> 0.3.0"},
      {:fly_postgres, git: "https://github.com/superfly/fly_postgres_elixir.git", branch: "me-handle-db-errors"},

Then run:

mix deps.unlock fly_postgres
mix deps.get

And run your application.

Thanks!

rkachowski commented 1 year ago

Hey @brainlid, the fix works and the tests pass. Thanks again!

brainlid commented 1 year ago

Fixed in version v0.3.1

superfly / fly_postgres_elixir