sheharyarn / memento

Simple + Powerful interface to the Mnesia Distributed Database 💾
http://hexdocs.pm/memento/
MIT License
734 stars 23 forks source link

Getting Mnesia Operation Failed cyclic #2

Closed indocomsoft closed 5 years ago

indocomsoft commented 5 years ago

I'm using que without persistence. Currently after upgrading Que from 0.5.0 to 0.7.0 which changed dependency from mnesia to memento, I'm getting these errors:

Elixir.Memento.MnesiaException: Mnesia operation failed
   {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.5537.0>}}
   Mnesia Error: {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.5537.0>}}
  File "lib/memento/query/query.ex", line 728, in Memento.Query.autoincrement_key_for/1
  File "lib/memento/query/query.ex", line 701, in Memento.Query.prepare_record_for_write!/2
  File "lib/memento/query/query.ex", line 250, in Memento.Query.write/2
  File "lib/que/persistence/mnesia/db.ex", line 144, in anonymous fn/1 in Que.Persistence.Mnesia.DB.Jobs.update_job/1
  File "mnesia_tm.erl", line 836, in :mnesia_tm.apply_fun/3
  File "mnesia_tm.erl", line 812, in :mnesia_tm.execute_transaction/5
  File "lib/memento/transaction.ex", line 71, in Memento.Transaction.execute/2
  File "lib/memento/transaction.ex", line 84, in Memento.Transaction.execute!/2
  Module "Elixir.Que.Server", in Que.Server.init/1

Seems like there's a cyclic error in the transaction

sheharyarn commented 5 years ago

Hi @indocomsoft. Can you provide an MCVE? What did you call before getting this error? Anything that can point me to the cause of the problem will be really helpful.

indocomsoft commented 5 years ago

Currently I haven't yet had the time to create an MCVE. The code is open source, I'll guide you through it:

https://github.com/source-academy/cadet/blob/master/lib/cadet/jobs/autograder/utilities.ex#L21 Here I call Que.add(Cadet.Autograder.LambdaWorker, %{question: question, answer: answer})

That worker will then call another worker (https://github.com/source-academy/cadet/blob/master/lib/cadet/jobs/autograder/lambda_worker.ex#L38) Que.add(ResultStoreWorker, %{answer_id: answer.id, result: result})

I noticed that there are several more errors that seem to be related to Que persistence. This regression occurs only after upgrading Que from 0.5.0 to 0.7.0.

Seems like there are two kinds of error, :read and :write.

It is weird that Que.Persistence.Mnesia.DB.jobs is even called since I did not set up any config regarding persistence. The impression that I have is that if I don't set up anything, by default there's no persistence.

19:00:03.275 [error] GenServer {Que.Server, Cadet.Autograder.ResultStoreWorker} terminating
** (Memento.MnesiaException) Mnesia operation failed
   {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.2876.0>}}
   Mnesia Error: {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.2876.0>}}
    (memento) lib/memento/query/query.ex:728: Memento.Query.autoincrement_key_for/1
    (memento) lib/memento/query/query.ex:701: Memento.Query.prepare_record_for_write!/2
    (memento) lib/memento/query/query.ex:250: Memento.Query.write/2
    (que) lib/que/persistence/mnesia/db.ex:144: anonymous fn/1 in Que.Persistence.Mnesia.DB.Jobs.update_job/1
    (mnesia) mnesia_tm.erl:836: :mnesia_tm.apply_fun/3
    (mnesia) mnesia_tm.erl:812: :mnesia_tm.execute_transaction/5
    (memento) lib/memento/transaction.ex:71: Memento.Transaction.execute/2
    (memento) lib/memento/transaction.ex:84: Memento.Transaction.execute!/2
Last message (from #PID<0.3417.0>): {:add_job, Cadet.Autograder.ResultStoreWorker, %{answer_id: 142136, result: %{errors: [%{errors: [%{"errorType" => "runtime", "line" => 24, "location" => "student"}], grader_program: "redacted"}], grade: 0, status: :success}}}
19:00:03.276 [info] [Que] Spawning Server for worker: Cadet.Autograder.ResultStoreWorker
19:00:03.279 [error] Task #PID<0.3417.0> started from #PID<0.2876.0> terminating
** (stop) exited in: GenServer.call({:global, {Que.Server, Cadet.Autograder.ResultStoreWorker}}, {:add_job, Cadet.Autograder.ResultStoreWorker, %{answer_id: 142136, result: %{redacted}], grade: 0, status: :success}}}, 5000)
    ** (EXIT) an exception was raised:
        ** (Memento.MnesiaException) Mnesia operation failed
   {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.2876.0>}}
   Mnesia Error: {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, :______WHOLETABLE_____}, :read, :read, {:tid, 138, #PID<0.2876.0>}}
            (memento) lib/memento/query/query.ex:728: Memento.Query.autoincrement_key_for/1
            (memento) lib/memento/query/query.ex:701: Memento.Query.prepare_record_for_write!/2
            (memento) lib/memento/query/query.ex:250: Memento.Query.write/2
            (que) lib/que/persistence/mnesia/db.ex:144: anonymous fn/1 in Que.Persistence.Mnesia.DB.Jobs.update_job/1
            (mnesia) mnesia_tm.erl:836: :mnesia_tm.apply_fun/3
            (mnesia) mnesia_tm.erl:812: :mnesia_tm.execute_transaction/5
            (memento) lib/memento/transaction.ex:71: Memento.Transaction.execute/2
            (memento) lib/memento/transaction.ex:84: Memento.Transaction.execute!/2
    (elixir) lib/gen_server.ex:836: GenServer.call/3
    (elixir) lib/task/supervised.ex:88: Task.Supervised.do_apply/2
    (stdlib) proc_lib.erl:249: :proc_lib.init_p_do_apply/3
Function: #Function<3.1282180/0 in Que.Job.perform/1>
    Args: []
19:00:03.298 [error] GenServer {Que.Server, Cadet.Autograder.LambdaWorker} terminating
** (Memento.MnesiaException) Mnesia operation failed
   {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, 89}, :write, :write, {:tid, 143, #PID<0.3452.0>}}
   Mnesia Error: {:cyclic, :"cadet@127.0.0.1", {Que.Persistence.Mnesia.DB.Jobs, 89}, :write, :write, {:tid, 143, #PID<0.3452.0>}}
    (memento) lib/memento/query/query.ex:254: Memento.Query.write/2
    (que) lib/que/persistence/mnesia/db.ex:144: anonymous fn/1 in Que.Persistence.Mnesia.DB.Jobs.update_job/1
    (mnesia) mnesia_tm.erl:836: :mnesia_tm.apply_fun/3
    (mnesia) mnesia_tm.erl:812: :mnesia_tm.execute_transaction/5
    (memento) lib/memento/transaction.ex:71: Memento.Transaction.execute/2
    (memento) lib/memento/transaction.ex:84: Memento.Transaction.execute!/2
    (que) lib/que/server.ex:86: Que.Server.handle_call/3
    (stdlib) gen_server.erl:661: :gen_server.try_handle_call/4
sheharyarn commented 5 years ago

I think I understand what the issue is now. Let me get back to you after I test out some things.

indocomsoft commented 5 years ago

Alright sure. Let me know if you have any update!

sheharyarn commented 5 years ago

Hey @indocomsoft, I was unable to reproduce the bug, though I now have a general idea of why it might be occurring (basically the way Mnesia uses exits to signal transaction locks). Unless I can reproduce the issue reliably, I can't ensure my fix will work properly.

Would it be possible for you to generate a minimal example that uses Memento the way you are currently using while still reproducing the issue?

indocomsoft commented 5 years ago

I'm currently quite busy, will do so when I have more time! Thank you for investigating this issue

bsechter-unext commented 5 years ago

I am getting the same error while performing load testing with Tsung at 100 or so RPS.

Additionally, I found this. http://erlang.org/pipermail/erlang-questions/2013-January/071743.html

I see a catch on line 22 of lib/memento/mnesia.ex. Could this be the problem?

EDIT: Another note, when running with two nodes, the first one crashes pretty fast. The second one lives longer but also tends to also die eventually. This may be a problem with my setup, but I figured it was worth mentioning.

indocomsoft commented 5 years ago

@sheharyarn https://github.com/indocomsoft/memento_mvce here's the MVCE. Sorry it took so long!

Just clone and run mix deps.get && mix test.

bsechter-unext commented 5 years ago

At this point, I am 99% sure there is a problem with the try/catch architecture Memento uses. Every single operation that belongs in a transaction appears to be wrapped in a try/catch clause. https://stackoverflow.com/questions/8099261/unintentionally-intercepting-mnesias-transactional-retries-with-try-catch-resul

EDIT: No real problems for low RPS applications. Somewhere between 10 RPS and 100 RPS the overall success rate drops from over 99% to under 3% when exposed through a GraphQL API. The problem is the cyclic error.

indocomsoft commented 5 years ago

This is how amnesia handled it https://github.com/meh/amnesia/blob/master/lib/amnesia/helper.ex#L39 where it reraised the exception

I would write a PR to fix this, but I'm not that familiar with mnesia and amnesia.

@bsechter-unext what do you think { :aborted, { :amnesia, { :cancel, result } } } is?

sheharyarn commented 5 years ago

@indocomsoft @bsechter-unext Thank you guys for looking into this further, and I apologize for the late reply. I did a little digging of my own and also believe that the cause is the try/catch block. I'll try a couple of things over the weekend to get this to work and get back with an update.

indocomsoft commented 5 years ago

Oh, my bad. Apparently for amnesia, it only wraps try/catch for transaction. And {:aborted, {:amnesia, {:cancel, result}} pattern matching is for the case when the transaction is cancelled.

bsechter-unext commented 5 years ago

@bsechter-unext what do you think { :aborted, { :amnesia, { :cancel, result } } } is?

To me, this looks like an abort scenario that amnesia controls.

jiachen247 commented 5 years ago

Hi all, sorry is there any update on this issue? @sheharyarn

sheharyarn commented 5 years ago

Hey guys. Sorry about the delay. I've been occupied with some personal issues the past few months and haven't had a chance to focus on open-source outside of work.

I investigated this a bit and ran some small experiments so I do have a general idea on how to fix this, but just didn't get the time to implement and load-test it. Hopefully, I'll get some time over the weekend and will try to get back to you guys with an update.

I apologize again for delaying the fix.

bsechter-unext commented 5 years ago

FYI, I rewrote with direct :mnesia calls and the the problem went away. I did not recall using try/catch. If I did, it was outside of the transaction.

indocomsoft commented 5 years ago

@bsechter-unext do you mind sharing your fork?

bsechter-unext commented 5 years ago

@indocomsoft I did not fork Memento. I rewrote my data module using direct calls to :mnesia instead.

sheharyarn commented 5 years ago

I've made a few changes in #10 and I believe they've resolved the issue. I ran some basic load tests and performed operations in nested transactions and everything seems to be working fine. The mvce provided by @indocomsoft is also working with the new changes without any issues.

But before I merge the branch and publish a new release, I'd appreciate if you guys could try it out too and let me know if everything works properly.

Just update your dependencies to use the branch:

{:memento, github: "sheharyarn/memento", branch: "bugfix/mnesia-cyclic-exits", override: true}
indocomsoft commented 5 years ago

I tried it and it seems to work! 🎉

Thanks for the fix, we can finally start upgrading to Elixir 1.7 once the package is released to hex

jiachen247 commented 5 years ago

@sheharyarn thank you!! :-)

sheharyarn commented 5 years ago

Published a new release of memento with the changes (v0.3.0).

Let me know if you encounter this (or another) issue again. Thanks!

sheharyarn commented 5 years ago

Collecting important links and resources for future reference:

jiachen247 commented 5 years ago

Can the que package on hex be updated to this new version of memento as well?

sheharyarn commented 5 years ago

@jiachen247 Published v0.9 of que on hex.

mhsdef commented 5 years ago

@sheharyarn this is still a problem at high load.

I don't have time right now to setup a full repro, but here's the basic setup that can trigger the :cyclic error.

  1. Setup at least two nodes with RAM only copies.
  2. Write repeatedly to a particular record in a set from one node. Say, for 1..100000, do: increment a numeric column in the record by one on each write.
  3. Do some gets of that same record from the other node while that's happening.

You should be able to blow it up pretty easily.

** (Memento.MnesiaException) Mnesia operation failed
   {:cyclic, :"a@127.0.0.1", {MyTable, 42}, :write, :write, {:tid, 195134, #PID<15445.481.0>}}
   Mnesia Error: {:cyclic, :"a@127.0.0.1", {MyTable, 42}, :write, :write, {:tid, 195134, #PID<15445.481.0>}}
    (memento) lib/memento/transaction.ex:71: Memento.Transaction.execute/2
    (memento) lib/memento/transaction.ex:84: Memento.Transaction.execute!/2

It runs fine when I comment out the try/catch in call_and_catch

      # try do
      apply(:mnesia, fun, args)
      # catch
      #  :exit, error -> Memento.Error.raise_from_code(error)
      # end
mhsdef commented 5 years ago

UPDATE: possibly an artifact of other logic in my app. I seem to have it running smoothly under load--with the try/catch in place--after refactoring some other things.

I'll circle back here if it blows up more and bring a test example if I do.

mhsdef commented 5 years ago

Hitting more. I think I've got a lead on what's going on.

Nested transactions under significant load.