revelrylabs / elixir-nodejs

An Elixir API for calling Node.js functions
MIT License
214 stars 31 forks source link

Implementation leaks still there (2.0.0) #51

Open Zinggi opened 4 years ago

Zinggi commented 4 years ago

Oh no, I made it worse :scream:

It looks like my PR that was supposed to get rid of implementation leaks introduced some new ones. I should have tested my changes thoroughly with our production code...

It looks like my GenServer that calls NodeJS.call(..) is sometimes getting back the messages sent by Port:

{:ok, [message: {#Port<0.2131>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.1656.0>]}

Also, I'm still getting the timeout messages in my mailbox:

** (FunctionClauseError) no function clause matching in CallerGenServer.handle_info/2
    (foo 1.1.0) lib/foo/callerGenServer.ex:91: CallerGenServer.handle_info({#Reference<0.221168823.2908225537.225316>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3

I suspect some race condition with timeouts...

Also, we might need to add a wrapper GenServer that catches all these messages such that the caller is truly isolated from the worker process.


Unfortunately I wont have much time to investigate exactly when and how it happens, as I'm soon leaving for vacations.
Otherwise I would solve it myself and contribute the solution. I feel obliged to fix it, but I won't have the time (earliest after ca. 4 weeks).

I'm terribly sorry about that.

oohnoitz commented 4 years ago

@Zinggi could you create a test repo that can easily reproduce this issue?

Zinggi commented 4 years ago

Yeah, that's the least I can do. Hope I can make it reproducible...

Zinggi commented 4 years ago

@oohnoitz Alright, I managed to reproduce the issue :relieved: : https://github.com/Zinggi/elixir-nodejs-implementation-leaks

Sample log output:

before call
result
{:error, "Call timed out."}
after call

17:54:37.361 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892929.143995>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:38.575 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}
after call

17:54:40.612 [error] GenServer Leaks.GenServer terminating
** (FunctionClauseError) no function clause matching in Leaks.GenServer.handle_info/2
    (elixir_nodejs_leaks 0.1.0) lib/leaks/gen_server.ex:20: Leaks.GenServer.handle_info({#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}, nil)
    (stdlib 3.13) gen_server.erl:680: :gen_server.try_dispatch/4
    (stdlib 3.13) gen_server.erl:756: :gen_server.handle_msg/6
    (stdlib 3.13) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message: {#Reference<0.4098332406.3456892932.142055>, {:error, :timeout}}
State: nil
before call
result
{:ok, 42}
after call

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, []}}}, module: NodeJS.Worker, name: #PID<0.186.0>]

17:54:41.817 [error] [message: {#Port<0.8>, {:data, {:eol, '__elixirnodejs__UOSBsDUP6bp9IF5__[true,42]'}}}, module: NodeJS.Worker, name: #PID<0.186.0>]
before call
result
{:error, "Call timed out."}

...

Hope it helps.

I might still be able to reply while on vacation, but I won't be able to do any coding

oohnoitz commented 4 years ago

Had some time to do a bit more digging. It's definitely a timeout issue, with a few other things. Timeous are definitely awkward with the way GenServer works and still a bit confusing for me.

Even though we have a timeout specified for GenServer.call, it doesn't prevent the underlying process to continue running until there is an actual response. The port/process still remains running until there is a response, which means that we can be returning an error early, but we'll still send the response back at a much later time. This would explain why we get another message in the inbox with the resolved value from the GenServer.call we should have ignored.

https://hexdocs.pm/elixir/GenServer.html#call/3-timeouts has some details about the caller doing this. Though, we don't really want to have developers handle the discarding of that message in their code. Would need to mess with some other timeout values and combinations to be sure.

I have this working branch: https://github.com/revelrylabs/elixir-nodejs/commit/20f3c47a4bfe38591183e9eb7c89d7aa0ea6b8ed which does seem to solve the issue? But would need some more testing to make sure this is a good solution.

Zinggi commented 4 years ago

Thanks for looking into it. I can't really test it out for some time, but that commit looks like it could work.

However, maybe we also need a timeout on the JS side, since without it our next calls to the same worker would be blocked without canceling the previous call. This could be done with Promise.race and a setTimeout

Zinggi commented 3 years ago

Hey there Just wanted to let you know that I'm back from vacations and that I'll look into it again once I find time. Hopefully sometimes this week :crossed_fingers: