jackaudio / jack2

jack2 codebase
GNU General Public License v2.0
2.22k stars 377 forks source link

KILL_MASTER-Signal via local multicast packet #239

Open jreusch-barco opened 8 years ago

jreusch-barco commented 8 years ago

hi i encountered a strange problem where on a jackd instance running as master (with alsa backend) some netjack2 client handles weren't closed(=destructed) properly.

on the master the JackNetMasterInterface::Recv(...) method calls JackNetMasterInterface::FatalRecvError() which itself calls JackNetMasterInterface::Exit(). This sends an UDP packet to the same process(?) with the KILL_MASTER flag set. Am i right in assuming such packets get only send by the jackd-master instance to itself?

In another thread of the same process the JackNetMasterManager::Run loop sees the KILL_MASTER packet and deletes the handle to the lost netjack2-connection-object... This seems quote odd, but the bad thing is: if this single UDP packet gets lost, the jackd-process ends up with a stale netjack2 handle and is unusable from this point on and does not recover? I know, netjack2 generally forgives packetloss not very well, but should it be that one single packet has such a vast impact? (XRuns are ok, but rendering the whole process useless is quite another class)

So, is there any obvious reason i miss that this was solved this way? Couldn't it be done in a more reliable way? What could this be? Sth. like a cleanup-queue which gets processed at the appropiate place? This could also be the first step in removing this "temporary" ( ;) ) ugly hack with the call to pthread_exit ;)

Thanks! Jan

sletz commented 8 years ago

This « kill » code was not very good since the beginning. Don’t be shy to write a better implementation.

Le 21 nov. 2016 à 18:31, jreusch-barco notifications@github.com a écrit :

hi i encountered a strange problem where on a jackd instance running as master (with alsa backend) some netjack2 client handles weren't closed(=destructed) properly.

on the master the JackNetMasterInterface::Recv(...) method calls JackNetMasterInterface::FatalRecvError() which itself calls JackNetMasterInterface::Exit(). This sends an UDP packet to the same process(?) with the KILL_MASTER flag set. Am i right in assuming such packets get only send by the jackd-master instance to itself?

In another thread of the same process the JackNetMasterManager::Run loop sees the KILL_MASTER packet and deletes the handle to the lost netjack2 handle... This seems quote odd, but the bad thing is: if this single UDP packet gets lost, the jackd-process ends up with an stale netjack2 handle and is unusable from this point on and does not recover? I know, netjack2 generally forgives packet loss not very well, but should it be that one single packet has such a vast impact? (XRuns are ok, but rendering the whole process useless is quite another class)

So, is there any obvious reason i miss that this was solved this way? Couldn't it be done in a more reliable way? What could this be? Sth. like a cleanup-queue which gets processed at the appropiate place? This could also be the first step in removing this "temporary" ( ;) ) ugly hack with the call to pthread_exit ;)

Thanks! Jan

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

jreusch-barco commented 8 years ago

hi Stéphane i would really like to do so! but i could also use some "hints" ;) why is the JackNetSlaveInterface::FatalRecvError() code throwing a JackNetExceptionand the master-code isn't? are my assumptions right with the KILL_MASTER-signal?

sletz commented 8 years ago

Le 21 nov. 2016 à 18:31, jreusch-barco notifications@github.com a écrit :

hi i encountered a strange problem where on a jackd instance running as master (with alsa backend) some netjack2 client handles weren't closed(=destructed) properly.

on the master the JackNetMasterInterface::Recv(...) method calls JackNetMasterInterface::FatalRecvError() which itself calls JackNetMasterInterface::Exit(). This sends an UDP packet to the same process(?) with the KILL_MASTER flag set. Am i right in assuming such packets get only send by the jackd-master instance to itself?

Yes.

In another thread of the same process the JackNetMasterManager::Run loop sees the KILL_MASTER packet and deletes the handle to the lost netjack2 handle... This seems quote odd, but the bad thing is: if this single UDP packet gets lost,

How a packet could be lost in a same process?

the jackd-process ends up with an stale netjack2 handle and is unusable from this point on and does not recover?

Have you ever seen the problem occurring ?

I know, netjack2 generally forgives packet loss not very well, but should it be that one single packet has such a vast impact? (XRuns are ok, but rendering the whole process useless is quite another class)

So, is there any obvious reason i miss that this was solved this way? Couldn't it be done in a more reliable way? What could this be? Sth. like a cleanup-queue which gets processed at the appropiate place? This could also be the first step in removing this "temporary" ( ;) ) ugly hack with the call to pthread_exit ;)

Well it is a bit ugly, but the point was : the thread calling JackNetMasterInterface::Recv needs to notify the thread running JackNetMasterManager::Run when a Recv fails. Using a special packet seems to be easy way to to that. ThreadExit logic could possibly be changed by throwing a exception that would be be catched by the receive thread in order to properly quit.

jreusch-barco commented 8 years ago

hi

the jackd-process ends up with an stale netjack2 handle and is unusable from this point on and does not recover?

Have you ever seen the problem occurring ?

yep, that's why i dug into this issue :) it happened randomly on some machines when netjack2 clients got rebooted. It was rare, but as it rendered the jackd-master-process useless was quite severe.

Well it is a bit ugly, but the point was : the thread calling JackNetMasterInterface::Recv needs to notify the thread running JackNetMasterManager::Run when a Recv fails. Using a special packet seems to be easy way to to that. As stated i would like/propose to change this and post a patch here soon. :)

ThreadExit logic could possibly be changed by throwing a exception that would be be catched by the receive thread in order to properly quit.

This is still unclear to me and i need to dig deeper into it. The Slave-methods already solve it this way by throwing an exception, there is no call to pthread_kill. It is unclear to me until now why the Master-methods do it this way...

jreusch-barco commented 8 years ago

This is still unclear to me and i need to dig deeper into it. The Slave-methods already solve it this way by throwing an exception, there is no call to pthread_kill. It is unclear to me until now why the Master-methods do it this way...

ok, got it ;)

jreusch-barco commented 7 years ago

hi there i now have an alpha version of a patch for this. It still needs some rework&definitely some discussion :) For now i just hooked up in the recv-loop of the JackNetMasterManager::Run method which runs on my system every two seconds due to the timeout given to the recv method. The problem i have: the jackd-process still gets unusable after some time. but instead of a few minutes now it takes up to 12 hours.

I would be glad to get some help regarding if i caused this with my patch or if it looks like this is another issue. The point where it crashes always starts with a log entry like

Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::TimedWait name = jack_sem.1000_default_jackslave already deallocated!!

and is then followed by only giving this error from this point on

Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:50 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:51 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:51 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!
Dec  2 08:38:51 jackmaster jackctl[9858]: JackPosixSemaphore::Signal name = jack_sem.1000_default_jackslave already deallocated!!

This also occurs if i don't "exit" with an Exception and leave the call to ThreadExit (which gives me the opinion that changing the return value of the process method from 0 to -1 in case of an exception isn't really the problem)

e: here is the current version of the patch patchv1.txt commit on github: c79111ba3c47a7a4ec836cb77b316b5733a08377