Open pat227 opened 4 years ago
This looks like a bug in a C binding. Are you using any other library?
I am. I am attempting to use strace to get more details. I am using OCaml bindings for mysql and postgres, in a separately compiled and linked OCaml project. Said project also uses Core and Core.Unix specifically to perform some basic file manipulations.
I suppose I could also ditch Core.Unix for Pervasives and see if that helps. But it could also be the bindings for the databases, although I have been using those far longer without issue. Therefore I suspect Core.Unix, or my use of it, is the trouble.
On Mon, Feb 24, 2020, 10:28 Jerome Vouillon notifications@github.com wrote:
This looks like a bug in a C binding. Are you using any other library?
It's hard to debug this kind of issue since the crash happens during garbage collection, well after the buggy code has been executed.
You may be able to narrow down the issue by forcing garbage collections by inserting calls to Gc.minor () in your code.
Core.Unix and Lwt are rather solid. I would not expect a bug there.
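As a rough illustration of the suggestion above, here is a minimal sketch of bracketing a suspect call with forced minor collections. The names `gc_checkpoint` and `suspect_call` are hypothetical and stand in for your own code; only `Gc.minor` comes from the standard library. If heap corruption is detected during a forced collection, the last label printed should bracket the code that caused it.

```ocaml
(* Hypothetical helper: force a minor collection, then log a label.
   If the process crashes inside Gc.minor, the label is never printed,
   so the faulty code ran after the previous checkpoint that did print. *)
let gc_checkpoint label =
  Gc.minor ();
  prerr_endline ("gc checkpoint passed: " ^ label)

(* Hypothetical stand-in for a call that goes through a C binding. *)
let suspect_call n = String.make n 'a'

let run () =
  gc_checkpoint "before suspect_call";
  let s = suspect_call 16 in
  gc_checkpoint "after suspect_call";
  String.length s
```

Moving the checkpoints closer together around the call that precedes the first failing checkpoint should narrow down which binding is at fault.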
Perhaps it's me.
I need to serialize requests from a web interface to do some task. I accomplish this by using a lock file, specifically Core.Unix.flock. I also was writing to the file some details of what was going on. Originally I wrote this within a with_file function. That was a mistake. My first backtrace showed the error in there, specifically on Core_Unix__with_close.
I use the db bindings for maintaining a queue and a history.
I eliminated the use of with_file after I realized I really should create a file descriptor and use that same descriptor for locking, writing, and unlocking. There's not much passing around of this descriptor. I realize how fraught with race conditions the use of lock files is, and I am no expert at it. But my use is very simple: if the file is already locked, just push to the queue and exit, else process the queue.
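For reference, the locking logic described above can be sketched roughly as follows. This is not the actual code from the project: it uses `Unix.lockf` from the standard `unix` library rather than Core.Unix.flock, and `try_lock`, `unlock`, `handle_request`, `enqueue`, and `process_queue` are hypothetical names. Note that `lockf` takes POSIX record locks, which are owned by the process and behave differently from `flock` locks, so this only illustrates the control flow.

```ocaml
(* Requires linking against the standard unix library. *)

type outcome = Queued | Processed

(* Try to take an exclusive, non-blocking lock on [path].
   Returns the open descriptor on success, None if another process holds it. *)
let try_lock path =
  let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT ] 0o644 in
  match Unix.lockf fd Unix.F_TLOCK 0 with
  | () -> Some fd
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EACCES), _, _) ->
    Unix.close fd;
    None

(* Release the lock and close the same descriptor that took it. *)
let unlock fd =
  Unix.lockf fd Unix.F_ULOCK 0;
  Unix.close fd

(* If the file is already locked, push to the queue and return;
   otherwise process the queue while holding the lock. *)
let handle_request ~lock_path ~enqueue ~process_queue =
  match try_lock lock_path with
  | None ->
    enqueue ();
    Queued
  | Some fd ->
    (match process_queue () with
     | () -> unlock fd
     | exception e -> unlock fd; raise e);
    Processed
```

The descriptor is opened once and closed only after unlocking, so nothing else closes it behind the caller's back.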
So that brought me to the latest backtrace that makes no mention of anything other than some C binding without a clue as to where. Although my use of flock is now suspect. I also use Core.Unix.fork_exec but that appears to be ok, or at least I have no reason to suspect it or my use of it.
Thanks for the pointers.
On Mon, Feb 24, 2020, 11:35 Jerome Vouillon notifications@github.com wrote:
It's hard to debug this kind of issue since the crash happens during garbage collection, well after the buggy code has been executed. You may be able to narrow down the issue by forcing garbage collections by inserting calls to Gc.minor () in your code. Core.Unix and Lwt are rather solid. I would not expect a bug there.
It's been over a year, but I wanted to leave an update: I discovered that sprinkling Gc.full_major () liberally throughout just one of my dependencies (a module that uses Postgresql) appears to have fixed this issue. I added the garbage collection calls almost exclusively within functions where I actually use Postgresql, such as new Postgresql.connection, #exec, #status, etc. I cannot imagine I am the only person who sees this. I must be abusing the Postgresql package somehow, but I don't see how. This workaround is ugly and was discovered merely by guessing. I have no idea where the problem lies or what it is; I only know I can ward it off with lots of garbage collection. I may have some time in the future to narrow it down by removing some of those collections, and possibly using only minor collections, to see if anything makes a difference and where.
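In case it helps anyone trying the same workaround, here is a minimal sketch of a wrapper that makes the collections switchable, so they can be removed or downgraded to minor collections without editing every call site. Everything here is hypothetical except `Gc.minor` and `Gc.full_major`: the `with_gc` name, the `gc_mode` type, and the `DEBUG_GC_MODE` environment variable are made up for illustration.

```ocaml
(* Hypothetical switch for how aggressively to collect around a call. *)
type gc_mode = No_gc | Minor | Full_major

(* Read the mode from a made-up environment variable, DEBUG_GC_MODE.
   Unset or unrecognized values disable the forced collections. *)
let gc_mode_of_env () =
  match Sys.getenv "DEBUG_GC_MODE" with
  | "minor" -> Minor
  | "full" -> Full_major
  | _ -> No_gc
  | exception Not_found -> No_gc

(* Run [f x] with a forced collection before and after the call.
   If [f] raises, the second collection is skipped. *)
let with_gc ?(mode = gc_mode_of_env ()) f x =
  let collect () =
    match mode with
    | No_gc -> ()
    | Minor -> Gc.minor ()
    | Full_major -> Gc.full_major ()
  in
  collect ();
  let result = f x in
  collect ();
  result
```

A call site would then look like `with_gc some_db_function arg`, where `some_db_function` stands for whichever function wraps the Postgresql calls, and the mode can be changed per run or per call with `~mode`.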
Below is a backtrace. I can also provide the memory map if that would be of any help. I have an instance of ocsigenserver running a small eliom project, and it consistently crashes after a short while. This is very disappointing; I have never seen anything like this. This is running on OCaml 4.06.1, eliom 6.7.0, js_of_ocaml 3.4.0, lwt 4.2.1, and ocsigenserver 2.14.0.
Could it be that I have caused this failing behavior through my use of lwt, or in some other way? Or is this a bug in ocsigenserver?
Error in `/home/admin/.opam/4.06.1/bin/ocamlrun': double free or corruption (fasttop): 0x00007fdf78009860
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7fdfa4730bfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7fdfa4736fc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7780e)[0x7fdfa473780e]
/home/admin/.opam/4.06.1/bin/ocamlrun(caml_empty_minor_heap+0x126)[0x5605e5373a56]
/home/admin/.opam/4.06.1/bin/ocamlrun(caml_gc_dispatch+0x4b)[0x5605e5373eab]
/home/admin/.opam/4.06.1/bin/ocamlrun(caml_interprete+0x22a4)[0x5605e538dd34]
/home/admin/.opam/4.06.1/bin/ocamlrun(caml_callbackN_exn+0xb9)[0x5605e5385bc9]
/home/admin/.opam/4.06.1/bin/ocamlrun(caml_callback_exn+0x18)[0x5605e5385c28]
/home/admin/.opam/4.06.1/lib/ocaml/stublibs/dllthreads.so(+0x2b09)[0x7fdfa3a42b09]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x74a4)[0x7fdfa4a664a4]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fdfa47a8d0f]