Here is what I get in gdb:
Program received signal SIGSEGV, Segmentation fault.
0x080a26b0 in rt_finalize_gc ()
(gdb) where
(gdb)
Nice
As I'm primarily developing a parser and not a validator, I'll leave this for later.
Just a guess: get_and_skip_next_field can set the input string to null, and if it is accessed after that, a segfault occurs. Since I don't see any null checks around its calls, maybe it's just a typo and you meant an empty string instead of null?
It should not affect the GC, and the seg fault happens in the GC. A null array in D is an array with zero length, not a null reference or pointer, so the runtime should check the bounds of the null array and throw an exception, not seg fault.
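To illustrate the point about null arrays, here is a minimal, self-contained sketch (not code from the parser), assuming a standard D2 compiler with bounds checking enabled (the default for non-release builds):

```d
import std.stdio;

void main() {
    string s = null;      // a null string/array has length 0, not a dangling pointer
    writeln(s.length);    // prints 0
    writeln(s is null);   // prints true
    // auto c = s[0];     // would throw core.exception.RangeError, not segfault
}
```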
I tried to pinpoint the problem and to replicate it in a small application, but didn't succeed. The problem happens when the file has issues and exceptions are thrown in the validation code for a few tens of bad lines in the GFF3 file. I will either change that code to work without exceptions and with a lot of ifs, or ask in the forums if anybody is interested in investigating this further.
Like I said, I'm not using pointers, and don't see how the code could have caused a seg fault. Maybe through a library function which is using pointers?
Can you set hooks on items being garbage collected? That may give more info. Or force garbage collection at certain points in your code. I can see it is tricky to find (as usual with multi-threaded code), but you can try to get some control. I do suggest fixing this problem, as it could be an assumption you are making, and that would be worth finding out. I agree it could be a library function, but we should be comfortable enough with the tool chain to point it out.
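Something along these lines, as a rough sketch using D's core.memory API (the Tracked class is made up purely to show the idea of a finalizer acting as a "hook", plus a forced collection; the conservative GC gives no hard guarantee the finalizer runs at exactly this point):

```d
import core.memory : GC;
import core.stdc.stdio : printf;

class Tracked {
    ~this() {
        // keep finalizers simple: no GC allocation inside them
        printf("Tracked object finalized\n");
    }
}

void main() {
    auto t = new Tracked();
    t = null;         // drop the last reference
    GC.collect();     // force a collection; the destructor above is the "hook"
}
```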
If you give more information on how to recreate the problem, we could try and run it ourselves.
The following file is a problem for the validation code:
ftp://ftp.wormbase.org/pub/wormbase/species/m_hapla/gff/m_hapla.current.annotations.gff3.gz
When running the following on the current code from master, there is a segfault:
rake benchmark
./benchmark m_hapla.current.annotations.gff3 -v
The "score" fields in this file are bad, which makes the validate_score() function throw a lot of exceptions, mostly because to!double(score) throws an error. When I insert GC.collect() at the start of this function, no seg fault happens, but the parser is very slow :)
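Roughly like this; just a minimal sketch, the real validate_score() and its error handling are in the parser code:

```d
import std.conv : to, ConvException;
import core.memory : GC;

void validate_score(string score) {
    GC.collect();   // inserting this makes the seg fault go away, but parsing becomes very slow
    try {
        auto value = to!double(score);   // the bad "score" fields make this throw
    } catch (ConvException e) {
        // the real code reports the invalid score and keeps parsing
    }
}
```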
Interestingly, I don't see this issue with 64-bit code. With 32-bit, the behaviour is as you describe. This is definitely a bug in the GC.
My hunch is that it has to do with rethrowing exceptions, i.e. filling the stack with exception references. Exceptions require adding stack info to allow unwinding later. That would explain the slowness, for one. Also, the GC somehow gets confused in 32-bit mode; maybe it runs out of stack space. Can you reproduce my hunch, so we can file a bug with the D authors?
Well, I do see it on 64-bit on my laptop.
I already tried to reproduce the bug with simpler code, but didn't succeed. I was using nested try...catch statements, catching and rethrowing, but no seg fault. I will continue experimenting on a different branch, this time reducing the code until I have much less of it or until the seg fault stops happening. While experimenting on Saturday I was already skipping the parsing, and instead of calling to!double() in validate_score() I was simply throwing the exception, and the seg fault was still happening. So I think I should be able to reduce the code significantly and prepare it for a bug report.
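For reference, this is roughly the shape of the code I was experimenting with (a hypothetical sketch, not the actual reduced example from the parser):

```d
import std.stdio;

void fake_validate_score() {
    // stands in for to!double() failing on a bad score field
    throw new Exception("invalid score");
}

void parse_line(string line) {
    try {
        fake_validate_score();
    } catch (Exception e) {
        throw e;   // rethrow, as the nested try...catch code in the parser does
    }
}

void main() {
    foreach (i; 0 .. 1_000_000) {
        try {
            parse_line("bad line");
        } catch (Exception e) {
            // swallow and continue, like the validator collecting errors per line
        }
    }
    writeln("done");
}
```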
In parallel I would also like to build the D runtime library with debug symbols included, so that I can point to the specific line in the GC where the seg fault occurs (e.g. which pointer is invalid).
Thx for the comments.
I've been reducing the example and I'm currently at 42 lines :)
https://gist.github.com/2911818
You can compile it using "dmd seg_fault.d"; both 32-bit and 64-bit builds segfault. I've tried a bunch of different things, but it seems I'm not able to get a shorter example. Some examples:
Please take a look and let me know what you think. It still seems a bit too big and complicated for a bug report, but I'm out of ideas. For today at least :)
Temporarily worked around by setting the array size to 8176 bytes.
Also submitted a bug report on the Dlang website:
http://forum.dlang.org/thread/bug-8232-3@http.d.puremagic.com%2Fissues%2F
When parsing the 233 MB m_hapla test file from Wormbase, the tool crashes with a segmentation fault after a number of parsed lines.