mamarjan / gff3-pltools

A fast parallel GFF3 parser
MIT License

Benchmark tool seg-fault #31

Closed: mamarjan closed this issue 12 years ago

mamarjan commented 12 years ago

When parsing the 233MB m_hapla testfile from Wormbase, the tool crashes with a segmentation fault after a number of parsed lines.

mamarjan commented 12 years ago

Here is what I get in gdb:

    Program received signal SIGSEGV, Segmentation fault.
    0x080a26b0 in rt_finalize_gc ()
    (gdb) where
    #0  0x080a26b0 in rt_finalize_gc ()
    #1  0x080a1155 in gc.gcx.Gcx.fullcollect() ()
    #2  0x080a0b4e in gc.gcx.Gcx.fullcollectshell() ()
    #3  0x080a04ea in gc.gcx.Gcx.bigAlloc() ()
    #4  0x0809e66e in gc.gcx.GC.mallocNoSync() ()
    #5  0x0809e50f in gc.gcx.GC.malloc() ()
    #6  0x08094e5e in gc_malloc ()
    #7  0x0809638d in _d_newclass ()
    #8  0x080acfb7 in core.runtime.defaultTraceHandler() ()
    #9  0x0809cfc9 in _d_traceContext ()
    #10 0x08095c44 in _d_createTrace ()
    #11 0xbffff21c in ?? ()
    (gdb)

pjotrp commented 12 years ago

Nice

mamarjan commented 12 years ago

As I'm primarily developing a parser and not a validator, I'll leave this for later.

lomereiter commented 12 years ago

Just a guess. get_and_skip_next_field can set input string to null, and if it's accessed after that, segfault occurs. Since I don't see any null checks surrounding its calls, maybe it's just a typo, and you meant empty string instead of null?

mamarjan commented 12 years ago

It should not affect the GC, and the seg fault happens in the GC. A null array in D is an array with zero length, not a null reference or pointer, so the runtime should check the bounds of the null array and throw an exception rather than seg fault.
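As a minimal sketch of the null-array behaviour described above (the helper `indexingThrows` is hypothetical and not part of gff3-pltools): in a default build, a null string has length zero, compares equal to the empty string, and indexing it raises `core.exception.RangeError`, which is an `Error` rather than an `Exception`. This assumes bounds checking is enabled; builds with `-release` or `-boundscheck=off` may skip the check.

```d
import core.exception : RangeError;

// Hypothetical helper: returns true if indexing `s` at `i` raises RangeError.
bool indexingThrows(string s, size_t i)
{
    try
    {
        auto c = s[i];   // bounds-checked in a default (non-release) build
        return false;
    }
    catch (RangeError e)
    {
        return true;
    }
}

void main()
{
    string field = null;              // null slice: null pointer, length 0
    assert(field is null);
    assert(field.length == 0);
    assert(field == "");              // equal to the empty string by content
    assert(indexingThrows(field, 0)); // out of bounds, not a null dereference
    assert(!indexingThrows("7.5", 0));
}
```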

I tried to pinpoint the problem and to replicate it in a small application, but didn't succeed. The problem happens when the file has issues and exceptions are thrown in the validation code for a few tens of bad lines in the GFF3 file. I will either change that code to work without exceptions and with a lot of ifs, or ask in the forums if anybody is interested in investigating this further.

Like I said, I'm not using pointers, and don't see how the code could have caused a seg fault. Maybe through a library function which is using pointers?

pjotrp commented 12 years ago

Can you set hooks to items being garbage collected? That may give more info. Or force the garbage collection at certain points in your code. I can see it is tricky to find (as usual with multi threaded), but you can try and get some control. I do suggest to fix this problem, as it could be an assumption you are making - and that would be worth finding out. I agree it could be a library function, but we should be comfortable enough with the tool chain to point it out.
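One way this suggestion could be tried is sketched below, under stated assumptions: `TrackedRecord`, `finalizedCount` and `processWithForcedCollections` are hypothetical names, not code from gff3-pltools. A class destructor acts as a crude hook that runs when the GC finalizes an object, and `GC.collect()` from `core.memory` forces a full collection at a chosen point.

```d
import core.memory : GC;

__gshared size_t finalizedCount;   // bumped by the destructor below

// Hypothetical stand-in for a per-line object.
class TrackedRecord
{
    ~this()
    {
        // Keep finalizers allocation-free: allocating during a collection is an error.
        ++finalizedCount;
    }
}

// Creates `n` dummy records and forces a full collection every `interval`
// records (assumes interval > 0). Returns the number of forced collections.
size_t processWithForcedCollections(size_t n, size_t interval)
{
    size_t collections = 0;
    foreach (i; 0 .. n)
    {
        auto record = new TrackedRecord;
        record = null;                 // drop the reference so it can be collected
        if ((i + 1) % interval == 0)
        {
            GC.collect();
            ++collections;
        }
    }
    return collections;
}

void main()
{
    assert(processWithForcedCollections(1000, 100) == 10);
}
```

How many objects are actually finalized at each forced collection is not deterministic, so the counter is only useful as a rough trace.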

If you give more information on how to recreate the problem, we could try and run it ourselves.

mamarjan commented 12 years ago

The following file is a problem for the validation code:

ftp://ftp.wormbase.org/pub/wormbase/species/m_hapla/gff/m_hapla.current.annotations.gff3.gz

When running the following on the current code from master, there is a segfault:

    rake benchmark
    ./benchmark m_hapla.current.annotations.gff3 -v

The "score" fields in this file are bad and that makes the validate_score() function throw a lot of exceptions, mostly because to!double(score) throws an error. When I insert GC.collect() at the start of this function, no seg fault happens, but the parser is very slow :)
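For illustration only, the throwing conversion can be sketched as follows. `isValidScoreSketch` is a hypothetical name and not the project's `validate_score()`; it assumes the GFF3 convention that a score is either a floating-point number or `.` for "undefined". `std.conv.to!double` throws `ConvException` on input it cannot convert, so every bad score costs one exception.

```d
import std.conv : to, ConvException;

// Hypothetical sketch: true if `score` is "." or converts to a double.
bool isValidScoreSketch(string score)
{
    if (score == ".")
        return true;               // "." means the score is undefined
    try
    {
        auto value = to!double(score);
        return true;
    }
    catch (ConvException e)
    {
        return false;              // one exception per bad score field
    }
}

void main()
{
    assert(isValidScoreSketch("."));
    assert(isValidScoreSketch("0.95"));
    assert(isValidScoreSketch("1e-5"));
    assert(!isValidScoreSketch("abc"));
    assert(!isValidScoreSketch(""));
}
```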

lomereiter commented 12 years ago

Interestingly, I don't see this issue with 64-bit code. With 32-bit, the behaviour is as you describe. This is definitely a bug in GC.

pjotrp commented 12 years ago

My hunch is that it has to do with rethrowing exceptions, i.e. filling the stack with exception references. Exceptions require adding stack info to allow rewinding later. That would explain slowness, first. Also, the GC somehow gets confused in 32-bits, maybe it runs out of stack space. Can you reproduce my hunch, so we can file a bug with the D authors?

mamarjan commented 12 years ago

Well, I do see it on 64bit on my laptop.

I already tried to reproduce the bug with simpler code, but didn't succeed. I was using nested try...catch statements, catching and rethrowing, but got no seg fault. I will continue experimenting on a different branch, this time reducing the actual code until it is small enough or until the seg fault stops happening. While experimenting on Saturday I was already skipping the parsing, and instead of calling to!double() in validate_score I was simply throwing the exception, and the seg fault was still happening. So in this way I think I should be able to reduce the code significantly and prepare it for a bug report.

But in parallel I would like to make a D runtime library with debug symbols included, so that I can point to a specific line in GC where the seg fault occurs (e.g. which pointer is invalid).

Thx for the comments.

mamarjan commented 12 years ago

I've been reducing the example and I'm currently at 42 lines :)

https://gist.github.com/2911818

You can compile it using "dmd seg_fault.d", 32bit and 64bit, both segfault. I've tried a bunch of different things, but it seems I'm not able to get to a shorter example. Some examples:

  1. When I comment the call to try_and_catch(), there is no seg fault (this was expected),
  2. When I remove the ~= operator in "line ~= something" (see gist), for example when I replace it with "line = line ~ something", there is no segfault,
  3. When I put 8176 or smaller as the array size in the new statement (see gist), there is no segfault. 8177 or higher produces the segfault. 8176+16=8192==8kB, so it seems something 16 bytes in size is creating a problem here :)
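The difference between the two append forms in point 2 can be sketched as below. This is not the gist's code: the `char[]` element type and the helper name `concatCopies` are assumptions. In D, `a ~ b` always allocates a new array, while `a ~= b` may extend the existing block in place when the runtime finds room, so the two forms exercise different allocator paths.

```d
// Hypothetical helper: true if `line = line ~ x` moved the data to a new block.
bool concatCopies(size_t n)
{
    auto line = new char[](n);
    auto original = line.ptr;
    line = line ~ 'x';             // binary ~ always creates a copy
    return line.ptr !is original && line.length == n + 1;
}

void main()
{
    enum size_t n = 8176;
    static assert(n + 16 == 8192); // the 8 kB boundary noted in point 3

    assert(concatCopies(n));

    auto other = new char[](n);
    other ~= 'x';                  // may grow in place or reallocate
    assert(other.length == n + 1);
}
```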

Please take a look and let me know what you think. It seems still a bit too big and complicated for a bug report. But I'm out of ideas. For today at least :)

mamarjan commented 12 years ago

Temporarily solved by setting the array size to 8176 bytes.

Also submitted a bug report on the Dlang website:

http://forum.dlang.org/thread/bug-8232-3@http.d.puremagic.com%2Fissues%2F