opencog / link-grammar

The CMU Link Grammar natural language parser
GNU Lesser General Public License v2.1
389 stars 119 forks source link

memory leak parsing a sentence (new one) #54

Closed agayardo closed 9 years ago

agayardo commented 9 years ago

The following code leaks memory:

#include <locale.h>
#include <stdio.h>
#include "link-grammar/link-includes.h"

int main()
{
    Dictionary    dict;
    Parse_Options opts;
    Sentence      sent;
    Linkage       linkage;
    char *        diagram;
    int           i, num_linkages;
    char *        input_string[] = {"Perhaps it is and perhaps it isnt."};

    setlocale(LC_ALL, "en_US.UTF-8");
    opts = parse_options_create();
    parse_options_set_max_null_count(opts, 10);
    parse_options_set_display_morphology(opts, 1);

    dict = dictionary_create_lang("en");
    if (!dict) {
        printf ("Fatal error: Unable to open the dictionary\n");
        return 1;
    }
    int qq = 0;
    while(++qq < 100)
    for (i=0; i<1; ++i) {
        sent = sentence_create(input_string[i], dict);
        sentence_split(sent, opts);
        num_linkages = sentence_parse(sent, opts);
        if (num_linkages > 0) {
            linkage = linkage_create(0, sent, opts);
            diagram = linkage_print_diagram(linkage, true, 800);
            linkage_free_diagram(diagram);
            linkage_delete(linkage);
        }
        sentence_delete(sent);
    }

    dictionary_delete(dict);
    parse_options_delete(opts);
    return 0;
}

The memory leak is specific to the parse_options_set_max_null_count(opts, 10); option and to the sentence being parsed -- note the missing apostrophe in "Perhaps it is and perhaps it isnt." -- imagine the time it took me to find this one in a large .NET project with various native code mixed in :)

Varlgrid's output is the following:

==36481== 415,008 (65,184 direct, 349,824 indirect) bytes in 97 blocks are definitely lost in loss record 6 of 6
==36481==    at 0x4C28BED: malloc (vg_replace_malloc.c:263)
==36481==    by 0x4E5C618: exalloc (utilities.c:505)
==36481==    by 0x4E35777: sentence_parse (api.c:409)
==36481==    by 0x400C88: main (in /home/dev/link-grammar-5.2.1/a.out)

Speaking of the parse options, could you please suggest the options that provide an optimal balance between performance and precision for offline parsing? The default linkage_limit of 1000 seems to provide rather different random results every time, which doesn't seem right. Could you suggest if there are any other options worth tweaking? The documentation is not very clear about what the various options mean unless you are very knowledgeable with the way the parser works.

linas commented 9 years ago

OK, so I just tried this, and I can't reproduce it, at least, not with gcc-4.8.2 I haven't tried clang...

1) Are you using 5.2.3 or 5.2.2 (I didn't fix anything between those two) ? 2) Are you sure you are not accidentally linking to an older version of the library, left over in /usr/local/lib or in /opt/somewhere or some other forgotten location?

linas commented 9 years ago

I also tried version 5.2.0 and can't reproduce it there either.

linas commented 9 years ago

FWIW -- if it depends on the apostrophe being there, then ... perhaps is spell-checker related. By default, the spell-guesser is enabled, and I'm not so sure that this is a good idea. The guesser will try alternative spellings ... For me, with aspell, it does not leak. I have not tried hunspell.

Try ./configure --disable-aspell --disable-hunspell

linas commented 9 years ago

ignore my comment about the spell checker: the valgrind output you posted does not implicate it. Beats me what's going on.

linas commented 9 years ago

Regarding random results: the parser does use a random number generator in a few rare spots. One of the primary places is when there is an overflow, and it finds zillions of parses. In this case, the rand gen randomly selects some of these to show. The randgen it is NOT reset after every sentence, so if you try it again, you'll get a different set. I believe that if you call srand(0) after every sentence, you'll get repeatable results. Alternately, there is a parse_options_get_repeatable_rand() function in the API, it should work as well.

Sentences without overflows are always deterministic.

Changing the linkage limit does not affect performance by more than 1%, for me. If the default bothers you, set it to at least 20 or 40 to deal with the long confusing, hard-to-parse sentences ...

linas commented 9 years ago

What I said about the random display may be wrong: I am looking at parse_options_set_repeatable_rand() and the default is set to true, which implies that even for large sentences, the behavior should be deterministic.

However, for windiows, there is perhaps a bug:

utilities.h:#define rand_r(seedp) rand()

so the seed is ignored, and you get random results. ...

ampli commented 9 years ago

When I tried this program, I found a dictionary-open related memory leak. It can be big only if there is a loop of dictionary open on the en dict. I opened issue #55 for it. I can send a fix if you tell me in which way to fix it.

There is a another small dictionary-open related memory leak, that may worth fixing due to a bug it may introduce. I will try to reproduce it and will then open another issue. Opened #56.

agayardo commented 9 years ago

The valgrind output says our leak is not related to opening a dictionary, and eventually it crashes on out of memory by just parsing the same sentence over and over again as in the code I posted.

We are reproducing this consistently on Windows, Mac, and Linux, with respective compiler versions MSVC2013.4, clang-600.0.56, and gcc 4.7.2.

I will try to look into it deeper to see why it leaks / doesn't reproduce.

agayardo commented 9 years ago

OK, I found what is leaking, sorry, it seems valgrind's output is a bit misleading probably because of compiler optimizations (I compiled the link-parser in default, release mode and it probably inlined away the function).

What leaks is the lkg->link_array created in void partial_init_linkage(Linkage lkg, unsigned int N_words) and probably all the other objects created in this function.

You should be able to easily see that by adding some manual counting to the respective functions:

// at global scope
int link_arrays_created = 0;
...
static void free_linkages(Sentence sent)
{
        size_t in;
        Linkage lkgs = sent->lnkages;
        if (!lkgs) return;

        for (in=0; in<sent->num_linkages_alloced; in++)
        {
                size_t j;
                Linkage linkage = &lkgs[in];
                exfree((void *) linkage->word, sizeof(const char *) * linkage->num_words);
                exfree(linkage->chosen_disjuncts, linkage->num_words * sizeof(Disjunct *));
                free(linkage->link_array);
                --link_arrays_created;
...
void partial_init_linkage(Linkage lkg, unsigned int N_words)
{
        lkg->num_links = 0;
        lkg->lasz = 2 * N_words;
        lkg->link_array = (Link *) malloc(lkg->lasz * sizeof(Link));
        printf("link_arrays: %d\n", link_arrays_created);
        ++link_arrays_created;

Now, if you run the test code in my first message above -- please note the example sentence text and the parse_options_set_max_null_count() option being set, as those are important -- you will see both the memory usage and the link_arrays_created counter growing infinitely, especially if you turn the condition in the topmost while into an infinite loop.

I re-tested this and it's reproducible on parser versions 5.2.1 and 5.2.3 on all 3 platforms.

linas commented 9 years ago

OK, I'm stumped. I cannot reproduce this. I put in the printfs, as shown, and, for me, it counts up to 7 then down to 0, over and over again. I'll experiment some more ...

linas commented 9 years ago

I copied your code, above, verbatim, into the directory tests/mem-leak.cc on github. Just right now, I compile it by hand: cc mem-leak.cc -o ml -llink-grammar If I run it with valgrind, I get no leaks, If I run it with the printfs above, I see it count up to 7 and down to 0 over and over. This is with gcc (Ubuntu 4.8.2-19ubuntu1) 4.8.2. Let me try clang now...

linas commented 9 years ago

I just tried clang:

Ubuntu clang version 3.5-1ubuntu1 (trunk) (based on LLVM 3.5)
Target: x86_64-pc-linux-gnu
Thread model: posix

Same result, no leak. I don't get it. @ampli do you have a different kind of system? Can you check?

linas commented 9 years ago

Here's the patch that I used to print:

diff --git a/link-grammar/api.c b/link-grammar/api.c
index 5ccc835..31d529b 100644
--- a/link-grammar/api.c
+++ b/link-grammar/api.c
@@ -404,6 +404,8 @@ void parse_options_reset_resources(Parse_Options opts) {
 *
 ****************************************************************/

+int XXX = 0;
+
 static Linkage linkage_array_new(int num_to_alloc)
 {
        Linkage lkgs = (Linkage) exalloc(num_to_alloc * sizeof(struct Linkage_s)
@@ -423,6 +425,8 @@ static void free_linkages(Sentence sent)
                Linkage linkage = &lkgs[in];
                exfree((void *) linkage->word, sizeof(const char *) * linkage->n
                exfree(linkage->chosen_disjuncts, linkage->num_words * sizeof(Di
+XXX--;
+printf("duude %d free %p\n", XXX, linkage->link_array);
                free(linkage->link_array);

                /* Q: Why isn't this in a string set ?? A: Because there is no
@@ -542,6 +546,8 @@ void partial_init_linkage(Linkage lkg, unsigned int N_words)
        lkg->lasz = 2 * N_words;
        lkg->link_array = (Link *) malloc(lkg->lasz * sizeof(Link));
        memset(lkg->link_array, 0, lkg->lasz * sizeof(Link));
+printf("duude %d mall %p\n", XXX, lkg->link_array);
+XXX++;

        lkg->num_words = N_words;
        lkg->cdsz =  N_words;
linas commented 9 years ago

and here's a typical print:

duude 0 mall 0x1f8c450
duude 1 mall 0x1f8fb20
duude 2 mall 0x202fc30
duude 3 mall 0x1ff1170
duude 4 mall 0x1fffab0
duude 5 mall 0x1f9b590
duude 6 mall 0x2009680
duude 7 mall 0x1fecf80
duude 7 free 0x1fecf80
duude 6 free 0x1ff1170
duude 5 free 0x1fffab0
duude 4 free 0x1f9b590
duude 3 free 0x2009680
duude 2 free 0x1f8c450
duude 1 free 0x1f8fb20
duude 0 free 0x202fc30

Each address appears in each case. The appear in a different order, due to the sorting of the linkages by probability of being correct.

linas commented 9 years ago

ahhh got it... I had to ../configure --disable-aspell to see it. For me, aspell was running, and was inserting an apostrophe, which hid the problem.

This is curious . a data-dependent leak.

linas commented 9 years ago

OK, so I just created a version 5.2.4 that fixes this; its now up on the website.

Artem, thank you for the bug reports, and the patience. This particular leak was the result of the recent large linkage-handling re-design, it was obvious once it became visible.

linas commented 9 years ago

Oh great. The website is not updating. Its still showing version 5.2.1. Arghh. Anyway, the newer version are up on the ftp site here: http://www.abisource.com/downloads/link-grammar/

agayardo commented 9 years ago

No Linas, thank YOU for your fixes, they are of great help to our project :)