openzim / node-libzim

Libzim binding for Node.js: read/write ZIM files in Javascript
https://www.npmjs.com/package/@openzim/libzim
GNU General Public License v3.0

Memory Leak #10

Closed ISNIT0 closed 5 years ago

ISNIT0 commented 5 years ago

It seems node-libzim grows to use a lot of memory quite quickly when adding articles.

The problem seems to disappear when I comment out this line: https://github.com/openzim/node-libzim/blob/master/src/ZimCreator.cc#L53 (which actually calls into libzim). I'm not sure if we need to explicitly delete something in node-libzim after the call to addArticle().

To reproduce the issue, clone the repo and run:

I'm seeing the memory usage get to ~1GB within 10-15 seconds

@mgautierfr Do you have any ideas/suggestions?

mgautierfr commented 5 years ago

Well... There is a small memory leak, but it is "only" about 400 MB. The rest of the memory usage is a "normal" use of memory. Everything below in this comment assumes the memory leak in libzim is fixed.

zimWriterCreator needs to keep a reference (a structure somewhere) to each dirent. The dirent is a small structure that stores the metadata about an article: the article title, url, aid, the cluster and blob index, and so on. It doesn't store the article data, but it is still some data. The size of a zim::writer::Dirent is 200 bytes. As we create 10,000,000 articles, we end up with 10,000,000 × 200 bytes = 1.86 GiB.

On my computer, the memory usage of node test-mem-leak slowly increases and ends at 2 GB (2.4 GB without the fix), which is coherent with the memory we need. The extra 140 MB can be explained by other memory allocations (32 MB are simply allocated by lzma and 22 MB are below the valgrind threshold) and memory display rounding. There is not a lot we can do here, or at least "not easily": we need to keep all those structures because we need to post-process the list of articles.

ISNIT0 commented 5 years ago

@mgautierfr So, to be clear: there was a memory leak in libzim, but it has now been fixed? If so, we can do a new release of node-libzim once libzim has been released, which will solve this issue, and then a new release of mwoffliner, which will fix https://github.com/openzim/mwoffliner/issues/706

mgautierfr commented 5 years ago

@ISNIT0, yes, it was a memory leak in libzim. But it accounts for only about 16% of the memory use. The memory will still increase linearly; the fix reduces the slope of the line (and we may do better), but it will not change the fact that the increase is linear.

ISNIT0 commented 5 years ago

So there will always be a linear increase, no matter what is done on the node-libzim side?

@kelson42 This means that there will always be a linear increase of memory in mwoffliner. Is this okay?

kelson42 commented 5 years ago

@ISNIT0 I suspect your test is not really representative of the problem in https://github.com/openzim/mwoffliner/issues/706. no.wikipedia.org has only half a million articles and we are talking about 9 GB of memory consumption (4× more than in your example)... so you probably need to update the test.

mgautierfr commented 5 years ago

The test uses relatively short urls. As C++ std::string stores short strings inside the std::string structure directly, there is no need for more memory than the dirent itself (the string is already counted in the 200 bytes). In a real case, the url will be longer and the std::string will need to allocate separate memory to store it, so the memory usage will be bigger. What we have here is more of a "minimum memory usage" case.

ISNIT0 commented 5 years ago

@mgautierfr I've updated the repo.

You can now run:

tsc && npm run test -- --numArticles=100000

It will execute src/test.ts. The key thing is that for each article it will create images and redirects:

const numRedirects = faker.random.number({ min: 0, max: 10 });
const numImgs = faker.random.number({ min: 0, max: 5 });
kelson42 commented 5 years ago

AFAIK most of this should now be fixed in libzim 5.0.0. A new ticket should be opened if we still have something to work on here.