Well... there is a small memory leak, but it accounts for "only" about 400 MB. The rest of the memory usage is "normal" use of memory. Everything below in this comment assumes the memory leak in libzim is fixed.
zimWriterCreator needs to keep a reference (a structure somewhere) to each dirent. The dirent is a small structure that stores the metadata about an article: its title, URL, aid, cluster and blob index, and so on. It doesn't store the article data, but it is still some data. The size of a zim::writer::Dirent is 200 bytes.
As we create 10,000,000 articles, we end up with 10,000,000 × 200 bytes ≈ 1.86 GB.
On my computer, the memory usage of node test-mem-leak slowly increases and ends at 2 GB (2.4 GB without the fix), which seems coherent with the memory we need. The extra 140 MB can be explained by other memory allocations (32 MB are allocated by lzma alone and 22 MB are below valgrind's threshold) and by rounding in the memory display.
There is not a lot we can do here, or at least not easily. We need to keep all those structures because we need to post-process the list of articles.
@mgautierfr So, to be clear:
There was a memory leak in libzim, but it has now been fixed?
If the above is true, we can do a new release of node-libzim once libzim has been released, which will solve this issue, and then a new release of mwoffliner, which will fix https://github.com/openzim/mwoffliner/issues/706
@ISNIT0 Yes, it is a memory leak in libzim. But it is only about 16% of the memory used (roughly 400 MB out of 2.4 GB). The memory will still increase linearly; the fix reduces the slope of the line (and we may do better), but it will not change the fact that it is a linear increase.
So there will always be a linear increase, no matter what is done on the node-libzim side?
@kelson42 This means that there will always be a linear increase of memory in mwoffliner. Is this okay?
@ISNIT0 I suspect your test is not really representative of the problem in https://github.com/openzim/mwoffliner/issues/706. no.wikipedia.org has only half a million articles and we are talking about 9 GB of memory consumption (4× more than in your example)... so you probably need to update the test.
The test uses relatively short URLs. Since C++ std::string stores short strings directly inside the std::string structure itself, no memory beyond the dirent itself is needed (the string is already counted in the 200 bytes). In a real case, the URLs will be longer and std::string will need to allocate separate memory to store them, so the memory usage will be bigger.
What we have here is closer to a "minimum memory usage" case.
@mgautierfr I've updated the repo.
You can now run:
tsc && npm run test -- --numArticles=100000
It will execute src/test.ts. The key thing is that for each article it will create images and redirects:
const numRedirects = faker.random.number({ min: 0, max: 10 });
const numImgs = faker.random.number({ min: 0, max: 5 });
AFAIK most of this should now be fixed in libzim 5.0.0. A new ticket should be opened if we still have something to work on here.
It seems node-libzim grows to use a lot of memory quite quickly when adding articles. The problem seems to disappear when I comment out this line: https://github.com/openzim/node-libzim/blob/master/src/ZimCreator.cc#L53 (which actually calls into libzim). I'm not sure if we need to explicitly delete something in node-libzim after the call to addArticle().
To reproduce the issue, clone the repo and run:
npm i
npm run test-mem-leak
I'm seeing the memory usage get to ~1 GB within 10-15 seconds.
@mgautierfr Do you have any ideas/suggestions?