openzim / sotoki

StackExchange websites to ZIM scraper
https://library.kiwix.org/?category=stack_exchange
GNU General Public License v3.0

Memory Leak (XML parsing, Templates, libzim ?) #243

Open rgaudin opened 3 years ago

rgaudin commented 3 years ago

Bounty Hunters: See https://github.com/openzim/sotoki/issues/243#issuecomment-1170237942 for additional details

When running sotoki on a very large domain such as StackOverflow, memory becomes an issue. A full run (without images) on the athena worker (which has a slow CPU and disk) consumed nearly all of its 96GiB of configured RAM.

Screen Shot 2021-09-10 at 08 12 05

Note: because it took so long to complete, we've lost most of the netdata graph… Another run on a faster worker lasted half the time (but we have no RAM usage graph).

This is obviously unexpected and unsustainable. The expected RAM usage for StackOverflow is 36-38GiB, of which 15-18GiB is used by redis.

While it allowed us to discover some memory issues in libzim and scraperlib, it is clear that sotoki is leaking most of its generated content.

Below is a graph of memory consumption over a full run when using a fake ZIM creator (so no data is written to ZIM nor disk).

full-sotoki-nozim

Don't look at the initial spike, which is a debug artefact. While the absolute values are small (max of 215MiB), the curve should be flat. Here, we're seeing a 70% RAM increase.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

kelson42 commented 2 years ago

@rgaudin @mgautierfr Maybe this new opensource tool would help to understand what is going on? https://github.com/bloomberg/memray
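For reference, a minimal sketch of how memray's documented Tracker context manager could be wrapped around a suspect phase of the scraper; the output file name and the run_phase callable are placeholders, not sotoki code:

from memray import Tracker

def profile_phase(run_phase, output_file="sotoki-phase.bin"):
    # Allocations made while the Tracker is active are recorded into
    # output_file, which memray's reporters (e.g. the flamegraph one)
    # can then turn into a readable report.
    with Tracker(output_file):
        run_phase()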

Popolechien commented 2 years ago

Note: a bounty can now be claimed for solving this issue. https://app.bountysource.com/issues/109534260-leaking-memory

TheCrazyT commented 2 years ago

I guess you already made sure that no process is hanging, too? (A graph of active running processes might also help.) I just took a quick look through the code and saw some subprocess.run calls, but sadly you log no info about them (when they are started and finished). Maybe there are variants with a timeout that you could use to make sure they finish within a defined time ...

PS: It could also be that "zimscraperlib" is leaking, but I haven't taken a look at it yet ...

rgaudin commented 2 years ago

@TheCrazyT those processes happen early on in the whole scraping process and complete before we start the intensive work. Now that we're expecting outsiders to take a look at this, I'll rewrite the issue in the coming day(s) with as much detail as possible. Yes, https://github.com/openzim/python-scraperlib is a possible culprit.

TheCrazyT commented 2 years ago

I'm currently looking at https://githubmemory.com/repo/openzim/python-libzim/issues/117?page=2. Does that mean that "4. Observe that memory consumption is stable" is wrong because you still have problems? If you let the proof-of-concept script on that site run long enough (with a bigger dataset), would you get the same curve?

If your answer is yes, then I'm wondering if it could be a Python-internal problem, somewhere inside xml.sax.handler. At least it would also explain why most of the process is leaking. But the impact must be really, really small, because you won't notice it on short runs.

rgaudin commented 2 years ago

Here's a more complete description for those not familiar with sotoki and this issue.

Here's a graph that shows how it looks over time (this was an OOM'd run with ~60GB of RAM).

Screen Shot 2021-08-30 at 08 43 52

The huge spike at the beginning is the preparation step. Then you can see sotoki processing the Users (storing info in redis and creating user pages in the ZIM), followed by a drop in usage: this is when we release some of the info that we won't need anymore (see :cleanup_users(), :purge() and :defrag_external() in sotoki.utils.database). Then it starts processing the questions/answers and the RAM increases linearly until it either OOMs or completes.

The latest stackoverflow run, using the 2022-03 dumps, used approx. 125GB of RAM (that run wasn't monitored).


What the RAM consumption shows is that the entirety of the content that we read/write (which exactly is unclear, because we both transform and compress while making the ZIM) goes into RAM and stays there.

Although Python doesn't release memory back to the system, this memory is genuinely no longer available, as we hit OOM on systems without enough memory.

It seems (no hard evidence, just the match of XML size vs RAM usage) that the size of images (fetched online and added to the ZIM) does not impact RAM consumption, which may indicate that the leak doesn't affect binary item addition in scraperlib/libzim.
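For anyone reproducing this, a hedged helper sketch (assuming the psutil package, which sotoki does not necessarily use) to log the process RSS periodically and compare it against the growth curve described above:

import os
import threading
import time

import psutil

def log_rss_every(interval_seconds: float = 60.0) -> threading.Thread:
    # Prints the resident set size of the current process at a fixed interval,
    # so the growth can be compared against the netdata graphs.
    process = psutil.Process(os.getpid())

    def loop():
        while True:
            print(f"RSS: {process.memory_info().rss / 2**20:.1f} MiB", flush=True)
            time.sleep(interval_seconds)

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return thread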

TheCrazyT commented 2 years ago

Talking about the SAX parser ... it is probably better to execute close() after you are done:

import xml.sax
parser = xml.sax.make_parser()

#...

try:
    parser.close()
except xml.sax.SAXException:
    # not sure if anything can be done here;
    # just ignoring this error for now ;-)
    pass

I tested it in Google Colab, and the docstring of the close function says the following:

This method is called when the entire XML document has been
passed to the parser through the feed method, to notify the
parser that there are no more data. This allows the parser to
do the final checks on the document and empty the internal
data buffer.

The parser will not be ready to parse another document until
the reset method has been called.

close may raise SAXException.

(same info as here : https://github.com/python/cpython/blob/main/Lib/xml/sax/xmlreader.py#L144 )

The default parser, ExpatParser, implements IncrementalParser and seems to use a "pyexpat C module", at least according to this comment: https://github.com/python/cpython/blob/f4c03484da59049eb62a9bf7777b963e2267d187/Lib/xml/sax/expatreader.py#L88

I also didn't find anything that executes close automatically. So my guess is that this "pyexpat C module" still holds some resources (probably not the whole document) and memray can't figure it out because it is not pure Python.
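As an illustration of the incremental API discussed above (a sketch only; the RowCounter handler and the Posts.xml file name are made up for the example, not sotoki code):

import xml.sax
from xml.sax.handler import ContentHandler

class RowCounter(ContentHandler):
    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        if name == "row":
            self.rows += 1

parser = xml.sax.make_parser()  # ExpatParser, an IncrementalParser
parser.setContentHandler(RowCounter())
with open("Posts.xml", "rb") as fh:
    for chunk in iter(lambda: fh.read(1 << 20), b""):
        parser.feed(chunk)
parser.close()  # lets pyexpat do its final checks and empty its internal buffer
parser.reset()  # required before the parser can be reused for another document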

TheCrazyT commented 2 years ago

It is still possible that the xml library is leaking. At least I created a proof of concept with a Jupyter notebook:

https://gist.github.com/TheCrazyT/376bb0aac2d283f9ba505b889ef5da51

The bad thing is ... I currently can't really figure out what the problem is, but it seems to be caused by the HTML entities in combination with long text inside the body attribute of the row tag. I was hoping that it was just a special entity causing problems.

I will probably try to create a better proof of concept with a memory profiler ... just to be sure.

Edit: Oh well, profiling didn't show the result I was expecting. Somehow I'm still evaluating this wrong; there seems to be another condition I cannot figure out. For example, I just duplicated the row and numbered it, because I was expecting that the other rows should also stay in memory. But for some reason only the content of the first row stayed in memory, although the content of the other rows was nearly the same. I'm a bit clueless at the moment ...

Edit again: Guess that was just a red herring ... the text stays in memory for some reason, but you can't really increase the memory usage through it. At least I was not able to produce a case where the following rows would also stay in memory. So you can only increase memory usage if your first row is very, very large, and that doesn't match the described behaviour ...

Edit 04.07.2022: Now analysing a core dump of a full run-through with a smaller dataset ("astronomy"). There seem to be many NavigableString objects left over from BeautifulSoup, but I'm not sure if that means anything. Also, some URLs seem to be left in memory when comparing multiple runs. Although I already checked by memory-profiling the BeautifulSoup commands themselves, there might be some special case in which it could leak memory. (According to https://bugs.launchpad.net/beautifulsoup there also seem to be problems with "recursion depth"; not sure if such errors could also lead to unclean memory.)

Edit 06.07.2022: Just another Jupyter notebook to show that memory is leaking: https://gist.github.com/TheCrazyT/117dd8371f63708e6f655e444f4bfaa4 This time by executing sotoki itself, with garbage collection near the sys.exit. I guess it is kind of strange that the text of row 12409 of 12504 is still within the dump. The last row would be understandable ... but that entry is kind of old. I also had another test notebook that showed that there are still about 24097 dict objects, 13144 list objects and 11848 tuple objects referenced within the core dump. Not so sure I could figure out how/where those are still referenced. My guess is that maybe some are still indirectly referenced by the Global object.
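For reference, a small sketch (not the actual notebook) of how such per-type object counts can be gathered from a live process with the gc module, e.g. right before sys.exit:

import gc
from collections import Counter

def count_live_objects(top: int = 10):
    # Force a collection first so only genuinely reachable objects are counted,
    # then tally the tracked objects by type name.
    gc.collect()
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)  # e.g. [("dict", ...), ("list", ...), ("tuple", ...)]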

TheCrazyT commented 2 years ago

Guess I should run a comparative mem-profile with Python 3.10, because they seem to have changed something about garbage collection in this commit:

https://github.com/python/cpython/commit/59af59c2dfa52dcd5605185263f266a49ced934c

Hm, weird, I just noticed that the tag on the commit is v3.11.0b3 (although somewhere else it said 3.10). So maybe it is not released yet ...

Edit: Well, a lot of packages just won't work with Python 3.11.0b3. The only option left is to get the code of Python 3.8, cherry-pick that commit and cross my fingers that it is compatible 😞.

Update 09.07.2022: I have now been running a patched Python 3.8 with the following patch: https://gist.github.com/TheCrazyT/db4323d9cb3d3cc71b2a69020b1fa535 With the "blender" dataset the peak memory dropped by about 1.2 MB, but I guess more would be possible with more of the modifications that were done in Python 3.11. (At least I still see room for improvement, since the curve from mprof still seems to increase.) It is not much, but relative to the dataset size it might be good enough for the "stackoverflow" dataset. But because I do not have the resources, I'm not able to run it against that.

I guess the biggest problem will be adding the patch to the Dockerfile. The only options I can think of would increase the size and build time of the Docker image dramatically, because you would need to compile two Python extensions (so the image would need compile tools and compile time). Adding the binaries themselves is probably not a good solution either.

Update 11.07.2022: I'm still missing something; the current changes (commits below) do not have much impact. Not really satisfying, to be honest.

albassort commented 2 years ago

Sorry if this is too much to ask, but I would like to take a look, though Jupyter notebooks aren't quite helpful for me.

Can you please write a small Python script, with a provided XML file? (I recommend Wikipedia's XML dumps; they're beefy enough to illustrate trends like this.)

Thank you!

TheCrazyT commented 2 years ago

@Retkid sorry, besides fishing for strings in the core dump and profiling the whole of sotoki, I have no simple script that profiles the memory.

I'm currently looking at the output of "log-malloc2" to figure out if there is something weird (something allocated, but not freed).

It currently leads me to "lxml" inside "xmlBufGrow", but that could also be a red herring.

rgaudin commented 2 years ago

Thanks for the updates. I missed the edits above as they don't trigger a notification.

FYI, a nopic (without any pictures) run of stackoverflow is about to end. This code includes the parser.close() patch but, as expected, there is no visible improvement.

It's important to note that the RAM usage is equivalent to the version with pictures, confirming that images are not involved, which incidentally can exclude libzim itself (for binary entries).

Screen Shot 2022-07-14 at 09 17 33
TheCrazyT commented 2 years ago

@rgaudin

OK ... here is another thing I noticed. The redis package (at the linked commit) actually has a "command stack" (https://github.com/redis/redis-py/blob/c54dfa49dda6a7b3389dc230726293af3ffc68a3/redis/client.py#L2049) that only gets emptied after a commit.

But here is the catch: commit_maybe won't commit as often as you would think it would: https://github.com/openzim/sotoki/blob/692f59b601f2681a76c37351c41f799bb6259c08/src/sotoki/utils/database.py#L496

A commit will only happen if nb_seen is divisible by 1000: https://github.com/openzim/sotoki/blob/692f59b601f2681a76c37351c41f799bb6259c08/src/sotoki/utils/database.py#L82

But since you increase it by (4 + len(post.get("Tags", []))), it takes a bit of luck to land exactly on a multiple of 1000. What if you get 1001, 1002, 1003, 1004, 1005... after incrementing nb_seen by (4 + len(post.get("Tags", [])))? It will need another loop until it gets lucky enough to hit 2000, and so on.

So my guess is that the command_stack just grows and grows, because a commit barely ever happens. (Currently this theory has the problem that you do not see a drop in the graph at all.)

To be honest, I would simply remove the + len(post.get("Tags", [])), but I'm not sure whether it has any special meaning. Another solution would be to reset nb_seen inside should_commit once it passes the threshold and return self.nb_seen > self.commit_every, as sketched below.
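A rough sketch of that idea (the class layout below is simplified and assumed, not the actual sotoki.utils.database code):

class RecorderSketch:
    commit_every = 1000

    def __init__(self):
        self.nb_seen = 0

    def record(self, post):
        # nb_seen grows by a variable amount per post, so it rarely lands
        # exactly on a multiple of 1000 and an exact-divisibility check
        # (nb_seen % 1000 == 0) almost never fires.
        self.nb_seen += 4 + len(post.get("Tags", []))

    def should_commit(self) -> bool:
        # Threshold comparison instead of exact divisibility; reset the
        # counter so the next batch is counted from zero again.
        if self.nb_seen > self.commit_every:
            self.nb_seen = 0
            return True
        return False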

rgaudin commented 2 years ago

Good point; I'd go with self.nb_seen > self.commit_every as well, as it seems safer.

rgaudin commented 2 years ago

Please submit a PR once you have a working version; thanks 🙏

mgautierfr commented 2 years ago

I would go with a new last_commit counter and have something like:

if self.nb_seen > self.last_commit + self.commit_every:
    do_commit(...)
    self.last_commit += self.commit_every

This way we would not reset nb_seen and would still be able to use it for other statistics. (My two cents.)


(Currently this theory has the problem that you do not see a drop in the graph at all.)

This is indeed a limitation of your theory (even if it is a good catch). On such a long run, we should see a few drops from time to time. We will see how memory usage behaves with this fix.

TheCrazyT commented 2 years ago

Well, if I saw it correctly, every thread has its own nb_seen, starting from 0.

Maybe it's just not noticeable if one thread is lucky and hits a commit.

Update:

Oh, by the way, the reason I did not create a pull request yet is that I was just logging the "commit".

For some reason I still only get the output at the end of the "Questions" progress, and that is driving me slightly insane at the moment.

Update again:

I just now noticed that the actual recording phase is within "_QuestionsMeta" (within PostFirstPasser).

But that is near the beginning.

And this means ... I probably still haven't found the main memory-leak problem, I guess.

TheCrazyT commented 2 years ago

I still believe that the Rewriter that gets executed during the HTML generation somehow causes a leak.

Sadly I'm not able to reproduce the problem if I just test the Rewriter on its own. But if you disable the rewrite method, you see that the memory-profile curve becomes flatter.

I already thought that BeautifulSoup might cause the leak, because it does not seem to close the XML parser in case of an error during encoding detection: https://github.com/wention/BeautifulSoup4/blob/03a2b3a9d1fc5877212d9d382a512663f24c887d/bs4/builder/_lxml.py#L227

Although that's true ... it is not causing the leak. (At least I was not able to create an example that would trigger a UnicodeDecodeError, and I somehow doubt that it would cause a steady increase of memory on a normal dataset.)

parvit commented 2 years ago

Hi all, seeing that there did not seem to be progress on this, I started looking at it, using the memray tool and running on a dump of the gaming StackExchange site (I can't run anything more massive).

I've found that there does not seem to be a leak per se, but instead diverging memory usage caused mainly by two culprits, both in native memory and therefore not easily found: the Xapian full-text indexing and the ZIM compression.

The Xapian behaviour is especially bad as it indexes the entire text of all the comments / questions / etc., so it grows very fast with the number of words rather than the number of questions. Disabling file compression also helps keep the memory usage down, as the compression stream is kept alive during the entire run and grows with the size of the document.

Both were enabled by default with no way of disabling them, so I've proposed two PRs, on sotoki and python-scraperlib, to make these two functionalities optional and not enabled by default (enabling them explicitly from the command line).

The data about my run can be found at link, and I'll keep it available for as long as the ticket is open.

If you can, please execute a run with this change on the main StackOverflow site so we can ascertain whether my analysis is actually correct. Thank you.

parvit commented 2 years ago

For reference, this is the usage with the current code in both projects: original

This is with my change and with both functionalities kept disabled: complete_fix

To my understanding, and after viewing the memray reports, it seems that the second graph corresponds to normal behaviour of the program.

rgaudin commented 2 years ago

@parvit, thank you for your contribution, but compression and full-text indexing are enabled by default on purpose, because that's what's wanted 99% of the time. It is still possible to disable them should they not be wanted:

Global.creator = Creator(
...
).config_verbose(True).config_indexing(False, Global.conf.iso_lang_3).config_compression("none")

I am skeptical about compression or indexing having a huge impact on memory, as I believe those work at most at the ZIM-cluster level, which is of fixed size. So the size of the content should not matter much… unless there's a bug in there.

Finally, those features are essential to sotoki, so we don't want to disable them.

I will try to run it with both disabled on a large SE site to confirm or refute your hypothesis.

parvit commented 2 years ago

@rgaudin Sure, you can check for yourself by looking at the analysis in the drive folder I've provided. In the original run, 2/3 of the total memory was allocated by the compression + indexer.

I can change it so as to keep them enabled by default and offer the possibility of disabling them, if you feel that's better for the project.

parvit commented 2 years ago

@rgaudin I hope you do find at some point what you think is the bug; however, I always let myself be guided by data when doing these performance analyses, so I don't think I'll contribute anything else to this, because to me the data was very clear and so was the fix.

Good luck.

kelson42 commented 2 years ago

@parvit IMO you are mixing two different things:

  • The sotoki options and default behaviour: we don't want to change them and we don't need to change them. This has nothing to do with the memory leak.
  • The memory leak itself, which is the topic. You have put forward a theory and provided evidence. We still have to redo the experiment to see if we come to the same conclusion, but we don't need your PRs for that.

parvit commented 2 years ago

Well, I'll admit I jumped to conclusions a bit too fast.

I'll be awaiting the results of your tests.

rgaudin commented 2 years ago

@TheCrazyT PR #268 did not improve memory usage.

baseline redis-commit

Look at the durations, as the second image is not complete (I had to restart the backend).

Will try @parvit's suggestion tomorrow.

TheCrazyT commented 2 years ago

Thanks for testing.

That is a really disappointing result. I must admit that I always tested with the "FakeCreator" that you mentioned in your first comment, and with a much smaller dataset.

I wish you luck in finding the leak.

It probably doesn't have anything to do with the Python code itself, but I'm currently out of ideas. It is also really difficult to create a simple and small proof of concept of the leak.

Update: It could also be something content-related. Since lxml is indirectly used on the posts, something like the "billion laughs" expansion bug might be a thing, but I was not able to test that yet. Nope ... that's not it. It would probably consume way too much memory before a crash; the process wouldn't run through with that (and if it did ... you would see a big drop after the memory allocation, so this is clearly not the situation here).

rgaudin commented 2 years ago

OK, here's the graph of the run with neither indexing nor compression.

baseline

nocomp-noindex

This is very interesting. After the initial hill, the curve only descends. Also, on this 5GB-capped run, the regular code used up to 4.4GB (at the end) while this one peaked at 2.44GB, roughly 2GB less (45% of the total).

Thank you @parvit for bringing this to our attention. I need to process this and understand what exactly is going on before jumping to conclusions.

full netdata report

parvit commented 2 years ago

Happy that my analysis was pointing in a useful direction.

kelson42 commented 2 years ago

@rgaudin @parvit It would be good to know whether it is compression OR indexing that is the reason for the high memory consumption.

parvit commented 2 years ago

@kelson42 From my analysis, both functionalities cause the high memory usage:

  1. Compression: the stream is compressed on the fly while the document is being built, so we keep the associated memory alive and updated for the entire run of the program until it is written to disk.
  2. Indexing: the index is updated for every question being scanned, and all the comments are added as well; it is in fact a full-text index, so its memory is really going to keep growing as the document is scanned.

For point 1, I don't know if this is possible, but it might be circumvented by compressing the generated file only at the end. For point 2, I don't really see any way around it, because it's either on or off and it grows with the total number of words in the source document.

This is assuming that my proposal to allow disabling these two on demand is not considered.

kelson42 commented 2 years ago

@parvit For 1, it does not work like this. If you find any proof that memory is not freed regularly (at the end of each cluster compression), then please point it out.

parvit commented 2 years ago

@kelson42 I could not tell you whether the memory is really being flushed at every cluster, but my run still had 1/4 of the total memory allocated at the end (so maybe there's an actual leak there?).

image

You can find this in the drive folder posted earlier, in the file original.html inside results.zip.

It might be possible that in a longer run than mine the usage from compression is bounded, as you say, but we should then do a run with compression only and see if it aligns with the run where both are disabled.

EDIT: For context, the half on the left is the memory directly associated with Python code, while on the right are the blocks of memory associated with the native code of libzim.

mgautierfr commented 2 years ago

Thanks all for the investigations.

Your remark, @parvit, is interesting. I've taken a fresh look at the Xapian indexing and it seems we were creating a Xapian transaction, preventing Xapian from flushing the data to the filesystem. I've made PR https://github.com/openzim/libzim/pull/719 to fix that. @rgaudin, it would be nice to test again with this fix.

However, we have tested the sotoki scraper with a fake libzim (creating nothing) and we found that memory consumption was still increasing. So even if Xapian is taking (too much) memory, there is probably another leak somewhere on the Python side.

parvit commented 2 years ago

The data from my test did not show any clear leak on the Python side, but if you could indicate how you set up the fake libzim, I may continue the investigation and hopefully find more on that.

Also, my suggestion included the compression change as well; did you also consider that one in your test?

mgautierfr commented 2 years ago

The FakeCreator is here: https://github.com/openzim/sotoki/blob/v0.0.0.1/src/sotoki/utils/shared.py#L20 — @rgaudin may give you more information on using it, but you mostly have to use it instead of the imported Creator in https://github.com/openzim/sotoki/blob/master/src/sotoki/utils/shared.py

Also, my suggestion included the compression change as well; did you also consider that one in your test?

I had a look at the compression side but haven't found anything obvious.

rgaudin commented 2 years ago

Here's the no-comp run that still had indexing. This is strangely identical to the nocomp-noindex one. I am starting a no-index run before testing @mgautierfr's branch.

nocomp
parvit commented 2 years ago

@rgaudin Could you send me instructions on how to use the fake creator (maybe via mail)? Thank you.

rgaudin commented 2 years ago

Here's the noindex version, confirming that indexing has no visible impact on RAM usage but compression is memory intensive. @mgautierfr, can you briefly explain how it works? What should it be compressing, and how, so we can assess whether this is just normal behaviour that we incorrectly didn't expect, or whether there's a problem?

noindex
mgautierfr commented 2 years ago

This is probably not expected.

First, it is normal that libzim memory usage increases as items are added. This is what I was explaining in https://github.com/openzim/python-libzim/issues/117. I suspect that the slow slope increase at the beginning and end is due to this normal memory growth.

But the big slope in the middle is surprising.

How compression is done:

For compressing, we create a compressStream just before a cluster compression and free it just after. We don't reuse contexts.

It could be interesting to have the log of the creator. It gives some (timed) information about what has been processed, so we could see what is being processed during the middle slope.

It may also be interesting to create a ZIM with lzma compression. If the memory usage is not the same, it may be related to zstd compression itself.
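As a rough Python-side analogy of the per-cluster pattern described above (using the zstandard package, not libzim's actual C++ code; the compression level 19 is only an example value):

import zstandard

def compress_cluster(cluster_bytes: bytes, level: int = 19) -> bytes:
    # A fresh compression context per cluster; at high levels it allocates a
    # sizeable amount of internal state, and it becomes collectable again as
    # soon as the function returns.
    cctx = zstandard.ZstdCompressor(level=level)
    return cctx.compress(cluster_bytes)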

parvit commented 2 years ago

For compressing, we create a compressStream just before a cluster compression and free it just after. We don't reuse contexts.

image

Your comment then indicates that this memory block at the end of my run is surely leaking: the free is not actually freeing the context as expected (but is probably keeping the memory internally for reuse).

mgautierfr commented 2 years ago

zim::writer::taskRunner is the name of the worker thread. Your capture probably just caught a frame where the worker thread was closing the cluster (zim::writer::Cluster::close) and thus compressing the content before it is written to the ZIM file.

parvit commented 2 years ago

@mgautierfr I think you might be misunderstanding the meaning of the flamegraph.

The frame at the bottom, "ZSTD_resetCCtx_internal", is the one that still holds a reference to memory at the end of the program run, and that call accounted for nearly 1/3 of the memory of the run (more precise data is contained in the files I uploaded; I can post them again if you need).

I would also suggest that we all use the memray tool; without it this is kind of a guessing game.

mgautierfr commented 2 years ago

This is surprising, as valgrind doesn't detect this.

mgautierfr commented 2 years ago

more precise data is contained in the files I uploaded; I can post them again if you need

Please reupload it

parvit commented 2 years ago

This is surprising, as valgrind doesn't detect this.

Memray is specific to Python, so I think maybe valgrind, being more general, has more chances of seeing that as a false positive? If you can, please share the valgrind command line and output.

Please reupload it

Attached you'll find the aggregated results only; the raw data is quite big, but I can arrange it if you want.

results.zip

mgautierfr commented 2 years ago

We compress the content with a high compression level, which means that zstd allocates a lot of internal data. Each compression stream allocates around 90MB (not including the compressed data itself). By default, there are 4 compression workers, so compression may consume up to 360MB (with the caveat that workers are shared between compression and indexation).

From https://github.com/openzim/sotoki/issues/243#issuecomment-1194302080, it seems that your peak heap size is about 750MB.

I may have misunderstood the memray report, but from https://bloomberg.github.io/memray/flamegraph.html it seems that the flamegraph shows the memory usage at the point of peak memory usage. So it is normal that a lot of memory is allocated for compression and indexing.

(And that is a constraint of this issue: libzim has a base memory allocation which is substantial. Using a memory usage profiler on a "small" dataset may lead to wrong leads.)

parvit commented 2 years ago

@mgautierfr OK, I did assume wrong then, but checking the page you linked pointed me to the --leaks option, and I'll be sure to use it.

Thanks for the check.

parvit commented 2 years ago

Sorry for the wait; this run took a really long time (18h+) even with the small site.

Here's the run with memray tracking only the memory leaks. Note that to get accurate tracking I had to set the Python memory allocator to malloc instead of the default pymalloc (which could also be why valgrind was confused by the results).

test_malloc.zip.001 test_malloc.zip.002 test_malloc.zip.003 test_malloc.zip.004 (If you can't extract them, rename the files with the ".xxx" after the ".zip"; I had to change it to upload to GitHub.)

The following is my analysis of what I think are the most significant leaks that could, over a long run, contribute to the memory ballooning.

Note: This is with both compression and indexing disabled; I'll see if I can do the run with both on.

=======

The reported leaks not listed here seem to me too small to contribute to the issue, or are mostly static data allocated by the various imports / native classes.

  • unique_ptr<> instances from the Dirent pool do not seem to ever be released back into the pool.

  • The (SAX) XML parser seems to never be collected by the GC, even when the parser.close() call is made and no native memory is attached.

  • A couple of illustrations are downloaded and converted even though the no-image option was specified, and the formats used for the conversion are kept in memory.

  • The parsers remain in memory, and every request to parse the tags leaks some memory.

  • SSL connection contexts are never released. In fact, most of the calls generated under Global.init(get_site(self.domain)) leak; maybe that points to the connections not being closed / released?

  • Internals of the iso639 package seem to leak the dictionary on every request for a language code.