src-d / gitbase

SQL interface to git repositories, written in Go. https://docs.sourced.tech/gitbase
Apache License 2.0
2.07k stars 123 forks

In-memory caching leads to crash #922

Closed r0mainK closed 5 years ago

r0mainK commented 5 years ago

Issue

In the context of topic modeling experiments, @m09 and I tried to use Gitbase to parse all blobs in the tagged references of a given repository, in order to extract all identifiers, comments, and literals. However, we have not been able to do this successfully with Gitbase, and have had to switch to doing the parsing client side.

The reason for that is that, when querying Gitbase, we see the following behavior:

  1. An increase in memory usage.
  2. No decrease after time goes by.
  3. When all available memory is consumed, an increase in block I/O and a quasi-stagnation of the memory consumed by Gitbase at 99.999...%, indicating heavy use of swap.
  4. A server crash if the query runs for too long past that point.

We still see the same behavior when retrieving only the blob contents from Gitbase; however, the memory consumed is not an issue there, as it is much less than when parsing UASTs. We have inferred that there was some caching going on, and after talking about the issue on the dev-processing channel, we tried to disable the caching, but it changed nothing. Javi told us that the caching we had disabled was the go-git cache, so it is probably something else.

What we don't understand is why we cannot get rid of the behavior, i.e. why once a blob has been parsed and returned client side it seemingly remains in memory.

Steps to reproduce

Launch gitbase and babelfish containers:

docker run -d --rm --name bblfshd --privileged -p 9432:9432 -m 4g bblfsh/bblfshd:v2.14.0-drivers
docker run -d --rm --name gitbase -p 3306:3306 --link bblfshd:bblfshd -e BBLFSH_ENDPOINT=bblfshd:9432 -m 2g -v /path/to/repos:/opt/repos srcd/gitbase:latest

With /path/to/repos pointing to a repository, for instance pytorch. Then open two more terminals to monitor what's happening with docker stats, and run queries like the following one (here for pytorch), using e.g. the MySQL client:

   SELECT
        cf.file_path,
        cf.blob_hash,
        LANGUAGE(cf.file_path) as lang,
        uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value")
    FROM repositories r
        NATURAL JOIN refs rf
        NATURAL JOIN commit_files cf
        NATURAL JOIN files f
    WHERE r.repository_id = 'pytorch'
        AND is_tag(rf.ref_name)
        AND lang ='Python'

You should see the memory usage of the gitbase container increase sharply until hitting 2 GB, then a heavy increase in BLOCK I/O, and finally the container will crash.

ajnavarro commented 5 years ago

Off the top of my head, some things you can do to try to mitigate the problem:

m09 commented 5 years ago

Thanks for the quick advice @ajnavarro :bowing_man:

As @r0mainK mentioned, we already mitigated the problem (by using gitbase for blobs and the bblfsh Python client to then extract UASTs). This is good enough for now, but we will need to scale up this effort in the future.

ajnavarro commented 5 years ago

Also, if the problem only occurs when parsing UASTs, you can change the GITBASE_UAST_CACHE_SIZE env variable from its default of 10k to a lower number.

r0mainK commented 5 years ago

@ajnavarro unfortunately, still on the same example (Python files in Pytorch's tagged references):

mysql> SELECT COUNT((t.blob_hash, t.file_path)) FROM (
    ->    SELECT DISTINCT
    ->         cf.blob_hash,
    ->         cf.file_path,
    ->         LANGUAGE(cf.file_path) as lang
    ->     FROM repositories r
    ->         NATURAL JOIN refs rf
    ->         NATURAL JOIN commit_files cf
    ->         NATURAL JOIN files f
    ->     WHERE r.repository_id = 'pytorch'
    ->         AND is_tag(rf.ref_name)
    ->         AND lang ='Python' ) t;
+-----------------------------------+
| COUNT((t.blob_hash, t.file_path)) |
+-----------------------------------+
|                              4211 |
+-----------------------------------+
mysql> SELECT COUNT((t.blob_hash, t.file_path)) FROM (
    ->    SELECT DISTINCT
    ->         cf.blob_hash,
    ->         cf.file_path,
    ->         LANGUAGE(cf.file_path) as lang
    ->     FROM repositories r
    ->         NATURAL JOIN refs rf
    ->         NATURAL JOIN commit_files cf
    ->         NATURAL JOIN files f
    ->     WHERE r.repository_id = 'pytorch'
    ->         AND is_tag(rf.ref_name)
    ->         AND lang ='Python' 
    ->         AND NOT IS_BINARY(f.blob_content)
    ->         AND f.blob_size < 1000000
    ->     ) t;
+-----------------------------------+
| COUNT((t.blob_hash, t.file_path)) |
+-----------------------------------+
|                              4211 |
+-----------------------------------+

As you can see, the issue is not mitigated by these additional clauses, as they don't filter anything out. Furthermore, the intent is to process all files from tagged references, not only HEAD, hence the is_tag clause.

Additionally, even when we try to split large queries into smaller ones in order to parse fewer blobs per query, for example parsing language by language or ref by ref, it ends up changing nothing, since the memory is never released.

EDIT: I had not seen your comment on GITBASE_UAST_CACHE_SIZE; I thought it would be restricted by setting the cache option when launching gitbase. I'm going to check if it works :)

r0mainK commented 5 years ago

@ajnavarro so setting GITBASE_UAST_CACHE_SIZE=0 seemed to make things better: I was able to parse all 4211 Python blobs, but I could still see the memory increase to the limit, just more slowly. When I tried parsing all blobs from tagged references (after restarting the container and raising the memory limit to 4 GB), it maxed out memory and then crashed after parsing 4393 of the roughly 20k blobs.

It seems there is some other caching going on that is controlled neither by this variable nor (I guess) by GITBASE_CACHESIZE_MB, since that one defaults to 512 MB. That, or some kind of memory leak, although I wouldn't know.

ajnavarro commented 5 years ago

@r0mainK we have several LRU caches that are bounded by the number of elements, not by the total size of those elements. We should look into that and find a more uniform, user-friendly way to set cache limits.
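The problem with count-bounded caches can be sketched with a minimal LRU built on `container/list`. This is an illustration of the behaviour described above, not gitbase's actual cache code: eviction is driven purely by the number of entries, so the bytes held can still grow arbitrarily with the size of individual values (e.g. large UASTs).

```go
package main

import (
	"container/list"
	"fmt"
)

// lru is a minimal count-bounded LRU cache: at most max entries,
// with no limit on how many bytes those entries hold.
type lru struct {
	max   int
	order *list.List               // front = most recently used
	items map[string]*list.Element // key -> element holding an *entry
}

type entry struct {
	key string
	val []byte
}

func newLRU(max int) *lru {
	return &lru{max: max, order: list.New(), items: map[string]*list.Element{}}
}

// Put inserts or refreshes a key, evicting the least recently used
// entry once the count limit is exceeded.
func (c *lru) Put(key string, val []byte) {
	if el, ok := c.items[key]; ok {
		c.order.MoveToFront(el)
		el.Value.(*entry).val = val
		return
	}
	c.items[key] = c.order.PushFront(&entry{key, val})
	if c.order.Len() > c.max {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
}

// Bytes reports the total payload size currently held, which the
// count-based limit does nothing to bound.
func (c *lru) Bytes() int {
	n := 0
	for el := c.order.Front(); el != nil; el = el.Next() {
		n += len(el.Value.(*entry).val)
	}
	return n
}

func main() {
	c := newLRU(2) // at most 2 entries, regardless of their size
	c.Put("small", make([]byte, 10))
	c.Put("huge", make([]byte, 10<<20)) // a 10 MiB value still fits: count, not size, is the limit
	fmt.Println(c.order.Len(), c.Bytes())
}
```

With a default of 10k entries and multi-megabyte UASTs per entry, such a cache can easily exceed a 2 GB container limit, which matches the behaviour reported in this issue.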

r0mainK commented 5 years ago

Okay, great. Anyway, as Hugo said, for the moment we are bypassing Gitbase, even though that comes with its share of problems. Ideally we would like to use only Gitbase, so I'll look forward to trying to solve this issue with these limits if you find the time :)

EDIT: my bad, I misclicked and closed the issue -_-. I think it was automatically moved from Backlog to TODO by the board automation.

erizocosmico commented 5 years ago

This will probably be fixed by https://github.com/src-d/gitbase/issues/929 so let's put it on blocked until there is a release containing that.

erizocosmico commented 5 years ago

Currently there is a PR, #957, adding it to gitbase. When it's merged and released, let's give it a try and see if this is solved.

agarciamontoro commented 5 years ago

I just tested this with the gitbase image built from master, using the same setup described by @r0mainK but adding -e MAX_MEMORY=1024 to the gitbase container to limit the cache memory to 1 GiB, and it seems to work. The memory usage of the gitbase container did not exceed ~1.3 GiB, the block I/O did not go crazy, and the query actually finished.

It is not yet released, but you can take an early look if you need it, @r0mainK.

erizocosmico commented 5 years ago

@r0mainK apparently it's already released in v0.24.0-beta3. Can you give it a try and see if it works for you?

m09 commented 5 years ago

Just FYI @r0mainK comes back from holidays next monday.

erizocosmico commented 5 years ago

No problem @m09

erizocosmico commented 5 years ago

UPDATE: beta3 does not fix the error; we're cutting an rc1 version with the complete fix.

agarciamontoro commented 5 years ago

Released! You can use this version to test: https://github.com/src-d/gitbase/releases/tag/v0.24.0-rc1

r0mainK commented 5 years ago

I just tested it. For the original query above, it seems to be working: I saw on docker stats that the memory usage stagnates around 1.13 GB after hitting the limit :+1:

For the following query, which requires some caching before returning the result:

   SELECT
        cf.file_path,
        cf.blob_hash,
        LANGUAGE(cf.file_path) as lang,
        COUNT(uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value"))
    FROM repositories r
        NATURAL JOIN refs rf
        NATURAL JOIN commit_files cf
        NATURAL JOIN files f
    WHERE r.repository_id = 'pytorch'
        AND is_tag(rf.ref_name)
        AND lang ='Python'
    GROUP BY lang, cf.file_path, cf.blob_hash;

The memory stagnated at about 1.26 GB, but the query still finished, and memory did not increase even after subsequent queries. So yeah, looks good, guys :)

ajnavarro commented 5 years ago

@r0mainK shall we close this issue then? Thanks!

r0mainK commented 5 years ago

Yes, I think you can. I believe there is still some leaking, as depending on the query the memory can still go over the hard limit set in the environment variables. However, it is nothing compared to before, so IMO this is resolved.

erizocosmico commented 5 years ago

It's very hard to guarantee that memory never goes over the configured limit, so it's really more of a soft limit. But it should not end up much bigger than the limit that has been set.
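One reason a byte limit behaves as a soft limit can be sketched as follows. This is a hypothetical illustration, not gitbase's implementation: if the size check only runs after an entry has been added, a single large insert pushes usage past the configured maximum before eviction kicks in, and a single entry can exceed the limit on its own.

```go
package main

import "fmt"

// softLimitCache enforces a byte budget only after insertion, so
// usage can transiently (or, for oversized entries, permanently)
// exceed maxBytes.
type softLimitCache struct {
	maxBytes int
	used     int
	entries  map[string][]byte
}

func (c *softLimitCache) Put(key string, val []byte) {
	c.entries[key] = val
	c.used += len(val)
	// Only now do we react to the limit: usage may already be past it.
	for c.used > c.maxBytes && len(c.entries) > 1 {
		for k, v := range c.entries { // evict an arbitrary victim (not the new key)
			if k == key {
				continue
			}
			c.used -= len(v)
			delete(c.entries, k)
			break
		}
	}
}

func main() {
	c := &softLimitCache{maxBytes: 100, entries: map[string][]byte{}}
	c.Put("a", make([]byte, 90))
	c.Put("b", make([]byte, 150)) // usage hits 240 before eviction runs
	fmt.Println(c.used)           // 150 after evicting "a": still above the 100-byte budget
}
```

Checking the budget before allocating would tighten this, but for values whose size is only known after parsing (like UASTs), some overshoot is hard to avoid, which is consistent with the behaviour @r0mainK reports.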

ajnavarro commented 5 years ago

Closing the issue as resolved then. Thanks!