Off the top of my head, some things that you can do to try to mitigate the problem:
Thanks for the quick advice @ajnavarro :bowing_man:
As @r0mainK mentioned, we mitigated the problem already (by using gitbase for blobs and the bblfsh Python client to then extract UASTs). This is good enough for now, but we will need to scale this effort up in the future.
Also, if the problem only appears when parsing UASTs, you can change the GITBASE_UAST_CACHE_SIZE env variable from its default of 10k to a lower number.
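For instance, a minimal sketch of overriding it when launching the gitbase container (the docker run flags other than -e are assumptions based on a standard setup, not quoted from this thread):

# Hypothetical launch command; the point is only the -e GITBASE_UAST_CACHE_SIZE override.
docker run -d --name gitbase -p 3306:3306 \
  -e GITBASE_UAST_CACHE_SIZE=1000 \
  -v /path/to/repos:/opt/repos \
  srcd/gitbase:latest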
@ajnavarro unfortunately, still on the same example (Python files in Pytorch's tagged references):
mysql> SELECT COUNT((t.blob_hash, t.file_path)) FROM (
-> SELECT DISTINCT
-> cf.blob_hash,
-> cf.file_path,
-> LANGUAGE(cf.file_path) as lang
-> FROM repositories r
-> NATURAL JOIN refs rf
-> NATURAL JOIN commit_files cf
-> NATURAL JOIN files f
-> WHERE r.repository_id = 'pytorch'
-> AND is_tag(rf.ref_name)
-> AND lang ='Python' ) t;
+-----------------------------------+
| COUNT((t.blob_hash, t.file_path)) |
+-----------------------------------+
| 4211 |
+-----------------------------------+
mysql> SELECT COUNT((t.blob_hash, t.file_path)) FROM (
-> SELECT DISTINCT
-> cf.blob_hash,
-> cf.file_path,
-> LANGUAGE(cf.file_path) as lang
-> FROM repositories r
-> NATURAL JOIN refs rf
-> NATURAL JOIN commit_files cf
-> NATURAL JOIN files f
-> WHERE r.repository_id = 'pytorch'
-> AND is_tag(rf.ref_name)
-> AND lang ='Python'
-> AND NOT IS_BINARY(f.blob_content)
-> AND f.blob_size < 1000000
-> ) t;
+-----------------------------------+
| COUNT((t.blob_hash, t.file_path)) |
+-----------------------------------+
| 4211 |
+-----------------------------------+
As you can see, the issue is not mitigated by these additional clauses, since they filter out nothing (both queries return 4211 rows). Furthermore, the intent is to process all files out of tagged references, not only HEAD, hence the is_tag clause.
Additionally, even when we try to split large queries into smaller ones in order to parse fewer blobs per query, for example parsing language by language or ref by ref, it ends up changing nothing, since the memory is never released.
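For illustration, a minimal sketch of the ref-by-ref split, using a hypothetical tag name (one such query would be run per tag returned by SELECT ref_name FROM refs WHERE is_tag(ref_name)):

-- 'refs/tags/v1.0.0' is a placeholder; substitute each tagged reference in turn.
SELECT cf.blob_hash, cf.file_path,
uast(f.blob_content, LANGUAGE(cf.file_path)) AS u
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
AND rf.ref_name = 'refs/tags/v1.0.0';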
EDIT: I had not seen your comment on GITBASE_UAST_CACHE_SIZE. I thought it would be restricted by setting the cache option when launching gitbase, gonna check if it works :)
@ajnavarro so setting GITBASE_UAST_CACHE_SIZE=0 seemed to make things better: I was able to parse all 4211 Python blobs. However, I could still see the memory increase to the limit, just slower. When I tried parsing all blobs from tagged references (after restarting the container and setting the memory to 4GB), it maxed out memory and then crashed after parsing 4393 blobs (out of 20k).
It seems there is some other caching going on that is controlled neither by this variable nor (I guess) by GITBASE_CACHESIZE_MB, since that one defaults to 512MB. That, or some kind of memory leak, although I wouldn't know.
@r0mainK we have several LRU caches whose limits are based on the number of elements, not on the total size of those elements. We should have a look into that and find a more user-friendly way to set cache limits.
Okay great. Anyway, as Hugo said, for the moment we are bypassing Gitbase, even though that comes with its share of problems. Ideally we would like to only use Gitbase, so I'll look forward to trying to solve this issue with these limits if you find the time :)
EDIT: my bad, I misclicked and closed the issue -_- I think it was automatically moved to TODO from Backlog by the automation.
This will probably be fixed by https://github.com/src-d/gitbase/issues/929, so let's mark it as blocked until there is a release containing that.
Currently there's PR #957 adding it to gitbase. When it's merged and released, let's give it a try to see if this is solved.
I just tested this with the gitbase image built from master, using the same setup described by @r0mainK but adding -e MAX_MEMORY=1024 to the gitbase container to limit the cache memory to 1GiB, and it seems to work. The memory usage of the gitbase container did not exceed ~1.3GiB, the block I/O did not go crazy, and the query actually finished.
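A minimal sketch of that launch command (only the -e MAX_MEMORY=1024 flag is taken from this comment; the other flags are assumptions based on a standard setup):

# MAX_MEMORY is in MiB here, so 1024 caps the cache memory at 1GiB.
docker run -d --name gitbase -p 3306:3306 \
  -e MAX_MEMORY=1024 \
  -v /path/to/repos:/opt/repos \
  srcd/gitbase:latest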
It is not yet released, but you can take an early look if you need it, @r0mainK.
@r0mainK apparently it's already released in v0.24.0-beta3. Can you give it a try and see if it works for you?
Just FYI, @r0mainK comes back from holidays next Monday.
No problem @m09
UPDATE: beta3 does not fix the error; we're cutting an rc1 version with the complete fix.
Released! You can use this version to test: https://github.com/src-d/gitbase/releases/tag/v0.24.0-rc1
I just tested it. For the query above, it seems to be working: I saw on docker stats that the memory usage stagnates around 1.13 GB after hitting the limit :+1:
For the following query, which requires some caching before returning the result:
SELECT
cf.file_path,
cf.blob_hash,
LANGUAGE(cf.file_path) as lang,
COUNT(uast_extract(uast(f.blob_content, LANGUAGE(cf.file_path), '//uast:String'), "Value"))
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
AND is_tag(rf.ref_name)
AND lang ='Python'
GROUP BY lang, cf.file_path, cf.blob_hash;
The memory stagnated at about 1.26 GB, but the query still finished. Memory did not increase even after subsequent queries. So yeah, looks good guys :)
@r0mainK shall we close this issue then? Thanks!
Yes, I think you can. I believe there is still some leaking, as depending on the query the amount of memory can still go over the hard limit set in the environment variables. However, it is nothing compared to before, so IMO this is resolved.
It's very hard to accurately ensure it does not go over the hard limit, so it's more of a soft limit. But it should not be much bigger than the limit that has been set.
Closing the issue as resolved then. Thanks!
Issue
In the context of topic modeling experiments, @m09 and I tried to use Gitbase to parse all blobs in the tagged references of a given repository, in order to extract all identifiers, comments and literals. However, we have not been able to use Gitbase successfully to do so, and have had to switch to doing the parsing client side.
The reason is the behavior we see when querying Gitbase, described in the steps to reproduce below: memory usage climbs sharply until the container hits its limit, block I/O spikes, and the container crashes.
We still see the same behavior when retrieving only the blob contents from Gitbase; however, the memory consumed is not an issue there, as it is much less than when parsing UASTs. We inferred that there was some caching going on and, after talking about the issue on the dev-processing channel, we tried to disable the caching; however, it changed nothing. Javi told us that the caching we had disabled was the go-git cache, so it is probably something else. What we don't understand is why we cannot get rid of the behavior, i.e. why, once a blob has been parsed and returned client side, it seemingly remains in memory.
Steps to reproduce
Launch gitbase and babelfish containers:
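A minimal sketch of such commands, assuming the standard bblfsh/bblfshd and srcd/gitbase images and the 2 GB memory limit mentioned below (the exact flags are assumptions, not quoted from the original setup):

# Babelfish daemon; --privileged is needed so it can install driver containers.
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd

# gitbase, linked to bblfshd and capped at 2 GB of container memory.
docker run -d --name gitbase -p 3306:3306 --link bblfshd:bblfshd \
  -e BBLFSH_ENDPOINT=bblfshd:9432 \
  -m 2g \
  -v /path/to/repos:/opt/repos \
  srcd/gitbase:latest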
With /path/to/repos pointing to a repository, for instance pytorch. Then, open two more terminals to monitor what's happening with docker stats, and run a UAST-parsing query against pytorch, using for example the MySQL client. You should see the memory usage of the gitbase container increase sharply until hitting 2 GB, then a heavy increase in BLOCK I/O, and finally the container will crash.
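A representative query of this kind, modeled on the UAST-extraction query quoted earlier in the thread (assumed here, not the verbatim original):

-- Forces a UAST parse per Python blob in every tagged reference of pytorch.
SELECT
cf.file_path,
uast(f.blob_content, LANGUAGE(cf.file_path)) AS u
FROM repositories r
NATURAL JOIN refs rf
NATURAL JOIN commit_files cf
NATURAL JOIN files f
WHERE r.repository_id = 'pytorch'
AND is_tag(rf.ref_name)
AND LANGUAGE(cf.file_path) = 'Python';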