libgit2 / pygit2

Python bindings for libgit2
https://www.pygit2.org/

Excessive memory usage when accessing blob.size from a lot of blobs #1124

Closed SoniEx2 closed 5 months ago

SoniEx2 commented 2 years ago

When scanning a repo like this:

import pygit2

# `repo` is an already opened pygit2.Repository
todocommits = set()

for ref in repo.references:
    ref = repo.references.get(ref)
    todocommits.add(ref.peel(pygit2.Commit))

todotrees = set()

while todocommits:
    c = todocommits.pop()
    todotrees.add(c.tree)
    todocommits.update(c.parents)

todoblobs = {}

while todotrees:
    t = todotrees.pop()
    for obj in t:
        if isinstance(obj, pygit2.Blob):
            blobmeta = todoblobs.setdefault(obj, [])
            blobmeta += [(obj.filemode, obj.name)]
            # obj.size
        elif isinstance(obj, pygit2.Tree):
            todotrees.add(obj)
        else:
            raise TypeError

# while todoblobs: ...

Accessing obj.size at the commented-out point makes the difference between the process getting killed by the oom_killer and using only about 160MiB of peak RAM.

goDeni commented 6 months ago

This is because it reads the whole file.

I captured an strace log to make sure of this:

access(".../objects/41/043eaf7a378789fc54d5e7ccd5f7b878a9dba7", F_OK) = 0
newfstatat(AT_FDCWD, ".../objects/41/043eaf7a378789fc54d5e7ccd5f7b878a9dba7", {st_mode=S_IFREG|0444, st_size=1822988139, ...}, 0) = 0
openat(AT_FDCWD, ".../objects/41/043eaf7a378789fc54d5e7ccd5f7b878a9dba7", O_RDONLY|O_CLOEXEC) = 3
mmap(NULL, 1822990336, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7ffa38b2e000
read(3, "x\1Tzct\245M\320m\214\311\304\2669\261}b\333\266m\333\2661\261&\236dbs2\261"..., 1822988139) = 1822988139
close(3)                                = 0

The read syscall shows that 1822988139 bytes were read!

The same problem occurs with the .is_binary attribute of the Blob type.
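For context, the path in the strace output is a loose object: a zlib-compressed file whose decompressed content starts with a short header of the form `<type> <size>\0`. The sketch below uses only the standard library to read that header from the first few bytes of such a file. It is an illustration of the on-disk format, not what pygit2 or libgit2 does, and `loose_object_header` is a hypothetical helper name.

```python
import zlib


def loose_object_header(path, chunk_size=64):
    """Return (type, size) of a loose git object.

    Only as many compressed bytes as are needed to reach the
    NUL-terminated header are read from the file.
    """
    decompressor = zlib.decompressobj()
    header = b""
    with open(path, "rb") as f:
        while b"\0" not in header:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("no object header found")
            header += decompressor.decompress(chunk)
    # The header looks like b"blob 1234", followed by a NUL byte.
    kind, size = header.split(b"\0", 1)[0].split(b" ")
    return kind.decode("ascii"), int(size)
```

This only applies to loose objects. Objects stored in packfiles use a different layout, so the helper is not a general replacement for obj.size.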

goDeni commented 6 months ago

Please don't ignore this problem, @jdavid.

jdavid commented 6 months ago

In the sample code above try replacing:

blobmeta = todoblobs.setdefault(obj, [])

With:

blobmeta = todoblobs.setdefault(obj.id, [])

What happens is that getting obj.size requires loading the libgit2 object, and pygit2 keeps a reference to it in obj. That reference is freed when obj is destroyed. But because obj is kept as a key in todoblobs, this won't happen until the end of the program.
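A minimal sketch of the tree walk with that change applied, keyed by obj.id so that no Blob object outlives its loop iteration. The function name `collect_blob_meta` and the `seen_trees` set are illustrative additions, not part of the original script. The sketch checks `type_str` instead of using `isinstance`, assuming a pygit2 version in which objects expose that attribute.

```python
def collect_blob_meta(root_trees):
    """Map blob id -> list of (filemode, name) for every blob under the trees.

    Only ids are stored, so each Blob wrapper (and anything it has
    loaded) can be released as soon as the loop moves on.
    """
    blobs = {}
    seen_trees = set()  # ids of trees already walked (illustrative addition)
    stack = list(root_trees)
    while stack:
        tree = stack.pop()
        if tree.id in seen_trees:
            continue
        seen_trees.add(tree.id)
        for obj in tree:
            if obj.type_str == "blob":
                # Key by id rather than by the object itself.
                blobs.setdefault(obj.id, []).append((obj.filemode, obj.name))
            elif obj.type_str == "tree":
                stack.append(obj)
            else:
                raise TypeError(obj.type_str)
    return blobs
```

With the dict keyed by id, sizes can be looked up afterwards one object at a time, for example with `repo[blob_id].size` inside a loop, so each loaded blob is freed before the next one is read. Because each tree is walked only once, a blob that sits in the same unchanged tree across many commits is recorded once rather than once per commit.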