gitpython-developers / GitPython

GitPython is a python library used to interact with Git repositories.
http://gitpython.readthedocs.org
BSD 3-Clause "New" or "Revised" License

IOError: [Errno 24] Too many open files #421

Open cptanalatriste opened 8 years ago

cptanalatriste commented 8 years ago

I'm using GitPython to do data mining on Git repositories on a Windows 10 laptop. To retrieve the stats for commits (which might be in different repositories), I do the following:

    import platform

    import git

    # Tried this. It didn't work.
    if platform.system() == 'Windows':
        import win32file
        win32file._setmaxstdio(2048)

    # About 20,000 commits
    commits = get_commits()
    for commit_sha, repository in commits:
        repository_location = REPO_LOCATION + repository
        repository = git.Repo(repository_location)
        commit = repository.rev_parse(commit_sha)

        total_stats = commit.stats.total
        process_stats(total_stats)

        # Tried this also. It doesn't work.
        del total_stats
        del repository

However, I get the following error message every time:

  File "my_code.py", line 126, in my_code
  File "\Anaconda2\lib\site-packages\git\objects\commit.py", line 229, in stats
  File "\Anaconda2\lib\site-packages\gitdb\util.py", line 237, in __getattr__
  File "\Anaconda2\lib\site-packages\git\objects\commit.py", line 141, in _set_cache_
  File "\Anaconda2\lib\site-packages\git\db.py", line 45, in stream
  File "\Anaconda2\lib\site-packages\git\cmd.py", line 982, in stream_object_data
  File "\Anaconda2\lib\site-packages\git\cmd.py", line 948, in _get_persistent_cmd
  File "\Anaconda2\lib\site-packages\git\cmd.py", line 878, in _call_process
  File "\Anaconda2\lib\site-packages\git\cmd.py", line 604, in execute
  File "\Anaconda2\lib\subprocess.py", line 732, in __init__
IOError: [Errno 24] Too many open files

Is there a way to free resources on every loop iteration to avoid this error?

Byron commented 8 years ago

I just spent some time looking for something along the lines of calling release() on the odb instance of the repository, but could only conclude that no such functionality exists. When I wrote GitPython for py2.X, I was counting on the somewhat deterministic destruction of objects, and built everything around that. However, this is simply not the case anymore (if it ever was ...), so GitPython does have a problem with releasing system resources properly in some cases.

A known workaround for this issue is to fork the code into its own subprocess, so that the operating system cleans it up when it is done. Doing this in your case might add a lot of complexity.
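A minimal sketch of that workaround, assuming Python 3.7+ and that GitPython is importable by the child interpreter. It groups the commits by repository and hands each batch to a short-lived child process, so any handles GitPython leaves open are reclaimed when the child exits. The names `STATS_WORKER`, `run_in_child`, `group_by_repo`, `collect_stats`, and `batch_size` are hypothetical and not part of GitPython.

```python
import json
import subprocess
import sys
from collections import defaultdict

# Child-side script (hypothetical): opens one repository, prints
# {sha: totals} as JSON, then exits. When the child exits, the OS
# closes every handle it opened.
STATS_WORKER = """
import json, sys
import git  # GitPython, imported only in the child

repo = git.Repo(sys.argv[1])
totals = {sha: repo.commit(sha).stats.total for sha in sys.argv[2:]}
print(json.dumps(totals))
"""


def run_in_child(script, args):
    """Run `script` in a fresh Python interpreter and decode its JSON output."""
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def group_by_repo(commits):
    """Turn (sha, repository) pairs into {repository: [sha, ...]}."""
    grouped = defaultdict(list)
    for sha, repository in commits:
        grouped[repository].append(sha)
    return dict(grouped)


def collect_stats(commits, repo_root, batch_size=200):
    """Yield (sha, totals) using one short-lived child process per batch."""
    for repository, shas in group_by_repo(commits).items():
        # Batching keeps the child's command line well below OS length limits.
        for start in range(0, len(shas), batch_size):
            batch = shas[start:start + batch_size]
            totals = run_in_child(STATS_WORKER, [repo_root + repository, *batch])
            for sha in batch:
                yield sha, totals[sha]
```

With the loop from the question, this would be used roughly as `for sha, total_stats in collect_stats(get_commits(), REPO_LOCATION): process_stats(total_stats)`. Starting one child per batch rather than one per commit keeps the process start-up cost manageable for about 20,000 commits.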

Something you could try is to use libgit2 directly, which will by its very nature provide methods to release resources explicitly.

Also, I am afraid there is no fix for this issue at this time, unless someone is willing to dig in and ensure that the respective release methods are added to the types in question.
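As a sketch of how such a release method could be used from calling code: to my knowledge, later GitPython releases added a `Repo.close()` method, so the example below assumes the installed version has one. The helper `stats_for` and its `open_repo` parameter are hypothetical; the only requirement is that the object returned by `open_repo` has `commit()` and `close()`.

```python
from contextlib import closing


def stats_for(open_repo, repo_path, commit_sha):
    """Open a repository, read one commit's totals, and always close the repo."""
    # closing() calls .close() on exit, even if an exception is raised.
    with closing(open_repo(repo_path)) as repo:
        return repo.commit(commit_sha).stats.total


# Intended use, assuming a GitPython version that provides Repo.close():
#   import git
#   total_stats = stats_for(git.Repo, REPO_LOCATION + repository, commit_sha)
```

Whether this alone avoids the error depends on the installed version actually releasing its persistent `git` processes in `close()`, which I have not verified for every release.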

abourget commented 1 year ago

I'm hitting this, 9 years later. I can't believe there's nothing that can be done!