jelmer / dulwich

Pure-Python Git implementation
https://www.dulwich.io/

Dulwich's fetch hangs #799

Open dragos-a opened 4 years ago

dragos-a commented 4 years ago

Hi, we are using Dulwich (0.20.6) to work with git repos. We are connecting to a fairly large repo using HttpGitClient and have a thread periodically doing pull/fetch operations.

After a while, for this specific repo, the fetch hangs. We debugged it, and it hangs at the following line in HttpGitClient._http_request: resp = self.pool_manager.request("POST", url, headers=req_headers, body=data), which issues a POST via urllib3's PoolManager. For now, we added a timeout in default_urllib3_manager: manager = pool_manager_cls(headers=headers, timeout=10.0, **kwargs). That fixed the hanging, though we now get many timeouts (we are still looking into this).
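
For reference, a standalone equivalent of that patch (a sketch; the 10-second value is just what we picked while experimenting, not a recommendation):

import urllib3

# Standalone form of our workaround: give the pool manager a request
# timeout so the POST in _http_request cannot block forever.
manager = urllib3.PoolManager(timeout=10.0)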

We are wondering if Dulwich could add a way for users to configure a timeout (in our case, via HttpGitClient's constructor, etc.), with whatever default timeout you prefer.

Thanks!

jelmer commented 4 years ago

You should already be able to set a timeout by passing in a custom urllib3 PoolManager into HttpGitClient that has the timeout option set. Open to talking about what we can do to make this sort of thing easier though.
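
Something along these lines should work (a sketch; it assumes HttpGitClient accepts a pool_manager keyword, as in current releases, and the URL is a placeholder):

import urllib3
from dulwich.client import HttpGitClient

# A PoolManager with explicit connect/read timeouts, handed to the client.
manager = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=10.0, read=10.0))
client = HttpGitClient('https://ghe_server.../Repo.git', pool_manager=manager)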

dragos-a commented 4 years ago

Thanks a lot for the suggestion. Passing a custom PoolManager worked for us, but we still have the underlying issue: the porcelain fetch below hangs (or times out, once we add a timeout) after a few iterations, and we would really appreciate your help. Note that the problem only happens when a file is changed (and committed) in the remote repo, independently of this fetch script. We checked, and a similar git-native fetch script (running "git fetch --all", please see attached) never hangs for the same repo.

We debugged the porcelain-fetch hang in client.py and, as in my original post, it hangs in HttpGitClient's fetch_pack method, specifically in the "git-upload-pack" POST (issued after the request body is prepared in the _handle_upload_pack_head method). In the attached file "data_and_headers.txt", we printed the headers (line 1 of the file) and the data (line 3 to the end of the file) of the "git-upload-pack" POST during which the fetch hangs or times out. The POST body is larger than 11 MB, mostly "haves" from our local repo.
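
In case it helps reproduce the measurement, a wrapper along these lines (a debugging sketch, not part of dulwich) can report the size of each request when passed to the client as its pool manager:

import urllib3

class LoggingPoolManager(urllib3.PoolManager):
    # Hypothetical debugging helper: print each request's method, URL and
    # body size before delegating to the real PoolManager.
    def request(self, method, url, *args, **kwargs):
        body = kwargs.get('body')
        size = len(body) if body is not None else 0
        print('%s %s: %d bytes' % (method, url, size))
        return super().request(method, url, *args, **kwargs)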

Could you please help us identify what we are doing wrong in our Dulwich script? Also, please let us know any further information that you may need. Thanks a lot!

data_and_headers.txt git_fetch.py.txt

import time

from dulwich import porcelain
from dulwich.repo import MemoryRepo

ITERS = 20
url = 'https://user:password@ghe_server.../Repo.git'

def run_fetches():
    # Repeatedly fetch into an in-memory repo, copying the fetched refs
    # into the repo after each round. The hang shows up within 20 iterations.
    repo = MemoryRepo()
    for _ in range(ITERS):
        print('Fetching.')
        fetch_result = porcelain.fetch(repo, remote_location=url)
        for k, v in list(fetch_result.refs.items()):
            repo.refs.set_if_equals(k, None, v)
        time.sleep(30)

run_fetches()
jelmer commented 4 years ago

The number of "haves" is surprising - does your local repository happen to have a lot of refs? If not, what does the revision graph look like - is there a large number of merge commits?

Can you reproduce the issue with just the script pasted above, or is it just an excerpt? If it's an excerpt, perhaps other things in the process are triggering the breakage? What does the server run, and is it spinning while these requests are being sent? Do the data and headers look similar if you use "git fetch --all" ?

dragos-a commented 4 years ago

Thanks a lot for your prompt reply! About the number of references in our local repo: it has 7348 branches and 7770 tags. The repo has 4740 merge commits (commits with two or more parents).
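
For completeness, this is roughly how such counts can be gathered with dulwich itself (a sketch; the path is illustrative, and the walk covers commits reachable from HEAD):

from dulwich.repo import Repo

# Count branches, tags and merge commits in a local clone.
repo = Repo('/path/to/Repo')
refs = repo.get_refs()
branches = [r for r in refs if r.startswith(b'refs/heads/')]
tags = [r for r in refs if r.startswith(b'refs/tags/')]
merges = sum(1 for entry in repo.get_walker()
             if len(entry.commit.parents) >= 2)
print(len(branches), len(tags), merges)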

Yes, we can reproduce the hang with exactly the script above (which uses HttpGitClient and MemoryRepo). For us, it hangs within the first 20 iterations. As detailed above, the issue only occurs when we update a file in the remote repo; if nothing changed, fetch_pack computes an empty "wants" list and returns immediately. If we instead use SSHGitClient (with MemoryRepo), fetches seem to work without hanging.

We are running GitHub Enterprise Server 2.20.8. Other independent processes are currently calling GitHub's REST APIs directly against this server. As you suggested, I ran "git fetch --all -vvv" after adding two commits on branches "test01" and "test02" (note that our internal test suites also run against this repo, so branches may be added or deleted). The output is in the attached "git_fetch_all.txt": the "git-upload-pack" POST data contains only a small number of "haves" compared to Dulwich's POST. git_fetch_all.txt

Thanks again for your help!

nanonyme commented 3 months ago

@jelmer this sounds a bit dangerous. By stating that the official way to handle timeouts is to pass in a urllib3 PoolManager, are you declaring that urllib3 is part of dulwich's public API?

jelmer commented 3 months ago

@nanonyme Which bit sounds dangerous? Dulwich doesn't expose a way to change the timeout and I don't see a good reason to, but you can interact with urllib3 directly.

nanonyme commented 3 months ago

Mainly that if the urllib3 PoolManager is part of your official API, switching to another HTTP library later will be an API break from the user's point of view. A silent one, too, since the then-unused kwarg would be swallowed without any warning.
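
A contrived illustration of the silent part (make_client is hypothetical, not a dulwich API):

# A keyword consumed only via **kwargs is silently ignored once the
# backend stops using it; callers get no error and no warning.
def make_client(url, **kwargs):
    return url  # imagine a future backend that ignores 'pool_manager'

make_client('https://example.org/repo.git', pool_manager=None)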

nanonyme commented 3 months ago

Timeouts are always good as long as they are conservative. Any communication over a network can hang even when the software is working correctly.