jelmer / dulwich

Pure-Python Git implementation
https://www.dulwich.io/
Other
2.06k stars 396 forks source link

drop caching of full HTTP response #966

Closed jelmer closed 2 years ago

jelmer commented 2 years ago

dulwich currently caches the full HTTP response in memory when talking to remote servers. This obviously has both memory and CPU consequences. I attempted to revert this in 97814a9bb5294c4309c83603f261905d99dfc240, but that broke access to some HTTP servers.

Example URL that broke: https://forge.softwareheritage.org/source/swh-loader-core

I've reintroduced the change for now, but we should figure out why the BytesIO() call is necessary today and attempt to remove it again once we find out.

anlambert commented 2 years ago

I looked into this issue and it seems the reason data cannot be read when trying to discover references in the repository above is because the Accept-Encoding HTTP header is set to gzip when sending request to the /info/refs endpoint.

The following diff fixes the observed issue, all tests are still passing but not sure about the side effect of that change though.

diff --git a/dulwich/client.py b/dulwich/client.py
index 33742b0d..0566ec97 100644
--- a/dulwich/client.py
+++ b/dulwich/client.py
@@ -1932,7 +1932,7 @@ class AbstractHttpGitClient(GitClient):
         if self.dumb is not True:
             tail += "?service=%s" % service.decode("ascii")
         url = urljoin(base_url, tail)
-        resp, read = self._http_request(url, headers, allow_compression=True)
+        resp, read = self._http_request(url, headers)

         if resp.redirect_location:
             # Something changed (redirect!), so let's update the base URL
@@ -2236,7 +2236,7 @@ class Urllib3HttpGitClient(AbstractHttpGitClient):
             resp.redirect_location = resp_url if resp_url != url else ""
         # TODO(jelmer): Remove BytesIO() call that caches entire response in
         # memory. See https://github.com/jelmer/dulwich/issues/966
-        return resp, BytesIO(resp.data).read
+        return resp, resp.read

 HttpGitClient = Urllib3HttpGitClient

Hope that it will hep.

jelmer commented 2 years ago

Thanks, that's definitely helpful.