Open smspillaz opened 3 months ago
I see, thanks for flagging. Yes, this is a known issue: we fetch credentials / metadata for each shard access. We're keeping an eye on this and may pursue a permanent solution. Your current workaround seems like a possible route too.
@snarayan21, do you have any temporary solution in mind? And what permanent solution are you thinking of? Is it wrapping all the download functions per cloud provider in a class?
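For illustration, one shape such a per-provider wrapper could take (a hedged sketch; `CloudDownloader` and `GCSDownloader` are hypothetical names, not existing streaming APIs):

```python
from abc import ABC, abstractmethod


class CloudDownloader(ABC):
    """Holds a long-lived client so credentials are not re-fetched per shard."""

    @abstractmethod
    def download(self, remote: str, local: str) -> None:
        ...


class GCSDownloader(CloudDownloader):

    def __init__(self) -> None:
        from google.cloud import storage

        # The client (and the credentials behind it) is created once per
        # process and reused, instead of once per downloaded shard.
        self._client = storage.Client()

    def download(self, remote: str, local: str) -> None:
        # remote is expected to look like 'gs://bucket/path/to/shard'.
        bucket_name, _, blob_name = remote.removeprefix('gs://').partition('/')
        self._client.bucket(bucket_name).blob(blob_name).download_to_filename(local)
```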
@rishabhm12 @smspillaz Wondering if either of you is interested in making a contribution?
Environment
To reproduce
Steps to reproduce the behavior:
1. Use `torchrun` to launch multiple processes (e.g., 4)
2. Use a `StreamingDataset` in a `DataLoader` with many worker processes (e.g., 4)

Because https://github.com/mosaicml/streaming/blob/main/streaming/base/storage/download.py#L235 queries the metadata service every time it is invoked in order to get credentials, doing this from multiple sub-processes at the same time can overload the service and exhaust the available connections, resulting in this warning:
Backoff inside of `google-auth` doesn't appear to add any jitter (it's just exponential), so if the worker subprocesses are running roughly in sync, this eventually fails even if we increase the timeout as specified here.

In principle we should not have to query the metadata service every time we need credentials. They are short-lived, but the `google.auth.compute_engine.GCECredentials` object provides an `expired` property (https://google-auth.readthedocs.io/en/master/reference/google.auth.compute_engine.html#module-google.auth.compute_engine), so it should be possible to cache the retrieved credentials for a given project ID and only refresh them when needed. In our case, we are monkey-patching the function to do the same thing:
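A minimal sketch of this kind of caching patch, assuming the patched function is `download_from_gcs` from the linked module (its exact signature, and any HMAC/service-account branches it may have, are omitted here):

```python
# Hedged sketch: cache credentials at module level and reuse them across shard
# downloads, refreshing only when they have expired. The patched function name
# and internals are assumptions based on streaming/base/storage/download.py.
import threading

import google.auth
import google.auth.transport.requests
from google.cloud import storage

_creds_lock = threading.Lock()
_cached_creds = None
_cached_project = None


def _cached_default():
    """Like google.auth.default(), but only hits the metadata service on the
    first call and when the cached credentials have expired."""
    global _cached_creds, _cached_project
    with _creds_lock:
        if _cached_creds is None:
            _cached_creds, _cached_project = google.auth.default()
        if not _cached_creds.valid:
            # Short-lived credentials: refresh instead of re-resolving default().
            _cached_creds.refresh(google.auth.transport.requests.Request())
    return _cached_creds, _cached_project


def patched_download_from_gcs(remote: str, local: str) -> None:
    """Hypothetical replacement that builds the client from cached credentials."""
    creds, project = _cached_default()
    client = storage.Client(project=project, credentials=creds)
    bucket_name, _, blob_name = remote.removeprefix('gs://').partition('/')
    client.bucket(bucket_name).blob(blob_name).download_to_filename(local)


# Applying the patch (module path per the link above):
import streaming.base.storage.download as streaming_download
streaming_download.download_from_gcs = patched_download_from_gcs
```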
Expected behavior

Shards can be fetched from GCS without making too many concurrent queries to the metadata service. Probably the fix here is to cache and refresh the credentials in the same way, though it's unclear to me where the caching should happen.
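One possible placement (an assumption, not a confirmed plan) is a module-level cache inside `streaming/base/storage/download.py`, guarded by a lock as in the sketch above. Since DataLoader worker processes don't share memory, the cache would be per-process, but that still cuts the load from one metadata query per shard to roughly one per worker process, plus occasional refreshes on expiry.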