mir-dataset-loaders / mirdata

Python library for working with Music Information Retrieval datasets
https://mirdata.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License

Support remote filepaths #462

Open rabitt opened 3 years ago

rabitt commented 3 years ago

When data is in a remote storage location (e.g. AWS, Google Cloud Storage, Google Drive), using mirdata becomes clunky, and we've overloaded the API to support it. It would be great to simplify the API and support remote file paths directly instead.

I propose we start by directly supporting Google Cloud Storage filepaths (gs://...) so they work seamlessly.

cc @drubinstein @psobot

drubinstein commented 3 years ago

I forget what the current interface looks like, but maybe it'd be better if the user is responsible for providing a callback that handles finding the file and downloading it (or reading it into a BytesIO/StringIO), and mirdata handles the deserialization. There will always be another way to download a file.

Supporting both local and remote file locations will probably massively increase the library's Python dependencies, and could probably be a library of its own. The easiest way to do it, though, would be to use the protocol to determine how to fetch the file, e.g.:

```python
if path.startswith("gs://"):
    ...
elif path.startswith("s3://"):
    ...
elif path.startswith("file://"):
    ...
```

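The protocol-based dispatch described above could be sketched with the standard library's `urllib.parse`; the function and backend names here are hypothetical, and the remote branches are stubs where a real implementation would call the `google-cloud-storage` or `boto3` clients:

```python
from urllib.parse import urlparse


def fetch_bytes(path: str) -> bytes:
    """Pick a fetch backend based on the URL scheme (sketch only)."""
    parsed = urlparse(path)
    if parsed.scheme == "gs":
        # a real implementation would use the google-cloud-storage client
        raise NotImplementedError("gs:// fetching not wired up")
    elif parsed.scheme == "s3":
        # a real implementation would use boto3
        raise NotImplementedError("s3:// fetching not wired up")
    elif parsed.scheme in ("file", ""):
        # local path, with or without a file:// prefix
        local = parsed.path if parsed.scheme == "file" else path
        with open(local, "rb") as f:
            return f.read()
    raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
```

For example, `fetch_bytes("file:///tmp/meta.json")` and `fetch_bytes("/tmp/meta.json")` both hit the local branch, while a `gs://` path would be routed to the cloud backend.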
rabitt commented 3 years ago

> I forget what the current interface looks like, but maybe it'd be better if the user is responsible for giving a callback that handles finding the file and downloading it (or giving it to a Bytes/StringIO) and mirdata handles the deserialization. There will always be another way to download a file.

The idea is that the user shouldn't have to care about filepaths at all once they've set data_home. So internally when mirdata needs to access e.g. a metadata file on gcs, it handles downloading/opening it. We do this now already, but it only works for local files.
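The "user only sets data_home" idea could look roughly like this; note this is an illustrative sketch, not mirdata's actual `Track` API, and the remote branch is a stub where the library would transparently download the file:

```python
class Track:
    """Sketch: all file access is resolved relative to data_home,
    whether data_home is a local directory or a remote URL."""

    def __init__(self, data_home: str, metadata_rel_path: str):
        self.data_home = data_home
        self.metadata_rel_path = metadata_rel_path

    @property
    def metadata_path(self) -> str:
        # plain string join, so remote roots like gs://bucket/dataset work too
        return self.data_home.rstrip("/") + "/" + self.metadata_rel_path

    def load_metadata(self) -> bytes:
        if "://" not in self.metadata_path:
            # local path: open directly
            with open(self.metadata_path, "rb") as f:
                return f.read()
        # a remote scheme (gs://, s3://, ...) would be downloaded/opened here
        raise NotImplementedError(
            f"remote access not implemented for {self.metadata_path}"
        )
```

The user-facing call (`track.load_metadata()`) is identical either way; only the internal resolution changes with the scheme of `data_home`.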

drubinstein commented 3 years ago

You could write a set of callbacks and let the user specify which one to use: the local-file one, or maybe a GCS one if you want to add it. This would also let users plug in their own downloader if needed, instead of restricting them to the ones you support. My thought is that instead of focusing on supporting "all remote filepaths," it may be more useful to first focus on making the interface extensible.
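An extensible callback interface along these lines could be a registry keyed by URL scheme; the names (`register_downloader`, `download`) are hypothetical, not mirdata's API:

```python
from typing import Callable, Dict
from urllib.parse import urlparse

# Downloader callbacks keyed by URL scheme. Built-ins register here,
# and users can register their own backends for schemes the library
# doesn't ship support for.
DOWNLOADERS: Dict[str, Callable[[str], bytes]] = {}


def register_downloader(scheme: str):
    """Decorator registering a callback that fetches paths with `scheme`."""
    def wrap(fn: Callable[[str], bytes]) -> Callable[[str], bytes]:
        DOWNLOADERS[scheme] = fn
        return fn
    return wrap


@register_downloader("file")
def _local_downloader(path: str) -> bytes:
    # handles both "file:///x/y" and bare "/x/y" paths
    parsed = urlparse(path)
    with open(parsed.path or path, "rb") as f:
        return f.read()


def download(path: str) -> bytes:
    """Look up the downloader for the path's scheme and invoke it."""
    scheme = urlparse(path).scheme or "file"
    if scheme not in DOWNLOADERS:
        raise ValueError(f"no downloader registered for scheme {scheme!r}")
    return DOWNLOADERS[scheme](path)
```

A user wanting GCS support would then just decorate their own function with `@register_downloader("gs")`, without any change to the library itself.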

A "universal" downloader sounds like an independent project, and could perhaps be another Python package in this organization that this repo depends on.

drubinstein commented 3 years ago

An alternative to writing an in-house library is to use an existing one such as cloudstorage or Apache Libcloud, which seem to fit this use case.