salu133445 / muspy

A toolkit for symbolic music generation
https://salu133445.github.io/muspy/
MIT License
435 stars 51 forks

download_and_extract failure #34

Open cifkao opened 3 years ago

cifkao commented 3 years ago

RemoteDataset(download_and_extract=True) fails if:

  1. the root directory does not exist, or
  2. the dataset has already been downloaded and extracted (so existing files would have to be overwritten).

In the first case, we can simply create the directory (it's only going to be used for this one dataset anyway). In the second case, we can skip extraction if the .muspy.success file exists. Or (to make sure the data is not corrupt) we could just check the files that exist before trying to overwrite them.
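A minimal sketch of the two proposed fixes (the helper name and return value are illustrative, not muspy's actual API):

```python
from pathlib import Path

def prepare_dataset_root(root):
    """Create the dataset root if missing and report whether a previous
    run already finished extracting (hypothetical helper)."""
    root = Path(root)
    # Fix for the first case: create the directory, since it is dedicated
    # to this one dataset anyway. A no-op if it already exists.
    root.mkdir(exist_ok=True)
    # Fix for the second case: .muspy.success is written only after a
    # complete extraction, so its presence means extraction can be skipped.
    return (root / ".muspy.success").exists()
```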

salu133445 commented 3 years ago

I am not in favor of the idea of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. Interactive prompts would be preferable in this case.

For the second case, the current implementation always downloads, extracts and overwrites the data. We could add an overwrite argument to prevent repeating this by looking for .muspy.success.

In some cases, the downloading or extraction process might be interrupted, so we might need to handle this as well. For example, if the existing archive size is incorrect, then we should remove it and redownload it. Also, if some of the files are already extracted, we could skip them, which can be done by modifying datasets.utils.extract_archive using TarFile.getnames with some existence checks.
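The skip-already-extracted idea could be sketched like this, assuming a plain .tar archive (the function name is hypothetical; the real datasets.utils.extract_archive handles more archive formats):

```python
import tarfile
from pathlib import Path

def extract_archive_resumable(archive_path, dest_dir):
    """Extract an archive, skipping members already present on disk
    (hypothetical sketch of the proposed resumable behaviour)."""
    dest_dir = Path(dest_dir)
    with tarfile.open(archive_path) as tar:
        # getnames() lists every member without extracting anything
        for name in tar.getnames():
            if (dest_dir / name).exists():
                continue  # already extracted, e.g. before an interruption
            tar.extract(name, path=dest_dir)
```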

cifkao commented 3 years ago

> I am not in favor of the idea of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. Interactive prompts would be preferable in this case.

But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).
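The pathlib behaviour being relied on here can be demonstrated directly: with parents=False, only the final path component may be created, so a typo with missing intermediate directories fails immediately instead of silently creating the whole chain.

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())

# Dataset root whose parent exists: created on the first call, and a
# harmless no-op on the second thanks to exist_ok=True.
root = base / "my_dataset"
root.mkdir(parents=False, exist_ok=True)
root.mkdir(parents=False, exist_ok=True)

# Typo with missing intermediate directories: parents=False refuses to
# create the chain, so the mistake surfaces as FileNotFoundError.
typo = base / "a" / "x" / "c" / "d" / "e"
typo_caught = False
try:
    typo.mkdir(parents=False, exist_ok=True)
except FileNotFoundError:
    typo_caught = True
```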

cifkao commented 3 years ago

And the issue I was having with my dataset is that the archive contained some read-only files, so creating the dataset a second time resulted in a PermissionError. Also, I can imagine a dataset existing in a shared location that is not writable.

cifkao commented 3 years ago

> In some cases, the downloading or extraction process might be interrupted, so we might need to handle this as well.

If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.

salu133445 commented 3 years ago

> But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).

Yup. Sounds good.

salu133445 commented 3 years ago

> If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.

That's right. Totally forgot that.

So what's not done yet is to have a boolean overwrite argument in Dataset.__init__ that is passed on to datasets.utils.extract_archive to control whether existing files should be skipped or overwritten.

salu133445 commented 3 years ago

datasets.utils.download_url now has an overwrite argument, but it's actually quite tricky to add it to datasets.utils.extract_archive.

salu133445 commented 3 years ago

Having download_and_extract check the existence of .muspy.success is not intuitive given the argument's name. Perhaps handier would be an argument like make_sure_exists, or download_and_extract="auto", which automatically checks whether the files exist and downloads/extracts them only if necessary.
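One way such an "auto" mode could behave, sketched with illustrative names (this is not muspy's actual API):

```python
from pathlib import Path

def resolve_actions(root, download_and_extract):
    """Decide whether downloading/extraction is needed (hypothetical sketch).

    Returns a (download, extract) pair of booleans.
    """
    if download_and_extract == "auto":
        # Skip work that demonstrably finished: a complete extraction
        # leaves the .muspy.success marker behind.
        needed = not (Path(root) / ".muspy.success").exists()
        return (needed, needed)
    # True/False keep their current all-or-nothing meaning.
    flag = bool(download_and_extract)
    return (flag, flag)
```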

cifkao commented 3 years ago

As for download: I think the old behaviour (i.e. what happens now with overwrite=False) was reasonable. What is there to gain by re-downloading the archive if it already exists and the checksum is correct?

Basically, there should be a way (preferably the default one) to create a dataset that:

  1. doesn't fail unnecessarily (e.g. because of permissions of files that already exist)
  2. doesn't perform unnecessary long-running tasks (downloading and extracting things that have already been downloaded and extracted)

And I think that checking that an archive is correctly extracted is itself an example of an unnecessary long-running task, especially if the dataset has already been converted to the MusPy format and you end up using the converted version, which you do not check either.
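The download-side skip argued for above amounts to a cheap check, assuming an MD5 checksum is available for the archive (the helper name is hypothetical):

```python
import hashlib
from pathlib import Path

def download_needed(archive_path, expected_md5):
    """Return True only if the archive is absent or fails its checksum
    (hypothetical sketch of the overwrite=False download behaviour)."""
    archive_path = Path(archive_path)
    if not archive_path.exists():
        return True
    # Re-download only when the existing file is corrupt or incomplete.
    actual = hashlib.md5(archive_path.read_bytes()).hexdigest()
    return actual != expected_md5
```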