Open cifkao opened 3 years ago
I am not in favor of the idea of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like `a/x/c/d/e` can be hard to revert. Interactive prompts would be preferable in this case.
For the second case, the current implementation always downloads, extracts and overwrites the data. We could add an `overwrite` argument to prevent repeating this by looking for `.muspy.success`.
In some cases, the download or the extraction process might be interrupted, so we might need to handle this as well. For example, if the existing archive's size is incorrect, we should remove it and re-download it. Also, if some of the files are already extracted, we could skip them, which can be done by modifying `datasets.utils.extract_archive` using `TarFile.getnames` with some existence checks.
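A minimal sketch of what that modification could look like, assuming a simplified `extract_archive(path, root, overwrite)` signature (the actual MusPy function takes more parameters):

```python
# Sketch: skip members that already exist on disk unless overwrite=True.
# Function name follows the discussion; the real signature may differ.
import os
import tarfile


def extract_archive(path, root, overwrite=False):
    """Extract `path` into `root`, skipping already-extracted members."""
    with tarfile.open(path) as tar:
        for name in tar.getnames():
            target = os.path.join(root, name)
            if not overwrite and os.path.exists(target):
                continue  # already extracted from an earlier, interrupted run
            tar.extract(name, root)
```

With `overwrite=False`, a re-run after an interruption only touches the members that are still missing.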
> I am not in favor of the idea of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like `a/x/c/d/e` can be hard to revert. Interactive prompts would be preferable in this case.
But `mkdir(parents=False, exist_ok=True)` prevents typos like `a/x/c/d/e`. To me, it's reasonable to require the parent directory to exist, but not `root` itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).
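A quick demonstration of that behaviour (the directory name is hypothetical): `parents=False` still catches a mistyped path, while `exist_ok=True` tolerates the dataset directory already existing.

```python
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())

root = base / "dataset_root"               # hypothetical dataset directory
root.mkdir(parents=False, exist_ok=True)   # parent exists -> created
root.mkdir(parents=False, exist_ok=True)   # already exists -> no error

try:
    # the typo'd path has no existing parent, so this raises
    (base / "a/x/c/d/e").mkdir(parents=False, exist_ok=True)
except FileNotFoundError:
    print("typo caught: parent directory does not exist")
```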
And the issue I was having with my dataset is that the archive contained some read-only files, so creating the dataset a second time resulted in a `PermissionError`. Also, I can imagine a dataset existing in a shared location which is not writable.
> In some cases, the download or the extraction process might be interrupted, so we might need to handle this as well.
If the extraction process is interrupted, `.muspy.success` should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.
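That argument relies on the marker file being written only after extraction fully completes. A sketch of that ordering (helper names are assumptions, not the actual MusPy API):

```python
# Sketch: .muspy.success is written last, so its absence implies the
# previous extraction was interrupted and should be redone.
from pathlib import Path


def needs_extraction(root):
    """Return True unless a previous extraction completed successfully."""
    return not (Path(root) / ".muspy.success").exists()


def mark_success(root):
    """Write the marker only once extraction has fully finished."""
    (Path(root) / ".muspy.success").touch()
```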
> But `mkdir(parents=False, exist_ok=True)` prevents typos like `a/x/c/d/e`. To me, it's reasonable to require the parent directory to exist, but not `root` itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).
Yup. Sounds good.
> If the extraction process is interrupted, `.muspy.success` should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.
That's right. Totally forgot that.
So what's not done yet is to have a boolean argument `overwrite` in `Dataset.__init__` that is passed to `datasets.utils.extract_archive` to control whether existing files should be skipped or overwritten.
`datasets.utils.download_url` now has an `overwrite` argument, but it's actually quite tricky to add one to `datasets.utils.extract_archive`.
Checking the existence of `.muspy.success` for `download_and_extract` is not intuitive given its name. Perhaps what would be handier is an argument like `make_sure_exists` or `download_and_extract="auto"` that automatically checks whether the files exist and downloads/extracts them only if necessary.
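The proposed `"auto"` mode could dispatch roughly like this (a sketch only; the `resolve` helper and the action strings are hypothetical, not MusPy API):

```python
# Sketch: decide what to do for each download_and_extract setting.
def resolve(download_and_extract, has_success_marker):
    """Map a setting and the current on-disk state to an action."""
    if download_and_extract == "auto":
        # only download/extract when no completed run is recorded
        return "skip" if has_success_marker else "download+extract"
    if download_and_extract:          # True -> always redo
        return "download+extract"
    return "skip"                     # False/None -> never touch the data
```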
As for `download`: I think the old behaviour (i.e. what happens now with `overwrite=False`) was reasonable. What is there to gain by re-downloading the archive if it already exists and the checksum is correct?
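That skip condition can be sketched as follows (a simplified stand-in for the logic inside `download_url`; the function name and MD5 choice are assumptions):

```python
# Sketch: only re-download if forced, missing, or corrupt.
import hashlib
import os


def should_download(path, expected_md5, overwrite=False):
    """Return True if the archive must be (re-)downloaded."""
    if overwrite or not os.path.exists(path):
        return True
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    return digest != expected_md5  # checksum mismatch -> corrupt archive
```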
Basically, there should be a way (preferably the default one) to create a dataset that:
And I think that checking that an archive is correctly extracted can be an example of an unnecessarily long-running task, especially if the dataset is already converted to the MusPy format and you end up using the converted version, which you also do not check.
`RemoteDataset(download_and_extract=True)` fails if: (1) the root directory does not exist, or (2) the data is already there. In the first case, we can simply create the directory (it's only going to be used for this one dataset anyway). In the second case, we can skip extraction if the `.muspy.success` file exists. Or (to make sure the data is not corrupt) we could just check the files that exist before trying to overwrite them.
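The two cases can be handled together in a few lines (a sketch; `prepare_root` is a hypothetical helper, not the actual MusPy API):

```python
# Sketch: create the dataset directory if missing, and skip extraction
# when a previous run completed successfully.
from pathlib import Path


def prepare_root(root):
    """Return True if the data still needs to be downloaded/extracted."""
    root = Path(root)
    # case 1: directory does not exist -> create it (parent must exist)
    root.mkdir(parents=False, exist_ok=True)
    # case 2: data already there -> rely on the marker file to skip
    return not (root / ".muspy.success").exists()
```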