pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Unable to download IWSLT datasets #1676

Open adzcai opened 2 years ago

adzcai commented 2 years ago

🐛 Bug

Describe the bug: Unable to download the IWSLT2016 or IWSLT2017 datasets.

To Reproduce: Steps to reproduce the behavior:

from torchtext.datasets import IWSLT2016

# Build the train/valid/test datapipes and request the first (src, tgt)
# pair; the failure occurs during the download step.
train, valid, test = IWSLT2016()
src, tgt = next(iter(train))

The same error occurs when trying to use IWSLT2017.

Expected behavior: The program returns the next (src, tgt) pair in the training data.

Screenshots: Full error logs are in this gist.

Environment: Included in the gist above.

Additional context: No additional context.

adzcai commented 2 years ago

As a temporary fix, I'm just downloading the datasets manually via the links in the documentation.

You can then put the downloaded .tgz file into the proper directory: ~/.torchtext/cache/IWSLT2016/ for 2016, and similarly for 2017.

torchtext will then recognize the files and skip the download from Google Drive.
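
For reference, a minimal sketch of this workaround in Python (the archive name 2016-01.tgz is my assumption based on the documentation links; use whatever filename the dataset code actually expects):

import shutil
from pathlib import Path

# Default torchtext cache location for IWSLT2016 (see above).
cache_dir = Path.home() / ".torchtext" / "cache" / "IWSLT2016"
cache_dir.mkdir(parents=True, exist_ok=True)

# Copy the manually downloaded archive into the cache; '2016-01.tgz' is
# an assumed filename and must match what torchtext looks for.
shutil.copy("2016-01.tgz", cache_dir / "2016-01.tgz")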

austinvhuang commented 2 years ago

Original comment by @austinvhuang: I've run into this as well. Given the download problem in 0.11, maybe download checks could be part of CI or integration tests?

Response by @parmeet below (sorry @austinvhuang, I meant to reply earlier but somehow ended up editing your original comment):

Duplicate of https://github.com/pytorch/text/issues/1620. Yes, we have this ongoing issue with torchtext <=0.11. Could you please upgrade to 0.12 or try the temporary fix suggested here: https://github.com/pytorch/text/issues/1676#issuecomment-1091071655.

maybe download checks could be part of CI or integration tests?

We had full download testing earlier, but moved to mocked testing, as explained in issue #1493.

parmeet commented 2 years ago

Apparently torchtext 0.12 also has this download issue; only the error message differs. Looking into it, the underlying error is the same as the one found in 0.11, namely Internal error: confirm_token was not found in Google drive link. I think the reason it shows up as Internal error: headers don't contain content-disposition. in version 0.12 (which builds datasets using datapipes) is that this check was removed from GdriveReader for the case when confirm_token is None, so execution falls through to the next error message here, which is what we see above.
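
For anyone following along, here is a hedged sketch of the Google Drive confirm-token flow under discussion (illustrative only, not torchtext's or torchdata's actual code): large files trigger a virus-scan interstitial page, and the confirm token from its cookies is needed to reach the real file, whose response carries a content-disposition header.

import requests

def gdrive_download(file_id: str) -> requests.Response:
    """Sketch of the confirm-token handshake for Google Drive downloads."""
    url = "https://drive.google.com/uc"
    session = requests.Session()
    response = session.get(url, params={"id": file_id}, stream=True)

    # The virus-scan interstitial sets a 'download_warning*' cookie whose
    # value is the confirm token; a direct file response sets no such cookie.
    confirm_token = next(
        (v for k, v in response.cookies.items() if k.startswith("download_warning")),
        None,
    )
    if confirm_token is not None:
        response = session.get(
            url, params={"id": file_id, "confirm": confirm_token}, stream=True
        )

    # This is the check being discussed: without content-disposition we did
    # not get an actual file back, only an HTML page.
    if "content-disposition" not in response.headers:
        raise RuntimeError("Internal error: headers don't contain content-disposition.")
    return response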

cc: @Nayef211, @NivekT, @ejguan

ejguan commented 2 years ago

@parmeet Even though the root cause of this error is unknown to me, do you think we could align the error between the two versions of TorchText? These OnlineReader datapipes could take extra keyword arguments and pass them to the request function to achieve the same behavior.

parmeet commented 2 years ago

do you think we could align the error between the two versions of TorchText?

I think one way to achieve this would be to go through the same error tracing as provided in the torchtext download hook for Google Drive. I am not exactly sure why this error message was removed from the implementation in GDriveReader here when confirm_token is None.

ejguan commented 2 years ago

I am not exactly sure why this error message was removed from the implementation in GDriveReader here when confirm_token is None.

I can't find the reason via git blame, as the actual commit was buried in a squashed commit. But I think it's reasonable to add it back to the function in TorchData.

NivekT commented 2 years ago

do you think we could align the error between the two versions of TorchText?

I think one way to achieve this would be to go through the same error tracing as provided in the torchtext download hook for Google Drive. I am not exactly sure why this error message was removed from the implementation in GDriveReader here when confirm_token is None.

I was under the impression that even when confirm_token is None, the download can still be valid and work as intended; hence why #1620 was resolved. Is that incorrect? If that is true, we should not add that check back into TorchData.

NivekT commented 2 years ago

@parmeet @Nayef211 Do we know what causes Internal error: headers don't contain content-disposition? And is it possible for the download to complete successfully even if the headers don't contain content-disposition?

ejguan commented 2 years ago

do you think we could align the error between the two versions of TorchText?

I think one way to achieve this would be to go through the same error tracing as provided in the torchtext download hook for Google Drive. I am not exactly sure why this error message was removed from the implementation in GDriveReader here when confirm_token is None.

I was under the impression that even when confirm_token is None, the download can still be valid and work as intended; hence why #1620 was resolved. Is that incorrect? If that is true, we should not add that check back into TorchData.

@NivekT Thank you for pointing it out!!

Do we know what causes Internal error: headers don't contain content-disposition? And is it possible for the download to complete successfully even if the headers don't contain content-disposition?

@parmeet Does it mean the file does not exist on Google Drive if content-disposition is not present in the response? We may need to elaborate the error detail at https://github.com/pytorch/data/blob/c1d89fe9a1b06e610f32f823359771557b1ca12a/torchdata/datapipes/iter/load/online.py#L90

lolzballs commented 2 years ago

Hi, I just wanted to point out that there seems to be another issue at play here. With pytorch/data#442, we can get past the content-disposition error described in the previous comments. But there is still a problem with the dataset loading, as it eventually times out with the following message:

Exception: OnDiskCache Exception: data/IWSLT2017/IWSLT2017/2017-01-trnmted/texts/DeEnItNlRo/DeEnItNlRo/DeEnItNlRo-DeEnItNlRo/train.en-de.en expected to be written by different process, but file is not ready in 300 seconds.
This exception is thrown by __iter__ of MapperIterDataPipe()

I did some debugging, and it seems related to the nested caching in _filter_clean_cache. Removing the on_disk_cache and the respective end_caching calls allowed it to load fine, but I don't think that's a proper solution to the problem.

When I dug a bit deeper, I found that the load_from_tar pipe in _filter_clean_cache never gets iterated over if the on_disk_cache is used, but I'm not really sure where to go from here. It does sound like a problem in torchdata rather than torchtext, though...

Nayef211 commented 2 years ago

@lolzballs the caching issue you just mentioned seems to be related to https://github.com/pytorch/text/issues/1735.

cc @parmeet @VitalyFedyunin I wonder if this is caused by the cache inconsistency issue you mentioned here https://github.com/pytorch/text/issues/1735#issuecomment-1137723096 when using filters in our dataset logic.

lolzballs commented 2 years ago

@Nayef211 thanks, it does sound like exactly what I'm observing with IWSLT.

But I tried what was suggested in #1735 (note the order of end_caching here versus in the original code):

def _filter_clean_cache(cache_decompressed_dp, full_filepath, uncleaned_filename):
    # Inner on-disk cache keyed on the final (cleaned) file path.
    cache_inner_decompressed_dp = cache_decompressed_dp.on_disk_cache(
        filepath_fn=partial(_return_full_filepath, full_filepath)
    )
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.open_files(mode="b").load_from_tar()
    # Note: end_caching is placed before the filter here, unlike the original code.
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.end_caching(mode="wb", same_filepath_fn=True)
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.filter(partial(_filter_filename_fn, uncleaned_filename))
    cache_inner_decompressed_dp = cache_inner_decompressed_dp.map(partial(_clean_files_wrapper, full_filepath))
    return cache_inner_decompressed_dp

I still get the same behaviour: the inner load_from_tar() never gets iterated over.

parmeet commented 2 years ago

Thanks @Nayef211, @lolzballs. I have also started seeing this issue, but only sporadically. Unfortunately, the error is not reproducible.

parmeet commented 2 years ago

Also, I am not very clear on what the timeout here is really doing. Per the doc (Integer value of seconds to wait for uncached item to be written to disk), it seems to be the time to wait for the file to be downloaded. But reducing it to a very small value (1 second) doesn't change anything for me, even though the download takes longer. I suspect it has something to do with file locks?

Also, I wonder if this and the issue https://github.com/pytorch/text/issues/1747 are somehow linked?

cc: @VitalyFedyunin

lolzballs commented 2 years ago

The error is not reproducible unfortunately.

Interesting, so it seems this issue may be a regression? I was able to consistently reproduce the error using the latest main branches of both torchtext and torchdata if I removed the IWSLT cache folder (I would recommend removing just IWSLT2017/2017-01-trnmted; that leaves the original tar so we don't hit the Google Drive quota from downloading it too much). But it seems to be fine with torchtext 0.12.0 and torchdata 0.3.0.

One other thing I should mention: I found that when the timeout happens, it leaves behind a train.en-de.en.promise file. When I removed only this promise file instead of the whole directory, the dataset loaded successfully.
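
That behaviour is consistent with how the on-disk cache coordinates across processes. As a rough sketch (assumed behaviour, not torchdata's actual implementation), a process that finds an existing <file>.promise marker assumes another process is writing the file and polls until the marker disappears, raising after the timeout; a stale promise file left behind by a crashed or timed-out run would then block every subsequent run until it is deleted:

import os
import time

def wait_for_cached_file(filepath: str, timeout: int = 300) -> None:
    # The '.promise' marker signals that some process has claimed this file
    # and is (supposedly) still writing it.
    promise_file = filepath + ".promise"
    start = time.time()
    while os.path.exists(promise_file):
        if time.time() - start > timeout:
            raise Exception(
                f"OnDiskCache Exception: {filepath} expected to be written by "
                f"different process, but file is not ready in {timeout} seconds."
            )
        time.sleep(0.01)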

VitalyFedyunin commented 2 years ago

@lolzballs:

But I tried what was suggested in #1735 with the reordered end_caching (see the code above). I still get the same behaviour: the inner load_from_tar() never gets iterated over.

This could be a situation where locks from previous runs (with the mispositioned filter) remained in the folder. Can you please clean it up and try again? If it still fails, a minimal reproducible example would really help me debug the issue.

VitalyFedyunin commented 2 years ago

@parmeet:

Also, I am not very clear on what the timeout here is really doing. [...] I suspect it has something to do with file locks? Also, I wonder if this and issue #1747 are somehow linked?

I agree that the message is cryptic for errors that are not timeouts. I will change it to some sort of diagnosis URL to help users figure out whether the pipeline is bad and there are real errors.

lolzballs commented 2 years ago

@VitalyFedyunin

This could be a situation where locks from previous runs (with the mispositioned filter) remained in the folder. Can you please clean it up and try again? If it still fails, a minimal reproducible example would really help me debug the issue.

What I ran is pretty much taken from the docs. I tested again today with torchtext cb8475ed18 and torchdata cd3892790 and still get the same timeout. Based on the same commit, I also changed the order of the filter as posted above, with no luck there. I've always removed the cache (in my case data/IWSLT2017) before rerunning the test.

If it helps, this is all done on an Arch Linux system. I'm not sure whether it might be platform-dependent.

ImahnShekhzadeh commented 1 year ago

Status today: with torchtext version 0.15.2+cpu, it is still not possible to download the IWSLT datasets.

requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://drive.google.com/uc?id=12ycYSzLIG253AFN35Y6qoyf9wtkOjakp
This exception is thrown by __iter__ of GDriveReaderDataPipe(skip_on_error=False, source_datapipe=OnDiskCacheHolderIterDataPipe, timeout=None)

PyTorch version: 2.0.1