pytorch / text

Models, data loaders and abstractions for language processing, powered by PyTorch
https://pytorch.org/text
BSD 3-Clause "New" or "Revised" License

Hashcode for test fold of Multi30k corrupt #2154

Closed fmohr closed 1 year ago

fmohr commented 1 year ago

🐛 Bug

Bug Description When loading the test data via

from torchtext.datasets import Multi30k
multi_datapipe = Multi30k(split="test")

I get the following error (it occurs only on the test split). It seems that the hash currently registered for the tar file does not match the hash of the actual tar file on the server.

RuntimeError: The computed hash 0681be16a532912288a91ddd573594fbdd57c0fbb81486eff7c55247e35326c2 of ~/.cache/torch/text/datasets/Multi30k/mmt16_task1_test.tar.gz does not match the expectedhash 6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36. Delete the file manually and retry.

Needless to say, I deleted the file manually (in fact, it was deleted automatically by a script).

Expected Behavior I would expect this to work just as it does for split="train" or split="valid".

Environment torchtext version is 0.14.1 (the environment collection script linked in the template returns a 404).

Nayef211 commented 1 year ago

Hey @fmohr. We actually updated the expected hash of the file, along with the location the file is downloaded from, in https://github.com/pytorch/text/pull/2003. So the behavior you are seeing is actually correct, since you had an outdated copy of the file in your cache. The expected resolution would be to delete the cached file manually?
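
For anyone who wants to check the cache before retrying, here is a minimal sketch of that resolution. It assumes the hashes in the error message are SHA-256 digests (their 64 hex characters suggest so) and uses the cache path and expected hash exactly as quoted in the error above. The helper names `sha256_of_file` and `remove_if_stale` are hypothetical and not part of torchtext.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def remove_if_stale(path: Path, expected_hash: str) -> bool:
    """Delete `path` if it exists and its digest differs from `expected_hash`.

    Returns True if a file was deleted, False otherwise.
    """
    if not path.is_file():
        return False  # nothing cached, nothing to do
    if sha256_of_file(path) == expected_hash.lower():
        return False  # cached copy already matches
    path.unlink()
    return True


# Values copied from the error message in this issue (assumed to be SHA-256).
CACHED_ARCHIVE = Path(
    "~/.cache/torch/text/datasets/Multi30k/mmt16_task1_test.tar.gz"
).expanduser()
EXPECTED_HASH = "6d1ca1dba99e2c5dd54cae1226ff11c2551e6ce63527ebb072a1f70f72a5cd36"
```

Calling `remove_if_stale(CACHED_ARCHIVE, EXPECTED_HASH)` before `Multi30k(split="test")` should force a fresh download when the cached copy is outdated. If a freshly downloaded archive still produces a different digest, the cache is probably not the cause and deleting the file again will not help.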