huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Issue: Dataset download error #2076

Open XuhuiZhou opened 3 years ago

XuhuiZhou commented 3 years ago

The download link in iwslt2017.py file does not seem to work anymore.

For example, FileNotFoundError: Couldn't find file at https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz

It would be nice if we could modify the script to use the new download link.

albertvillanova commented 3 years ago

Hi @XuhuiZhou, thanks for reporting this issue.

Indeed, the old links are no longer valid (404 Not Found error), and the script must be updated with the new links to Google Drive.
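
One way to confirm which links are dead is to request each URL and inspect the HTTP status code. The sketch below uses only the Python standard library; `url_status` and its injectable `opener` parameter are hypothetical names chosen for illustration and are not part of the `datasets` library.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for `url`, or None if no response arrives.

    `opener` is injectable so the helper can be exercised without network access.
    """
    # A HEAD request asks for the headers only, so no archive is downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered with an error status, e.g. 404 for a removed file.
        return err.code
    except OSError:
        # No HTTP response at all: DNS failure, refused connection, timeout, ...
        return None


# Usage (requires network access), with the URL quoted in the report:
# url_status("https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz")
```

A return value of 404 would match the "Not Found" error described above, while `None` would point to a connectivity problem rather than a removed file.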

lhoestq commented 3 years ago

It would indeed be nice to update the URLs!

To do this, you just need to replace the URLs in iwslt2017.py and then update the dataset_infos.json file with:

datasets-cli test ./datasets/iwslt2017 --all_configs --save_infos --ignore_verifications

XuhuiZhou commented 3 years ago

Does this command update my local files, or does it fix the file in the GitHub repo in general? (I am not very familiar with the datasets-cli command.)

I also took a brief look at the "Sharing your dataset" section. It looks like I could fix that locally and push it to the repo? I guess we are in the "canonical" category?

lhoestq commented 3 years ago

This command will update your local file. Then you can open a Pull Request to push your fix to the GitHub repo :) And yes, you are right: it is a "canonical" dataset, i.e. a dataset script defined in this GitHub repo (as opposed to the dataset repositories of users on the Hugging Face Hub).

XuhuiZhou commented 3 years ago

Hi, thanks for the answer.

I gave the problem a try today, but I encountered an upload error:

git push -u origin fix_link_iwslt
Enter passphrase for key '/home2/xuhuizh/.ssh/id_rsa': 
ERROR: Permission to huggingface/datasets.git denied to XuhuiZhou.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

Any insight here?

By the way, when I run the datasets-cli command, it shows the following error, which does not seem to come from iwslt2017.py:

Traceback (most recent call last):
  File "/home2/xuhuizh/anaconda3/envs/UMT/bin/datasets-cli", line 33, in <module>
    sys.exit(load_entry_point('datasets', 'console_scripts', 'datasets-cli')())
  File "/home2/xuhuizh/projects/datasets/src/datasets/commands/datasets_cli.py", line 35, in main
    service.run()
  File "/home2/xuhuizh/projects/datasets/src/datasets/commands/test.py", line 141, in run
    try_from_hf_gcs=False,
  File "/home2/xuhuizh/projects/datasets/src/datasets/builder.py", line 579, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home2/xuhuizh/projects/datasets/src/datasets/builder.py", line 639, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/home2/xuhuizh/projects/datasets/src/datasets/utils/info_utils.py", line 32, in verify_checksums
    raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
datasets.utils.info_utils.ExpectedMoreDownloadedFiles: {'https://wit3.fbk.eu/archive/2017-01-trnmted//texts/DeEnItNlRo/DeEnItNlRo/DeEnItNlRo-DeEnItNlRo.tgz'}

lhoestq commented 3 years ago

Hi! To create a PR on this repo you must fork it and create a branch on your fork. See how to fork the repo here. And to make the command work without the ExpectedMoreDownloadedFiles error, you just need to use the --ignore_verifications flag.
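
For context on why that error appears: the last frame of the traceback raises when a URL listed in the stored checksums is missing from the files that were actually downloaded. The following is a minimal Python sketch of that set-difference check, not the library's actual code. The function and exception are simplified stand-ins, the checksum values are placeholders, and `NEW_URL` is a hypothetical replacement link.

```python
class ExpectedMoreDownloadedFiles(Exception):
    """Simplified stand-in for the exception named in the traceback."""


def verify_checksums(expected_checksums, recorded_checksums):
    """Raise if any expected URL is missing from the recorded downloads."""
    missing = set(expected_checksums) - set(recorded_checksums)
    if missing:
        raise ExpectedMoreDownloadedFiles(str(missing))


# URL reported in the traceback; presumably still listed in dataset_infos.json.
OLD_URL = (
    "https://wit3.fbk.eu/archive/2017-01-trnmted//texts/"
    "DeEnItNlRo/DeEnItNlRo/DeEnItNlRo-DeEnItNlRo.tgz"
)
# Hypothetical placeholder for whatever new link the edited script downloads.
NEW_URL = "https://example.com/new-location/DeEnItNlRo-DeEnItNlRo.tgz"

# Expected entries come from the stored metadata; recorded entries come from
# the downloads performed by the edited script. Values are placeholders.
expected = {OLD_URL: {"num_bytes": 0, "checksum": "placeholder"}}
recorded = {NEW_URL: {"num_bytes": 0, "checksum": "placeholder"}}

try:
    verify_checksums(expected, recorded)
    error_message = None
except ExpectedMoreDownloadedFiles as err:
    error_message = str(err)

print(error_message)
```

Under these assumptions the old URL is expected but never downloaded, so the check fails until the stored metadata is regenerated or verification is skipped.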

albertvillanova commented 3 years ago

Hi @XuhuiZhou,

As @lhoestq has explained well, you need to fork HF's repository, create a feature branch in your fork, push your changes to it, and then open a Pull Request to HF's upstream repository. This is because at Hugging Face Datasets we follow a development model called the "Fork and Pull Model". You can find more information here:

Alternatively, if you find all these steps too complicated, you can use the official GitHub command-line tool: GitHub CLI. Once it is installed, you only need this command to create a Pull Request:

gh pr create --web

This utility will automatically create the fork, push your changes and open a Pull Request, under the hood.