Open XuhuiZhou opened 3 years ago
Hi @XuhuiZhou, thanks for reporting this issue.
Indeed, the old links are no longer valid (404 Not Found error), and the script must be updated with the new links to Google Drive.
It would indeed be nice to update the URLs!
To do this, you just need to replace the URLs in iwslt2017.py and then update the dataset_infos.json file with:
datasets-cli test ./datasets/iwslt2017 --all_configs --save_infos --ignore_verifications
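Before regenerating dataset_infos.json, it may help to confirm that each replacement URL actually resolves. Here is a minimal sketch using only the Python standard library. The helper name url_status and its opener parameter are illustrative assumptions, not part of datasets, and some hosts may reject HEAD requests, so treat the result as a hint rather than proof.

```python
import urllib.error
import urllib.request


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request to `url`.

    `opener` is injectable so the function can be exercised without
    network access (hypothetical helper, not a datasets API).
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the status
        # code is available on the exception itself.
        return err.code
```

Calling url_status on the old wit3.fbk.eu link from the issue would be expected to return 404, as reported above; a replacement link should return 200 before it goes into iwslt2017.py.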
Is this a command to update my local files, or does it fix the file in the GitHub repo in general? (I am not so familiar with the datasets-cli command here.)
I also took a brief look at the "Sharing your dataset" section. It looks like I could fix that locally and push it to the repo? I guess this dataset is in the "canonical" category?
This command will update your local file. Then you can open a Pull Request to push your fix to the GitHub repo :) And yes, you are right: it is a "canonical" dataset, i.e. a dataset script defined in this GitHub repo (as opposed to dataset repositories of users on the Hugging Face Hub).
Hi, thanks for the answer.
I gave the problem a try today, but I encountered an error when pushing:
git push -u origin fix_link_iwslt
Enter passphrase for key '/home2/xuhuizh/.ssh/id_rsa':
ERROR: Permission to huggingface/datasets.git denied to XuhuiZhou.
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
Any insight here?
By the way, when I run the datasets-cli command, it shows the following error, which does not seem to come from iwslt2017.py:
Traceback (most recent call last):
  File "/home2/xuhuizh/anaconda3/envs/UMT/bin/datasets-cli", line 33, in <module>
    sys.exit(load_entry_point('datasets', 'console_scripts', 'datasets-cli')())
  File "/home2/xuhuizh/projects/datasets/src/datasets/commands/datasets_cli.py", line 35, in main
    service.run()
  File "/home2/xuhuizh/projects/datasets/src/datasets/commands/test.py", line 141, in run
    try_from_hf_gcs=False,
  File "/home2/xuhuizh/projects/datasets/src/datasets/builder.py", line 579, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/home2/xuhuizh/projects/datasets/src/datasets/builder.py", line 639, in _download_and_prepare
    self.info.download_checksums, dl_manager.get_recorded_sizes_checksums(), "dataset source files"
  File "/home2/xuhuizh/projects/datasets/src/datasets/utils/info_utils.py", line 32, in verify_checksums
    raise ExpectedMoreDownloadedFiles(str(set(expected_checksums) - set(recorded_checksums)))
datasets.utils.info_utils.ExpectedMoreDownloadedFiles: {'https://wit3.fbk.eu/archive/2017-01-trnmted//texts/DeEnItNlRo/DeEnItNlRo/DeEnItNlRo-DeEnItNlRo.tgz'}
Hi! To create a PR on this repo you must fork it and create a branch on your fork. See how to fork the repo here.
And to make the command work without the ExpectedMoreDownloadedFiles error, you just need to use the --ignore_verifications flag.
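To see why the error appears in the first place: the last frame of the traceback raises it whenever the set of expected URLs minus the set of recorded URLs is non-empty. The sketch below reproduces just that set difference. It assumes the expected side comes from the existing dataset_infos.json (still listing the old URL) and the recorded side from the edited script. NEW_URL is a hypothetical placeholder, and the empty dicts stand in for the real size and checksum entries.

```python
def missing_downloads(expected_checksums, recorded_checksums):
    """URLs that were expected but not downloaded in this run.

    Mirrors the expression in the traceback:
    set(expected_checksums) - set(recorded_checksums)
    """
    return set(expected_checksums) - set(recorded_checksums)


# Old URL, copied verbatim from the error message above.
OLD_URL = (
    "https://wit3.fbk.eu/archive/2017-01-trnmted//texts/"
    "DeEnItNlRo/DeEnItNlRo/DeEnItNlRo-DeEnItNlRo.tgz"
)
# Hypothetical placeholder for whatever replacement link is used.
NEW_URL = "https://example.invalid/new-location/DeEnItNlRo-DeEnItNlRo.tgz"

expected = {OLD_URL: {}}  # what the stale metadata is assumed to list
recorded = {NEW_URL: {}}  # what the edited script actually downloaded

missing = missing_downloads(expected, recorded)
if missing:
    print("Would raise ExpectedMoreDownloadedFiles for:", missing)
```

With --ignore_verifications this check is skipped, and --save_infos then rewrites the metadata so that it lists the new URLs.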
Hi @XuhuiZhou,
As @lhoestq has explained, you need to fork HF's repository, create a feature branch in your fork, push your changes to it and then open a Pull Request to HF's upstream repository. This is because Hugging Face Datasets follows a development model called the "Fork and Pull Model". You can find more information here:
Alternatively, if you find all these steps too complicated, you can use GitHub's official command-line tool: GitHub CLI. Once it is installed, you only need this command to create a Pull Request:
gh pr create --web
This utility will automatically create the fork, push your changes and open a Pull Request, under the hood.
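For the manual route, the push failed earlier because origin points at huggingface/datasets, where contributors have no write access. A rough sketch of the git commands involved is below, built as a dry run that only prints them. It assumes the fork was already created on GitHub and kept the name datasets, that the branch fix_link_iwslt already exists locally, and that the extra remote is called fork; the helper name and the remote name are illustrative choices.

```python
import shlex


def fork_and_pull_commands(github_user, branch, remote_name="fork"):
    """Return the git commands that push `branch` to a personal fork.

    Hypothetical helper: it assumes the fork is named `datasets` and
    is reachable over SSH, as in the push attempt shown earlier.
    """
    fork_url = f"git@github.com:{github_user}/datasets.git"
    return [
        # Register the personal fork as an additional remote.
        ["git", "remote", "add", remote_name, fork_url],
        # Push the existing local branch to the fork and track it.
        ["git", "push", "-u", remote_name, branch],
    ]


commands = fork_and_pull_commands("XuhuiZhou", "fix_link_iwslt")
for command in commands:
    # Dry run: print each command instead of executing it.
    print(shlex.join(command))
```

After the push, the Pull Request itself is opened from the fork's page on GitHub against the upstream repository.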
The download link in the iwslt2017.py file does not seem to work anymore. For example:
FileNotFoundError: Couldn't find file at https://wit3.fbk.eu/archive/2017-01-trnted/texts/zh/en/zh-en.tgz
It would be nice if we could modify the script to use the new download link.