huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[Creating new dataset] Not found dataset_info.json #239

Closed richarddwang closed 4 years ago

richarddwang commented 4 years ago

Hi, I am trying to create the Toronto Book Corpus dataset (#131).

From ~/nlp I ran python nlp-cli test datasets/bookcorpus --save_infos --all_configs, but instead of creating dataset_info.json it tries to load it:

INFO:nlp.load:Checking datasets/bookcorpus/bookcorpus.py for additional imports.
INFO:filelock:Lock 139795325778640 acquired on datasets/bookcorpus/bookcorpus.py.lock
INFO:nlp.load:Found main folder for dataset datasets/bookcorpus/bookcorpus.py at /home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/datasets/bookcorpus
INFO:nlp.load:Found specific version folder for dataset datasets/bookcorpus/bookcorpus.py at /home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/datasets/bookcorpus/8e84759446cf68d0b0deb3417e60cc331f30a3bbe58843de18a0f48e87d1efd9
INFO:nlp.load:Found script file from datasets/bookcorpus/bookcorpus.py to /home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/datasets/bookcorpus/8e84759446cf68d0b0deb3417e60cc331f30a3bbe58843de18a0f48e87d1efd9/bookcorpus.py
INFO:nlp.load:Couldn't find dataset infos file at datasets/bookcorpus/dataset_infos.json
INFO:nlp.load:Found metadata file for dataset datasets/bookcorpus/bookcorpus.py at /home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/datasets/bookcorpus/8e84759446cf68d0b0deb3417e60cc331f30a3bbe58843de18a0f48e87d1efd9/bookcorpus.json
INFO:filelock:Lock 139795325778640 released on datasets/bookcorpus/bookcorpus.py.lock
INFO:nlp.builder:Overwrite dataset info from restored data version.
INFO:nlp.info:Loading Dataset info from /home/yisiang/.cache/huggingface/datasets/book_corpus/plain_text/1.0.0
Traceback (most recent call last):
  File "nlp-cli", line 37, in <module>
    service.run()
  File "/home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/commands/test.py", line 78, in run
    builders.append(builder_cls(name=config.name, data_dir=self._data_dir))
  File "/home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/builder.py", line 610, in __init__
    super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
  File "/home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/builder.py", line 152, in __init__
    self.info = DatasetInfo.from_directory(self._cache_dir)
  File "/home/yisiang/miniconda3/envs/ml/lib/python3.7/site-packages/nlp/info.py", line 157, in from_directory
    with open(os.path.join(dataset_info_dir, DATASET_INFO_FILENAME), "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/yisiang/.cache/huggingface/datasets/book_corpus/plain_text/1.0.0/dataset_info.json'

By the way, ls /home/yisiang/.cache/huggingface/datasets/book_corpus/plain_text/1.0.0/ shows the directory is empty.

I have also pushed the script to my fork bookcorpus.py.

lhoestq commented 4 years ago

I think you can just rm this directory and it should be good :)
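For reference, the cleanup suggested above can be sketched in Python; the path is the one from the traceback, and clear_stale_cache is a hypothetical helper for illustration, not part of the nlp library:

```python
import os
import shutil

def clear_stale_cache(cache_dir: str) -> bool:
    """Remove a dataset cache directory if it exists.

    Returns True if a directory was removed, False otherwise.
    """
    if os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
        return True
    return False

# The empty directory from the traceback above.
clear_stale_cache(
    "/home/yisiang/.cache/huggingface/datasets/book_corpus/plain_text/1.0.0"
)
```

Re-running the test command after removing the directory should then rebuild the cache from scratch.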

patrickvonplaten commented 4 years ago

@lhoestq - this seems to happen quite often (already the 2nd issue). Can we maybe delete this automatically?

lhoestq commented 4 years ago

Yes, I have an idea of what's going on. I'm sure I can fix that.

richarddwang commented 4 years ago

Hi, I rebased my local copy onto fix-empty-cache-dir and tried to run python nlp-cli test datasets/bookcorpus --save_infos --all_configs again.

I got this:

Traceback (most recent call last):
  File "nlp-cli", line 10, in <module>
    from nlp.commands.run_beam import RunBeamCommand
  File "/home/yisiang/nlp/src/nlp/commands/run_beam.py", line 6, in <module>
    import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'

And after I installed it, I got this:

File "/home/yisiang/nlp/src/nlp/datasets/bookcorpus/aea0bd5142d26df645a8fce23d6110bb95ecb81772bb2a1f29012e329191962c/bookcorpus.py", line 88, in _split_generators
    downloaded_path_or_paths = dl_manager.download_custom(_GDRIVE_FILE_ID, download_file_from_google_drive)
  File "/home/yisiang/nlp/src/nlp/utils/download_manager.py", line 128, in download_custom
    downloaded_path_or_paths = map_nested(url_to_downloaded_path, url_or_urls)
  File "/home/yisiang/nlp/src/nlp/utils/py_utils.py", line 172, in map_nested
    return function(data_struct)
  File "/home/yisiang/nlp/src/nlp/utils/download_manager.py", line 126, in url_to_downloaded_path
    return os.path.join(self._download_config.cache_dir, hash_url_to_filename(url))
  File "/home/yisiang/miniconda3/envs/nlppr/lib/python3.7/posixpath.py", line 80, in join
    a = os.fspath(a)

The problem is that when I print self._download_config.cache_dir in pdb, it is None.

Did I miss something? Or can you provide a workaround first so I can keep testing my script?
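The failure in posixpath.join happens because os.path.join(None, ...) calls os.fspath(None), which raises a TypeError. A minimal sketch of the kind of guard that turns this into an actionable error (cached_path_for is a hypothetical helper for illustration, not the actual fix in nlp):

```python
import os

def cached_path_for(cache_dir, filename):
    """Join cache_dir and filename, failing with a clear message
    when no cache directory has been configured."""
    if cache_dir is None:
        # Without this guard, os.path.join(None, filename) raises
        # TypeError: expected str, bytes or os.PathLike object, not NoneType
        raise ValueError(
            "download_config.cache_dir is None; "
            "set a cache directory before downloading."
        )
    return os.path.join(cache_dir, filename)
```

As a workaround while testing, explicitly setting a cache directory on the download config before calling download_custom should presumably avoid the crash.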

richarddwang commented 4 years ago

I'll close this issue because I brought more reports into another issue, #249.