tensorflow / datasets

TFDS is a collection of datasets ready to use with TensorFlow, Jax, ...
https://www.tensorflow.org/datasets
Apache License 2.0

cnn_dailymail with cloud storage does not work #3747

Open · marton-avrios opened this issue 2 years ago

marton-avrios commented 2 years ago

tfds build cnn_dailymail works, but tfds build cnn_dailymail --data_dir="gs://my-bucket/tensorflow_datasets" does not: it gets stuck.
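
For reference, the CLI log below shows build.py calling download_and_prepare, so the roughly equivalent Python call would be (a minimal sketch, reusing the placeholder bucket path from above):

import tensorflow_datasets as tfds

# Same dataset and data_dir as the failing CLI invocation above.
builder = tfds.builder("cnn_dailymail", data_dir="gs://my-bucket/tensorflow_datasets")
builder.download_and_prepare()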

Ubuntu 18.04, Python 3.6.9, TensorFlow 2.6.2, tfds-nightly 4.5.0.dev202201310107.

Output:

2022-02-04 14:20:13.505672: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-02-04 14:20:13.505725: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-02-04 14:20:19.933676: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-02-04 14:20:19.933738: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-02-04 14:20:19.933775: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (t5-pretraining): /proc/driver/nvidia/version does not exist
INFO[build.py]: Loading dataset cnn_dailymail from imports: tensorflow_datasets.summarization.cnn_dailymail
INFO[dataset_info.py]: Load pre-computed DatasetInfo (eg: splits, num examples,...) from GCS: cnn_dailymail/3.2.0
INFO[dataset_info.py]: Load dataset info from /tmp/tmp9wp5avwhtfds
INFO[dataset_info.py]: Field info.release_notes from disk and from code do not match. Keeping the one from code.
INFO[build.py]: download_and_prepare for dataset cnn_dailymail/3.2.0...
INFO[dataset_builder.py]: Generating dataset cnn_dailymail (gs://avr-datasets/tensorflow_datasets/cnn_dailymail/3.2.0)
Downloading and preparing dataset 558.32 MiB (download: 558.32 MiB, generated: 1.27 GiB, total: 1.82 GiB) to gs://avr-datasets/tensorflow_datasets/cnn_dailymail/3.2.0...
INFO[download_manager.py]: Skipping download of https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ: File cached in gs://avr-datasets/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfTHk4NFg2SndK6PvAAn5U4KkWq9nJaes19wjtFGfX7047F6VnOdZcsgA
INFO[download_manager.py]: Skipping download of https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfM1BxdkxVaTY2bWs: File cached in gs://avr-datasets/tensorflow_datasets/downloads/ucexport_download_id_0BwmD_VLjROrfM1BxdkxVaTY2rWkBAAIhC3xAZxgkjuZuZYaLn2gg8WOqlmNph40UFH4
INFO[download_manager.py]: Skipping download of https://raw.githubusercontent.com/abisee/cnn-dailymail/master/url_lists/all_test.txt: File cached in gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_axPXvtewhJkMKXBVu-9E9DpxMtJAWnlUsOLSlGYGgCb0.txt
INFO[download_manager.py]: Skipping extraction for gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_axPXvtewhJkMKXBVu-9E9DpxMtJAWnlUsOLSlGYGgCb0.txt (method=NO_EXTRACT).
INFO[download_manager.py]: Skipping download of https://raw.githubusercontent.com/abisee/cnn-dailymail/master/url_lists/all_train.txt: File cached in gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_apc7knzpshiwmzikwgjbSqZYlq2yGpDviLVIGsnkNgCk.txt
INFO[download_manager.py]: Skipping extraction for gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_apc7knzpshiwmzikwgjbSqZYlq2yGpDviLVIGsnkNgCk.txt (method=NO_EXTRACT).
INFO[download_manager.py]: Skipping download of https://raw.githubusercontent.com/abisee/cnn-dailymail/master/url_lists/all_val.txt: File cached in gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_agYh-mCsEUINAnG7oOK7ej_S5cpFgW8-yG__EVqFpkds.txt
INFO[download_manager.py]: Skipping extraction for gs://avr-datasets/tensorflow_datasets/downloads/raw.gith.com_abis_cnn-dail_mast_url_list_agYh-mCsEUINAnG7oOK7ej_S5cpFgW8-yG__EVqFpkds.txt (method=NO_EXTRACT).
Dl Completed...: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:02<00:00,  1.65 url/s]
Dl Size...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 585439472/585439472 [00:19<00:00, 937175759.00 MiB/s]
Extraction completed...:   0%|                                                                                                                                                     | 0/2 [00:02<?, ? file/s]
Conchylicultor commented 2 years ago

Unfortunately, extracting many small files directly onto GCS is very slow.

If you can, extract locally first, then copy the files to GCS with gsutil -m cp so they are transferred in parallel (see the sketch below).
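
A minimal sketch of that workaround, assuming gsutil is installed and authenticated, with placeholder paths:

import subprocess
import tensorflow_datasets as tfds

# 1. Download and prepare on local disk, where many small file writes are cheap.
builder = tfds.builder("cnn_dailymail", data_dir="/tmp/tensorflow_datasets")
builder.download_and_prepare()

# 2. Copy the prepared dataset to the bucket; -m parallelizes the transfer.
subprocess.run(
    ["gsutil", "-m", "cp", "-r",
     "/tmp/tensorflow_datasets/cnn_dailymail",
     "gs://my-bucket/tensorflow_datasets/"],
    check=True,
)

Afterwards the dataset can be read straight from the bucket by passing data_dir="gs://my-bucket/tensorflow_datasets" to tfds.load.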

From the TFDS side, we should try to parallelize extraction when writing to GCS.
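
For illustration only (this is not TFDS code), the idea would be to run the per-archive extraction concurrently so the GCS write latency overlaps; extract_one below is a hypothetical per-archive extraction callback:

from concurrent.futures import ThreadPoolExecutor

def extract_all(archives, extract_one, max_workers=16):
    # Extraction to GCS is I/O-bound, so running archives concurrently
    # hides per-file write latency instead of paying it sequentially.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_one, archives))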