Closed Rajan-sust closed 4 years ago
apache-beam is not automatically installed with tfds, so you'll have to install it manually, usually with pip.
I saw [1] for downloading the Wikipedia dataset, but it is not clear to me. @Conchylicultor, could you please provide an example code snippet?
pip install apache_beam
apache_beam is a dependency specific to wikipedia that is not installed by default with tfds, so you have to install it yourself.
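Before starting a long download, it can help to confirm that the optional dependency is importable. The following is a minimal sketch using only the standard library; the helper name is_module_available is hypothetical and not part of tfds.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found on the current Python path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_module_available("apache_beam"):
        print("apache_beam is installed")
    else:
        print("apache_beam is missing; install it with: pip install apache_beam")
```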
To install wikipedia:
python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wikipedia/<config_name>
Note that this may take some time, so you may want to run it on Google Cloud using Dataflow instead.
My set-up:
pip install tensorflow-datasets==1.3.0
pip install apache_beam
!python -m tensorflow_datasets.scripts.download_and_prepare \
--datasets=wikipedia/20190301.bn
The config_name was taken from [1], but the command ended with an error.
Error:
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://dumps.wikimedia.your.org/bnwiki/20190301/dumpstatus.json. HTTP code: 404.
Because this URL does not exist anymore.
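To check whether a given dump is still published before running the full script, something like the sketch below may help. The URL pattern is an assumption inferred only from the failing URL in the error message above, and the function names dump_status_url and dump_exists are hypothetical; tfds may build the URL differently for some languages.

```python
import urllib.error
import urllib.request

# Assumption: pattern inferred from the single failing URL quoted in the error.
DUMP_STATUS_URL = "https://dumps.wikimedia.your.org/{lang}wiki/{date}/dumpstatus.json"


def dump_status_url(config_name: str) -> str:
    """Build the dumpstatus.json URL for a config name such as '20190301.bn'."""
    date, sep, lang = config_name.partition(".")
    if not sep or not date or not lang:
        raise ValueError("expected a config name of the form <date>.<language>")
    return DUMP_STATUS_URL.format(lang=lang, date=date)


def dump_exists(config_name: str, timeout: float = 10.0) -> bool:
    """Return True if the status file is reachable, False on an HTTP error such as 404."""
    request = urllib.request.Request(dump_status_url(config_name), method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```

Calling dump_exists("20190301.bn") performs a network request, so other connection failures (for example, no network) are left to propagate rather than being reported as a missing dump.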
That's a very good point @Rajan-sust.
It seems that Wikipedia is cleaning up its old dump statuses (for example, if you try 20191001, it does work).
We currently have the date hardcoded at https://github.com/tensorflow/datasets/blob/master/tensorflow_datasets/text/wikipedia.py#L120, which of course stopped working.
Please don't hesitate to send a PR with the updated config and checksums. However, this would have to be done every few months, as the Wikipedia data is only kept for a limited amount of time.
Not sure there is a long term solution for this.
As per the latest changes, 20200301 is the latest date, so I changed the date to 20200301 in wikipedia.py and ran this script locally:
python -m tensorflow_datasets.scripts.download_and_prepare --register_checksums --datasets=wikipedia/20200301
but got this error:
Traceback (most recent call last):
  File "C:\Users\eshan\Anaconda3\envs\keras-gpu\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\eshan\Anaconda3\envs\keras-gpu\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\scripts\download_and_prepare.py", line 219, in <module>
    app.run(main)
  File "C:\Users\eshan\Anaconda3\envs\keras-gpu\lib\site-packages\absl\app.py", line 299, in run
    _run_main(main, args)
  File "C:\Users\eshan\Anaconda3\envs\keras-gpu\lib\site-packages\absl\app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\scripts\download_and_prepare.py", line 184, in main
    for name in datasets_to_build
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\scripts\download_and_prepare.py", line 184, in <listcomp>
    for name in datasets_to_build
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\core\registered.py", line 172, in builder
    return _DATASET_REGISTRYname
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\core\dataset_builder.py", line 1115, in __init__
    super(BeamBasedBuilder, self).__init__(*args, **kwargs)
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\core\api_utils.py", line 52, in disallow_positional_args_dec
    return fn(*args, **kwargs)
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\core\dataset_builder.py", line 188, in __init__
    self._builder_config = self._create_builder_config(config)
  File "C:\Users\eshan\Desktop\Tensoflow\Tfds\datasets\tensorflow_datasets\core\dataset_builder.py", line 792, in _create_builder_config
    (name, list(self.builder_configs.keys())))
ValueError: BuilderConfig 20200301 not found.
Available: ['20200301.aa', '20200301.ab', '20200301.ace', '20200301.ady', '20200301.af', '20200301.ak', '20200301.als', '20200301.am', '20200301.an', '20200301.ang', '20200301.ar', '20200301.arc', '20200301.arz', '20200301.as', '20200301.ast', '20200301.atj', '20200301.av', '20200301.ay', '20200301.az', '20200301.azb', '20200301.ba', '20200301.bar', '20200301.bat-smg', '20200301.bcl', '20200301.be', '20200301.be-x-old', '20200301.bg', '20200301.bh', '20200301.bi', '20200301.bjn', '20200301.bm', '20200301.bn', '20200301.bo', '20200301.bpy', '20200301.br', '20200301.bs', '20200301.bug', '20200301.bxr', '20200301.ca', '20200301.cbk-zam', '20200301.cdo', '20200301.ce', '20200301.ceb', '20200301.ch', '20200301.cho', '20200301.chr', '20200301.chy', '20200301.ckb', '20200301.co', '20200301.cr', '20200301.crh', '20200301.cs', '20200301.csb', '20200301.cu', '20200301.cv', '20200301.cy', '20200301.da', '20200301.de', '20200301.din', '20200301.diq', '20200301.dsb', '20200301.dty', '20200301.dv', '20200301.dz', '20200301.ee', '20200301.el', '20200301.eml', '20200301.en', '20200301.eo', '20200301.es', '20200301.et', '20200301.eu', '20200301.ext', '20200301.fa', '20200301.ff', '20200301.fi', '20200301.fiu-vro', '20200301.fj', '20200301.fo', '20200301.fr', '20200301.frp', '20200301.frr', '20200301.fur', '20200301.fy', '20200301.ga', '20200301.gag', '20200301.gan', '20200301.gd', '20200301.gl', '20200301.glk', '20200301.gn', '20200301.gom', '20200301.gor', '20200301.got', '20200301.gu', '20200301.gv', '20200301.ha', '20200301.hak', '20200301.haw', '20200301.he', '20200301.hi', '20200301.hif', '20200301.ho', '20200301.hr', '20200301.hsb', '20200301.ht', '20200301.hu', '20200301.hy', '20200301.ia', '20200301.id', '20200301.ie', '20200301.ig', '20200301.ii', '20200301.ik', '20200301.ilo', '20200301.inh', '20200301.io', '20200301.is', '20200301.it', '20200301.iu', '20200301.ja', '20200301.jam', '20200301.jbo', '20200301.jv', '20200301.ka', '20200301.kaa', '20200301.kab', 
'20200301.kbd', '20200301.kbp', '20200301.kg', '20200301.ki', '20200301.kj', '20200301.kk', '20200301.kl', '20200301.km', '20200301.kn', '20200301.ko', '20200301.koi', '20200301.krc', '20200301.ks', '20200301.ksh', '20200301.ku', '20200301.kv', '20200301.kw', '20200301.ky', '20200301.la', '20200301.lad', '20200301.lb', '20200301.lbe', '20200301.lez', '20200301.lfn', '20200301.lg', '20200301.li', '20200301.lij', '20200301.lmo', '20200301.ln', '20200301.lo', '20200301.lrc', '20200301.lt', '20200301.ltg', '20200301.lv', '20200301.mai', '20200301.map-bms', '20200301.mdf', '20200301.mg', '20200301.mh', '20200301.mhr', '20200301.mi', '20200301.min', '20200301.mk', '20200301.ml', '20200301.mn', '20200301.mr', '20200301.mrj', '20200301.ms', '20200301.mt', '20200301.mus', '20200301.mwl', '20200301.my', '20200301.myv', '20200301.mzn', '20200301.na', '20200301.nah', '20200301.nap', '20200301.nds', '20200301.nds-nl', '20200301.ne', '20200301.new', '20200301.ng', '20200301.nl', '20200301.nn', '20200301.no', '20200301.nov', '20200301.nrm', '20200301.nso', '20200301.nv', '20200301.ny', '20200301.oc', '20200301.olo', '20200301.om', '20200301.or', '20200301.os', '20200301.pa', '20200301.pag', '20200301.pam', '20200301.pap', '20200301.pcd', '20200301.pdc', '20200301.pfl', '20200301.pi', '20200301.pih', '20200301.pl', '20200301.pms', '20200301.pnb', '20200301.pnt', '20200301.ps', '20200301.pt', '20200301.qu', '20200301.rm', '20200301.rmy', '20200301.rn', '20200301.ro', '20200301.roa-rup', '20200301.roa-tara', '20200301.ru', '20200301.rue', '20200301.rw', '20200301.sa', '20200301.sah', '20200301.sat', '20200301.sc', '20200301.scn', '20200301.sco', '20200301.sd', '20200301.se', '20200301.sg', '20200301.sh', '20200301.si', '20200301.simple', '20200301.sk', '20200301.sl', '20200301.sm', '20200301.sn', '20200301.so', '20200301.sq', '20200301.sr', '20200301.srn', '20200301.ss', '20200301.st', '20200301.stq', '20200301.su', '20200301.sv', '20200301.sw', '20200301.szl', '20200301.ta', 
'20200301.tcy', '20200301.te', '20200301.tet', '20200301.tg', '20200301.th', '20200301.ti', '20200301.tk', '20200301.tl', '20200301.tn', '20200301.to', '20200301.tpi', '20200301.tr', '20200301.ts', '20200301.tt', '20200301.tum', '20200301.tw', '20200301.ty', '20200301.tyv', '20200301.udm', '20200301.ug', '20200301.uk', '20200301.ur', '20200301.uz', '20200301.ve', '20200301.vec', '20200301.vep', '20200301.vi', '20200301.vls', '20200301.vo', '20200301.wa', '20200301.war', '20200301.wo', '20200301.wuu', '20200301.xal', '20200301.xh', '20200301.xmf', '20200301.yi', '20200301.yo', '20200301.za', '20200301.zea', '20200301.zh', '20200301.zh-classical', '20200301.zh-min-nan', '20200301.zh-yue', '20200301.zu']
Can this error also be caused by low disk space?
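The ValueError above lists only config names of the form <date>.<language> (for example 20200301.bn), so the bare date 20200301 is not itself a config. A minimal sketch of checking a requested name against such a list follows; the helper matching_configs is hypothetical and not part of tfds, and the short available list is a hand-copied subset of the names printed in the error.

```python
from typing import List


def matching_configs(requested: str, available: List[str]) -> List[str]:
    """Return [requested] if it is an exact config name, else all configs for that date."""
    if requested in available:
        return [requested]
    prefix = requested + "."
    return [name for name in available if name.startswith(prefix)]


if __name__ == "__main__":
    # Subset of the config names listed in the error message.
    available = ["20200301.aa", "20200301.bn", "20200301.en"]
    candidates = matching_configs("20200301", available)
    print("Did you mean one of:", candidates)
```

With a full config name, the command from the comment above would become --datasets=wikipedia/20200301.bn instead of --datasets=wikipedia/20200301.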
Short description
ModuleNotFoundError: No module named 'apache_beam'
Environment information
tensorflow-datasets/tfds-nightly version: tensorflow-datasets 1.3.0
tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version: tensorflow 2.0
Reproduction instructions
Errors
Downloading and preparing dataset wikipedia (44.09 KiB) to /root/tensorflow_datasets/wikipedia/20190301.aa/0.0.3...
ModuleNotFoundError                       Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow_datasets/core/lazy_imports_lib.py in _try_import(module_name)
     29 try:
---> 30 mod = importlib.import_module(module_name)
     31 return mod
19 frames
ModuleNotFoundError: No module named 'apache_beam'
During handling of the above exception, another exception occurred:
ModuleNotFoundError                       Traceback (most recent call last)
/usr/lib/python3.6/importlib/_bootstrap.py in _find_and_load_unlocked(name, import_)
ModuleNotFoundError: No module named 'apache_beam'
Tried importing %s but failed. See setup.py extras_require. The dataset you are trying to use may have additional dependencies.