Closed gaguilar closed 1 year ago
Some Wikipedia languages have already been processed by us and are hosted on our Google storage. This is the case for "fr" and "en", for example.
Other, smaller languages (in terms of bytes) are downloaded and parsed directly from the Wikipedia dump site. Parsing can take some time for languages with hundreds of MB of XML.
Let me know if you encounter an error or if you feel that it is taking too long for you. We could process those that really take too much time.
Ok, thanks for clarifying, that makes sense. I will time those examples later today and post back here.
Also, it seems that not all dumps should use the same date. For instance, I was checking the Spanish dump doing the following:
data = nlp.load_dataset('wikipedia', '20200501.es', beam_runner='DirectRunner', split='train')
I got the error below because this URL does not exist: https://dumps.wikimedia.org/eswiki/20200501/dumpstatus.json. So I checked the actually available dates here https://dumps.wikimedia.org/eswiki/ and there is no 20200501. If one tries a date that is available at that link, the nlp library does not allow such a request because it is not in the list of expected datasets.
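As a quick sanity check before calling load_dataset, one can probe the dumpstatus.json URL directly to see whether a given language/date combination exists on the dump site. This is a minimal sketch; the dump_status_url and dump_exists helpers are my own, not part of the nlp library, though the URL pattern matches the one in the traceback below.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def dump_status_url(lang: str, date: str) -> str:
    # Same URL pattern the wikipedia script requests internally.
    return f"https://dumps.wikimedia.org/{lang}wiki/{date}/dumpstatus.json"

def dump_exists(lang: str, date: str) -> bool:
    """Return True if dumpstatus.json is reachable for this lang/date."""
    try:
        with urlopen(dump_status_url(lang, date), timeout=30) as resp:
            return resp.status == 200
    except (HTTPError, URLError):
        return False
```

For example, `dump_exists("es", "20200501")` would have returned False here, while a date listed at https://dumps.wikimedia.org/eswiki/ should return True.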
Downloading and preparing dataset wikipedia/20200501.es (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguilar/.cache/huggingface/datasets/wikipedia/20200501.es/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
builder_instance.download_and_prepare(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
self._download_and_prepare(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 965, in _download_and_prepare
super(BeamBasedBuilder, self)._download_and_prepare(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 518, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 422, in _split_generators
downloaded_files = dl_manager.download_and_extract({"info": info_url})
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 220, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 155, in download
downloaded_path_or_paths = map_nested(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 163, in map_nested
return {
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 164, in <dictcomp>
k: map_nested(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 191, in map_nested
return function(data_struct)
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 156, in <lambda>
lambda url: cached_path(url, download_config=self._download_config,), url_or_urls,
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/file_utils.py", line 191, in cached_path
output_path = get_from_cache(
File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/file_utils.py", line 356, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://dumps.wikimedia.org/eswiki/20200501/dumpstatus.json
Thanks ! This will be very helpful.
About the date issue, I think it's possible to use another date with
load_dataset("wikipedia", language="es", date="...", beam_runner="...")
However, we've not processed Wikipedia dumps for dates other than 20200501 (yet?)
One more thing that is specific to 20200501.es: it was available once, but mwparserfromhell was not able to parse it for some reason, so we didn't manage to get a processed version of 20200501.es (see #321).
Cool! Thanks for the trick regarding different dates!
I checked the download/processing time for retrieving the Arabic Wikipedia dump, and it took about 3.2 hours. I think that this may be a bit impractical when it comes to working with multiple languages (although I understand that storing those datasets in your Google storage may not be very appealing either).
For the record, here's what I did:
import nlp
import time

def timeit(filename):
    elapsed = time.time()
    data = nlp.load_dataset('wikipedia', filename, beam_runner='DirectRunner', split='train')
    elapsed = time.time() - elapsed
    print(f"Loading the '{filename}' data took {elapsed:,.1f} seconds...")
    return data

data = timeit('20200501.ar')
Here's the output:
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.0k/13.0k [00:00<00:00, 8.34MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 954kB/s]
Downloading and preparing dataset wikipedia/20200501.ar (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguil20/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47.4k/47.4k [00:00<00:00, 1.40MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79.8M/79.8M [00:15<00:00, 5.13MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171M/171M [00:33<00:00, 5.13MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103M/103M [00:20<00:00, 5.14MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 227M/227M [00:44<00:00, 5.06MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140M/140M [00:28<00:00, 4.96MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160M/160M [00:30<00:00, 5.20MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97.5M/97.5M [00:19<00:00, 5.06MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 222M/222M [00:42<00:00, 5.21MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:16<00:00, 196.39s/sources]
Dataset wikipedia downloaded and prepared to /home/gaguil20/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50. Subsequent calls will reuse this data.
Loading the '20200501.ar' data took 11,582.7 seconds...
About the date issue, I think it's possible to use another date with
load_dataset("wikipedia", language="es", date="...", beam_runner="...")
I tried your suggestion about the date, but the function does not accept the language and date keywords. I tried both on nlp v0.4 and the new datasets library (v1.0.2):
load_dataset("wikipedia", language="es", date="20200601", beam_runner='DirectRunner', split='train')
For now, my quick workaround to keep things moving was to simply change the date inside the library at this line: https://github.com/huggingface/datasets/blob/master/datasets/wikipedia/wikipedia.py#L403
Note that the date and language are valid: https://dumps.wikimedia.org/eswiki/20200601/dumpstatus.json
Any suggestion is welcome :) @lhoestq
The workaround I mentioned fetched the data, but then I faced another issue (even the log says to report this as a bug):
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
Here's the full stack (which says that there is a key error caused by this key: KeyError: '000nbsp'):
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 74.7k/74.7k [00:00<00:00, 1.53MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232M/232M [00:48<00:00, 4.75MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 442M/442M [01:39<00:00, 4.44MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 173M/173M [00:33<00:00, 5.12MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 344M/344M [01:14<00:00, 4.59MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 541M/541M [01:59<00:00, 4.52MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476M/476M [01:31<00:00, 5.18MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 545M/545M [02:02<00:00, 4.46MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299M/299M [01:01<00:00, 4.89MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.60M/9.60M [00:01<00:00, 4.84MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 423M/423M [01:36<00:00, 4.38MB/s]
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--lang', 'es', '--date', '20200601', '--tokenizer', 'bert-base-multilingual-cased', '--cache', 'train', 'valid', '--max_dataset_length', '200000', '10000']
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1095, in apache_beam.runners.common._OutputProcessor.process_outputs
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 500, in _clean_content
text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 556, in _parse_and_clean_wikicode
section_text.append(section.strip_code().strip())
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/wikicode.py", line 643, in strip_code
stripped = node.__strip__(**kwargs)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 63, in __strip__
return self.normalize()
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 178, in normalize
return chrfunc(htmlentities.name2codepoint[self.value])
KeyError: '000nbsp'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/raid/data/gustavoag/projects/char2subword/research/preprocessing/split_wiki.py", line 96, in <module>
main()
File "/raid/data/gustavoag/projects/char2subword/research/preprocessing/split_wiki.py", line 65, in main
data = nlp.load_dataset('wikipedia', f'{args.date}.{args.lang}', beam_runner='DirectRunner', split='train')
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
builder_instance.download_and_prepare(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
self._download_and_prepare(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 969, in _download_and_prepare
pipeline_results = pipeline.run()
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/pipeline.py", line 534, in run
return self.runner.run_pipeline(self, self._options)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 119, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 172, in run_pipeline
self._latest_run_result = self.run_via_runner_api(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 183, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 338, in run_stages
stage_results = self._run_stage(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 512, in _run_stage
last_result, deferred_inputs, fired_timers = self._run_bundle(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 556, in _run_bundle
result, splits = bundle_manager.process_bundle(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 940, in process_bundle
for result, split_result in executor.map(execute, zip(part_inputs, # pylint: disable=zip-builtin-not-iterating
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
yield fs.pop().result()
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
raise self._exception
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 932, in execute
return bundle_manager.process_bundle(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 837, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 352, in push
response = self.worker.do_instruction(request)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 479, in do_instruction
return getattr(self, request_type)(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 515, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
input_op_by_transform_id[element.transform_id].process_encoded(
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded
self.output(decoded_value)
File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1030, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1122, in apache_beam.runners.common._OutputProcessor.process_outputs
File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1030, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1122, in apache_beam.runners.common._OutputProcessor.process_outputs
File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1045, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
raise exc.with_traceback(traceback)
File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1095, in apache_beam.runners.common._OutputProcessor.process_outputs
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 500, in _clean_content
text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 556, in _parse_and_clean_wikicode
section_text.append(section.strip_code().strip())
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/wikicode.py", line 643, in strip_code
stripped = node.__strip__(**kwargs)
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 63, in __strip__
return self.normalize()
File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 178, in normalize
return chrfunc(htmlentities.name2codepoint[self.value])
KeyError: "000nbsp [while running 'train/Clean content']"
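For context on the KeyError above: mwparserfromhell normalizes HTML entities by looking up the entity name in a name-to-codepoint table. "nbsp" is a valid entity name, but the malformed "000nbsp" coming out of the dump is not in the table, so the lookup raises KeyError. The stdlib exposes an equivalent table (mwparserfromhell bundles its own copy), which makes the failure mode easy to see in isolation:

```python
from html.entities import name2codepoint

# The valid entity name resolves to U+00A0 (non-breaking space)...
assert chr(name2codepoint["nbsp"]) == "\xa0"

# ...but the malformed name from the dump is simply absent from the table,
# so name2codepoint["000nbsp"] raises KeyError, matching the traceback above.
assert "000nbsp" not in name2codepoint
```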
@lhoestq Any updates on this? I have similar issues with the Romanian dump, tnx.
Hey @gaguilar ,
I just found the "char2subword" paper and I'm really interested in trying it out on my own vocabs/datasets, e.g. for historical texts (I've already trained some LMs on newspaper articles with OCR errors).
Do you plan to release the code for your paper, or is it possible to get the implementation? 🤔 Many thanks :hugs:
Hi @stefan-it! Thanks for your interest in our work! We do plan to release the code, but we will make it available once the paper has been published at a conference. Sorry for the inconvenience!
Hi @lhoestq, do you have any insights for this issue by any chance? Thanks!
This is an issue on the mwparserfromhell side. You could try to update mwparserfromhell and see if it fixes the issue. If it doesn't, we'll have to create an issue on their repo for them to fix it.
But first let's see if the latest version of mwparserfromhell does the job.
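To check which mwparserfromhell version is currently installed before upgrading, a small helper of my own (not part of either library) using the stdlib metadata API (Python 3.8+):

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None

print(installed_version("mwparserfromhell"))
```

Then `pip install --upgrade mwparserfromhell` if it turns out to be behind the latest release.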
I think the workaround suggested in #886 (https://github.com/huggingface/datasets/pull/886) is not working for several languages, such as id. For example, I tried all the dates listed at https://dumps.wikimedia.org/idwiki/ to download the dataset for the id language:
dataset = load_dataset('wikipedia', language='id', date="20210501", beam_runner='DirectRunner')
WARNING:datasets.builder:Using custom data configuration 20210501.id-date=20210501,language=id
Downloading and preparing dataset wikipedia/20210501.id (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /Users/.cache/huggingface/datasets/wikipedia/20210501.id-date=20210501,language=id/0.0.0/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/load.py", line 745, in load_dataset
builder_instance.download_and_prepare(
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 574, in download_and_prepare
self._download_and_prepare(
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 1139, in _download_and_prepare
super(BeamBasedBuilder, self)._download_and_prepare(
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 630, in _download_and_prepare
split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
File "/Users/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 420, in _split_generators
downloaded_files = dl_manager.download_and_extract({"info": info_url})
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 287, in download_and_extract
return self.extract(self.download(url_or_urls))
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 195, in download
downloaded_path_or_paths = map_nested(
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 203, in map_nested
mapped = [
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 204, in <listcomp>
_single_map_nested((function, obj, types, None, True)) for obj in tqdm(iterable, disable=disable_tqdm)
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 142, in _single_map_nested
return function(data_struct)
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 218, in _download
return cached_path(url_or_filename, download_config=download_config)
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
output_path = get_from_cache(
File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 623, in get_from_cache
raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json
Moreover, the download speed for non-en languages is very, very slow. And interestingly, the download stopped after approximately a couple of minutes due to a read time-out. I tried numerous times and the result is the same. Is there any feasible way to download a non-en language using Hugging Face?
File "/Users/miislamg/opt/anaconda3/envs/proj-semlm/lib/python3.9/site-packages/requests/models.py", line 760, in generate
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dumps.wikimedia.org', port=443): Read timed out.
Downloading: 7%|████████▎ | 10.2M/153M [03:35<50:07, 47.4kB/s]
Hi ! The link https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json seems to be working fine for me.
Regarding the time outs, it must come either from an issue on the wikimedia host side, or from your internet connection. Feel free to try again several times.
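If the timeouts do come from flaky connectivity, wrapping pre-flight requests to the dump server in a session with automatic retries can help. A sketch using requests/urllib3; the retry counts and backoff values are arbitrary choices of mine, and this only covers manual checks, not the library's internal downloader:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_retrying_session(total_retries: int = 5, backoff: float = 2.0) -> requests.Session:
    """Build a Session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=[429, 500, 502, 503, 504],  # retry on these HTTP codes
    )
    session = requests.Session()
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

# e.g.:
# session = make_retrying_session()
# session.get("https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json", timeout=60)
```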
I was trying to download the dataset for the es language; however, I am getting the following error:
dataset = load_dataset('wikipedia', language='es', date="20210320", beam_runner='DirectRunner')
Downloading and preparing dataset wikipedia/20210320.es (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /scratch/user_name/datasets/wikipedia/20210320.es-date=20210320,language=es/0.0.0/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1...
Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 492, in _clean_content
text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 548, in _parse_and_clean_wikicode
section_text.append(section.strip_code().strip())
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/wikicode.py", line 639, in strip_code
stripped = node.__strip__(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 60, in __strip__
return self.normalize()
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 150, in normalize
return chr(htmlentities.name2codepoint[self.value])
KeyError: '000nbsp'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "download_dataset_all.py", line 8, in <module>
dataset = load_dataset('wikipedia', language=language, date="20210320", beam_runner='DirectRunner')
File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 748, in load_dataset
use_auth_token=use_auth_token,
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 575, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1152, in _download_and_prepare
pipeline_results = pipeline.run()
File "/opt/conda/lib/python3.7/site-packages/apache_beam/pipeline.py", line 564, in run
return self.runner.run_pipeline(self, self._options)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 131, in run_pipeline
return runner.run_pipeline(pipeline, options)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 190, in run_pipeline
pipeline.to_runner_api(default_environment=self._default_environment))
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 200, in run_via_runner_api
return self.run_stages(stage_context, stages)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 366, in run_stages
bundle_context_manager,
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 562, in _run_stage
bundle_manager)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 602, in _run_bundle
data_input, data_output, input_timers, expected_timer_output)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 903, in process_bundle
result_future = self._worker_handler.control_conn.push(process_bundle_req)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 378, in push
response = self.worker.do_instruction(request)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 610, in do_instruction
getattr(request, request_type), request.instruction_id)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 647, in process_bundle
bundle_processor.process_bundle(instruction_id))
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1001, in process_bundle
element.data)
File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 229, in process_encoded
self.output(decoded_value)
File "apache_beam/runners/worker/operations.py", line 356, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 358, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1300, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1395, in apache_beam.runners.common._OutputProcessor.process_outputs
File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1300, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1395, in apache_beam.runners.common._OutputProcessor.process_outputs
File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
File "/opt/conda/lib/python3.7/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
raise exc.with_traceback(traceback)
File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 492, in _clean_content
text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 548, in _parse_and_clean_wikicode
section_text.append(section.strip_code().strip())
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/wikicode.py", line 639, in strip_code
stripped = node.__strip__(**kwargs)
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 60, in __strip__
return self.normalize()
File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 150, in normalize
return chr(htmlentities.name2codepoint[self.value])
KeyError: "000nbsp [while running 'train/Clean content']"
Hi! This looks related to this issue: https://github.com/huggingface/datasets/issues/1994
Basically, the parser that is used (mwparserfromhell) has some issues for some pages in es.
We already reported some issues for es on their repo at https://github.com/earwig/mwparserfromhell/issues/247, but it looks like there are still a few remaining. It might be a good idea to open a new issue on the mwparserfromhell repo.
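To see why the traceback ends in a KeyError, note that mwparserfromhell's HTMLEntity.normalize() looks the entity name up in the stdlib name2codepoint table. A minimal sketch of that lookup (the normalize helper and its fallback are illustrative assumptions, not the library's actual fix):

```python
from html.entities import name2codepoint

# mwparserfromhell's HTMLEntity.normalize() essentially does:
#     chr(name2codepoint[self.value])
# which raises KeyError for malformed entity names found in some es pages.

def normalize(name):
    """Hypothetical sketch of the lookup, with a fallback added for illustration."""
    try:
        return chr(name2codepoint[name])
    except KeyError:
        return None  # a patched parser would degrade gracefully instead of crashing

print(normalize("nbsp"))     # valid entity -> non-breaking space (U+00A0)
print(normalize("000nbsp"))  # malformed name from the traceback above -> None
```

The name "000nbsp" from the traceback is simply not a valid HTML entity name, which is why the unpatched code path crashes mid-pipeline.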
Any updates on this so far?
The issue:
KeyError: "000nbsp [while running 'train/Clean content']"
reported in the comments above should now be fixed in the mwparserfromhell library and will be available in their next release, version 0.7:
mwparserfromhell 0.7 has still not been released, but you might have luck with the dev version:
pip install git+https://github.com/earwig/mwparserfromhell.git@0f89f44
Hi,
I am working with the wikipedia dataset and I have a script that goes over 92 of the available languages in that dataset. So far I have found that ar, af, and an are not loading, while other languages like fr and en are working fine. Here's how I am loading them:
Here's what I see for 'ar' (it gets stuck there):
Note that those languages are indeed in the list of expected languages. Any suggestions on how to work around this? Thanks!
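The loading snippet itself was elided above; a minimal sketch of what such a per-language loop might look like (the helper name, language subset, and printout are assumptions, not the original script):

```python
# Hypothetical sketch of iterating over Wikipedia language configs with the
# datasets library; the config_for helper and the language subset are assumed.

def config_for(lang, date="20200501"):
    """Build the wikipedia config name passed to load_dataset, e.g. '20200501.ar'."""
    return f"{date}.{lang}"

langs = ["ar", "af", "an", "fr", "en"]  # subset of the 92 languages

for lang in langs:
    cfg = config_for(lang)
    print(f"loading {cfg} ...")
    # Actual loading is network-heavy; shown commented out for reference:
    # from datasets import load_dataset
    # data = load_dataset("wikipedia", cfg, beam_runner="DirectRunner", split="train")
```

Timing each iteration (e.g. with time.time() before and after the load call) would make it easy to report which languages stall.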