huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Some languages in wikipedia dataset are not loading #577

Closed gaguilar closed 1 year ago

gaguilar commented 4 years ago

Hi,

I am working with the wikipedia dataset and I have a script that goes over 92 of the available languages in that dataset. So far I have detected that ar, af, and an are not loading. Other languages like fr and en are working fine. Here's how I am loading them:

import nlp

langs = ['ar', 'af', 'an']

for lang in langs:
    data = nlp.load_dataset('wikipedia', f'20200501.{lang}', beam_runner='DirectRunner', split='train') 
    print(lang, len(data))

Here's what I see for 'ar' (it gets stuck there):

Downloading and preparing dataset wikipedia/20200501.ar (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguilar/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...

Note that those languages are indeed in the list of expected languages. Any suggestions on how to work around this? Thanks!

lhoestq commented 4 years ago

Some wikipedia languages have already been processed by us and are hosted on our Google storage. This is the case for "fr" and "en", for example.

Other, smaller languages (in terms of bytes) are downloaded and parsed directly from the wikipedia dump site. Parsing can take some time for languages with hundreds of MB of XML.

Let me know if you encounter an error or if you feel that it is taking too long for you. We could process those that really take too much time.

gaguilar commented 4 years ago

Ok, thanks for clarifying, that makes sense. I will time those examples later today and post back here.

Also, it seems that not all dumps are available for the same date. For instance, I was checking the Spanish dump by doing the following:

data = nlp.load_dataset('wikipedia', '20200501.es', beam_runner='DirectRunner', split='train')

I got the error below because this URL does not exist: https://dumps.wikimedia.org/eswiki/20200501/dumpstatus.json. So I checked the actual available dates here https://dumps.wikimedia.org/eswiki/ and there is no 20200501. If one tries a date that is available in that link, the nlp library does not allow such a request because it is not in the list of expected datasets (a quick programmatic check for available dates is sketched after the traceback below).

Downloading and preparing dataset wikipedia/20200501.es (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguilar/.cache/huggingface/datasets/wikipedia/20200501.es/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 965, in _download_and_prepare
    super(BeamBasedBuilder, self)._download_and_prepare(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 518, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 422, in _split_generators
    downloaded_files = dl_manager.download_and_extract({"info": info_url})
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 220, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 155, in download
    downloaded_path_or_paths = map_nested(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 163, in map_nested
    return {
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 164, in <dictcomp>
    k: map_nested(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/py_utils.py", line 191, in map_nested
    return function(data_struct)
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/download_manager.py", line 156, in <lambda>
    lambda url: cached_path(url, download_config=self._download_config,), url_or_urls,
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/file_utils.py", line 191, in cached_path
    output_path = get_from_cache(
  File "/home/gaguilar/.conda/envs/pytorch/lib/python3.8/site-packages/nlp/utils/file_utils.py", line 356, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://dumps.wikimedia.org/eswiki/20200501/dumpstatus.json
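
For reference, a quick way to check whether a dump exists for a given language and date is to request the dumpstatus.json URL directly before calling load_dataset. Below is a minimal sketch using the requests library; the language and date are just example values:

import requests

# Example values only; see https://dumps.wikimedia.org/eswiki/ for the dates that actually exist.
lang, date = 'es', '20200601'
url = f'https://dumps.wikimedia.org/{lang}wiki/{date}/dumpstatus.json'
response = requests.get(url)

# A 200 status code means the dump for that date exists; a 404 means it does not.
print(url, '->', response.status_code)
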
lhoestq commented 4 years ago

Thanks ! This will be very helpful.

About the date issue, I think it's possible to use another date with

load_dataset("wikipedia", language="es", date="...", beam_runner="...")

However, we've not processed wikipedia dumps for dates other than 20200501 (yet?)

One more thing that is specific to 20200501.es: it was available at one point, but mwparserfromhell was not able to parse it for some reason, so we didn't manage to get a processed version of 20200501.es (see #321).

gaguilar commented 4 years ago

Cool! Thanks for the trick regarding different dates!

I checked the download/processing time for retrieving the Arabic Wikipedia dump, and it took about 3.2 hours. I think that this may be a bit impractical when it comes to working with multiple languages (although I understand that storing those datasets in your Google storage may not be very appealing either).

For the record, here's what I did:

import nlp
import time

def timeit(filename):
    elapsed = time.time()
    data = nlp.load_dataset('wikipedia', filename, beam_runner='DirectRunner', split='train')
    elapsed = time.time() - elapsed
    print(f"Loading the '{filename}' data took {elapsed:,.1f} seconds...")
    return data

data = timeit('20200501.ar')

Here's the output:

Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13.0k/13.0k [00:00<00:00, 8.34MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28.7k/28.7k [00:00<00:00, 954kB/s]
Downloading and preparing dataset wikipedia/20200501.ar (download: Unknown size, generated: Unknown size, post-processed: Unknown sizetotal: Unknown size) to /home/gaguil20/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50...
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 47.4k/47.4k [00:00<00:00, 1.40MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 79.8M/79.8M [00:15<00:00, 5.13MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 171M/171M [00:33<00:00, 5.13MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 103M/103M [00:20<00:00, 5.14MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 227M/227M [00:44<00:00, 5.06MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 140M/140M [00:28<00:00, 4.96MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160M/160M [00:30<00:00, 5.20MB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 97.5M/97.5M [00:19<00:00, 5.06MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 222M/222M [00:42<00:00, 5.21MB/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [03:16<00:00, 196.39s/sources]
Dataset wikipedia downloaded and prepared to /home/gaguil20/.cache/huggingface/datasets/wikipedia/20200501.ar/1.0.0/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50. Subsequent calls will reuse this data.
Loading the '20200501.ar' data took 11,582.7 seconds...
gaguilar commented 3 years ago

About the date issue, I think it's possible to use another date with

load_dataset("wikipedia", language="es", date="...", beam_runner="...")

I tried your suggestion about the date, but the function does not accept the language and date keywords. I tried it on both nlp v0.4 and the new datasets library (v1.0.2):

load_dataset("wikipedia", language="es", date="20200601", beam_runner='DirectRunner', split='train')

For now, my quick workaround to keep things moving was to simply change the date inside the library at this line: https://github.com/huggingface/datasets/blob/master/datasets/wikipedia/wikipedia.py#L403

Note that the date and language are valid: https://dumps.wikimedia.org/eswiki/20200601/dumpstatus.json

Any suggestion is welcome :) @lhoestq

[UPDATE]

The workaround I mentioned fetched the data, but then I faced another issue (the log even says to report this as a bug):

ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.

Here's the full stack trace (which shows a KeyError caused by the key '000nbsp'):

Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 74.7k/74.7k [00:00<00:00, 1.53MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 232M/232M [00:48<00:00, 4.75MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 442M/442M [01:39<00:00, 4.44MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 173M/173M [00:33<00:00, 5.12MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 344M/344M [01:14<00:00, 4.59MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 541M/541M [01:59<00:00, 4.52MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 476M/476M [01:31<00:00, 5.18MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 545M/545M [02:02<00:00, 4.46MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 299M/299M [01:01<00:00, 4.89MB/s]
Downloading: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9.60M/9.60M [00:01<00:00, 4.84MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 423M/423M [01:36<00:00, 4.38MB/s]
WARNING:apache_beam.options.pipeline_options:Discarding unparseable args: ['--lang', 'es', '--date', '20200601', '--tokenizer', 'bert-base-multilingual-cased', '--cache', 'train', 'valid', '--max_dataset_length', '200000', '10000']

ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
ERROR:root:mwparserfromhell ParseError: This is a bug and should be reported. Info: C tokenizer exited with non-empty token stack.
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1095, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 500, in _clean_content
    text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 556, in _parse_and_clean_wikicode
    section_text.append(section.strip_code().strip())
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/wikicode.py", line 643, in strip_code
    stripped = node.__strip__(**kwargs)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 63, in __strip__
    return self.normalize()
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 178, in normalize
    return chrfunc(htmlentities.name2codepoint[self.value])
KeyError: '000nbsp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/raid/data/gustavoag/projects/char2subword/research/preprocessing/split_wiki.py", line 96, in <module>
    main()
  File "/raid/data/gustavoag/projects/char2subword/research/preprocessing/split_wiki.py", line 65, in main
    data = nlp.load_dataset('wikipedia', f'{args.date}.{args.lang}', beam_runner='DirectRunner', split='train')
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/load.py", line 548, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 462, in download_and_prepare
    self._download_and_prepare(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/builder.py", line 969, in _download_and_prepare
    pipeline_results = pipeline.run()
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/pipeline.py", line 534, in run
    return self.runner.run_pipeline(self, self._options)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/direct/direct_runner.py", line 119, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 172, in run_pipeline
    self._latest_run_result = self.run_via_runner_api(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 183, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 338, in run_stages
    stage_results = self._run_stage(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 512, in _run_stage
    last_result, deferred_inputs, fired_timers = self._run_bundle(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 556, in _run_bundle
    result, splits = bundle_manager.process_bundle(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 940, in process_bundle
    for result, split_result in executor.map(execute, zip(part_inputs,  # pylint: disable=zip-builtin-not-iterating
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 611, in result_iterator
    yield fs.pop().result()
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/utils/thread_pool_executor.py", line 44, in run
    self._future.set_result(self._fn(*self._fn_args, **self._fn_kwargs))
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 932, in execute
    return bundle_manager.process_bundle(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 837, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 352, in push
    response = self.worker.do_instruction(request)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 479, in do_instruction
    return getattr(self, request_type)(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 515, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 977, in process_bundle
    input_op_by_transform_id[element.transform_id].process_encoded(
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/apache_beam/runners/worker/bundle_processor.py", line 218, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 330, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1030, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1122, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1030, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1122, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 963, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1045, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 961, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 553, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1095, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 500, in _clean_content
    text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/nlp/datasets/wikipedia/7be7f4324255faf70687be8692de57cf79197afdc33ff08d6a04ed602df32d50/wikipedia.py", line 556, in _parse_and_clean_wikicode
    section_text.append(section.strip_code().strip())
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/wikicode.py", line 643, in strip_code
    stripped = node.__strip__(**kwargs)
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 63, in __strip__
    return self.normalize()
  File "/home/gustavoag/anaconda3/envs/pytorch/lib/python3.8/site-packages/mwparserfromhell/nodes/html_entity.py", line 178, in normalize
    return chrfunc(htmlentities.name2codepoint[self.value])
KeyError: "000nbsp [while running 'train/Clean content']"
D-Roberts commented 3 years ago

@lhoestq Any updates on this? I have similar issues with the Romanian dump, tnx.

stefan-it commented 3 years ago

Hey @gaguilar ,

I just found the "char2subword" paper and I'm really interested in trying it out on my own vocabs/datasets, e.g. for historical texts (I've already trained some LMs on newspaper articles with OCR errors).

Do you plan to release the code for your paper, or would it be possible to get the implementation? 🤔 Many thanks :hugs:

gaguilar commented 3 years ago

Hi @stefan-it! Thanks for your interest in our work! We do plan to release the code, but we will make it available once the paper has been published at a conference. Sorry for the inconvenience!

Hi @lhoestq, do you have any insights for this issue by any chance? Thanks!

lhoestq commented 3 years ago

This is an issue on the mwparserfromhell side. You could try to update mwparserfromhell and see if it fixes the issue. If it doesn't, we'll have to create an issue on their repo for them to fix it. But first let's see if the latest version of mwparserfromhell does the job.
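
For example, after upgrading with pip install --upgrade mwparserfromhell, a quick smoke test of strip_code on a small piece of wikitext looks like this (just a sketch; this sample will not necessarily reproduce the exact page that fails):

import mwparserfromhell

# Parse a small wikitext sample containing bold markup, an HTML entity, a link and a template,
# then strip the markup -- the same strip_code() call that raises in the tracebacks above.
sample = "'''Ejemplo'''&nbsp;de [[wikitexto]] con una {{plantilla}}."
wikicode = mwparserfromhell.parse(sample)
print(wikicode.strip_code())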

mmiakashs commented 3 years ago

I think the workaround suggested in #886 (https://github.com/huggingface/datasets/pull/886) is not working for several languages, such as id. For example, I tried all the dates listed at https://dumps.wikimedia.org/idwiki/ to download the dataset for the id language:

dataset = load_dataset('wikipedia', language='id', date="20210501", beam_runner='DirectRunner')

WARNING:datasets.builder:Using custom data configuration 20210501.id-date=20210501,language=id
Downloading and preparing dataset wikipedia/20210501.id (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /Users/.cache/huggingface/datasets/wikipedia/20210501.id-date=20210501,language=id/0.0.0/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/load.py", line 745, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 574, in download_and_prepare
    self._download_and_prepare(
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 1139, in _download_and_prepare
    super(BeamBasedBuilder, self)._download_and_prepare(
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/builder.py", line 630, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "/Users/.cache/huggingface/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 420, in _split_generators
    downloaded_files = dl_manager.download_and_extract({"info": info_url})
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 287, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 195, in download
    downloaded_path_or_paths = map_nested(
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 203, in map_nested
    mapped = [
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 204, in <listcomp>
    _single_map_nested((function, obj, types, None, True)) for obj in tqdm(iterable, disable=disable_tqdm)
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 142, in _single_map_nested
    return function(data_struct)
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/download_manager.py", line 218, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 281, in cached_path
    output_path = get_from_cache(
  File "/Users/opt/anaconda3/envs/proj/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 623, in get_from_cache
    raise ConnectionError("Couldn't reach {}".format(url))
ConnectionError: Couldn't reach https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json

Moreover, the download speed for non-en languages is very slow, and interestingly the download stopped after approximately a couple of minutes due to a read timeout. I tried numerous times and the result is the same. Is there any feasible way to download non-en languages using huggingface?

  File "/Users/miislamg/opt/anaconda3/envs/proj-semlm/lib/python3.9/site-packages/requests/models.py", line 760, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='dumps.wikimedia.org', port=443): Read timed out.
Downloading:   7%|████████▎         | 10.2M/153M [03:35<50:07, 47.4kB/s]

lhoestq commented 3 years ago

Hi ! The link https://dumps.wikimedia.org/idwiki/20210501/dumpstatus.json seems to be working fine for me.

Regarding the timeouts, they must come either from an issue on the wikimedia host side or from your internet connection. Feel free to try again several times.
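
If it keeps timing out, one low-tech option is to retry the call in a loop with a pause between attempts. A minimal sketch (the retry count and wait time below are arbitrary):

import time
import requests
from datasets import load_dataset

dataset = None
for attempt in range(5):
    try:
        dataset = load_dataset('wikipedia', language='id', date='20210501', beam_runner='DirectRunner')
        break
    except (ConnectionError, requests.exceptions.ConnectionError) as err:
        # datasets raises a built-in ConnectionError when a URL can't be reached;
        # requests raises its own ConnectionError on read timeouts.
        print(f'Attempt {attempt + 1} failed: {err}; retrying in 60 seconds...')
        time.sleep(60)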

mmiakashs commented 3 years ago

I was trying to download the dataset for the es language; however, I am getting the following error:

dataset = load_dataset('wikipedia', language='es', date="20210320", beam_runner='DirectRunner') 
Downloading and preparing dataset wikipedia/20210320.es (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /scratch/user_name/datasets/wikipedia/20210320.es-date=20210320,language=es/0.0.0/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1...
Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 492, in _clean_content
    text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
  File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 548, in _parse_and_clean_wikicode
    section_text.append(section.strip_code().strip())
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/wikicode.py", line 639, in strip_code
    stripped = node.__strip__(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 60, in __strip__
    return self.normalize()
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 150, in normalize
    return chr(htmlentities.name2codepoint[self.value])
KeyError: '000nbsp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_dataset_all.py", line 8, in <module>
    dataset = load_dataset('wikipedia', language=language, date="20210320", beam_runner='DirectRunner') 
  File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 748, in load_dataset
    use_auth_token=use_auth_token,
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 575, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1152, in _download_and_prepare
    pipeline_results = pipeline.run()
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/pipeline.py", line 564, in run
    return self.runner.run_pipeline(self, self._options)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/direct/direct_runner.py", line 131, in run_pipeline
    return runner.run_pipeline(pipeline, options)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 190, in run_pipeline
    pipeline.to_runner_api(default_environment=self._default_environment))
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 200, in run_via_runner_api
    return self.run_stages(stage_context, stages)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 366, in run_stages
    bundle_context_manager,
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 562, in _run_stage
    bundle_manager)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 602, in _run_bundle
    data_input, data_output, input_timers, expected_timer_output)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/fn_runner.py", line 903, in process_bundle
    result_future = self._worker_handler.control_conn.push(process_bundle_req)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/portability/fn_api_runner/worker_handlers.py", line 378, in push
    response = self.worker.do_instruction(request)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 610, in do_instruction
    getattr(request, request_type), request.instruction_id)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/sdk_worker.py", line 647, in process_bundle
    bundle_processor.process_bundle(instruction_id))
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 1001, in process_bundle
    element.data)
  File "/opt/conda/lib/python3.7/site-packages/apache_beam/runners/worker/bundle_processor.py", line 229, in process_encoded
    self.output(decoded_value)
  File "apache_beam/runners/worker/operations.py", line 356, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 358, in apache_beam.runners.worker.operations.Operation.output
  File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1300, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1395, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1300, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1395, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "apache_beam/runners/worker/operations.py", line 220, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
  File "apache_beam/runners/worker/operations.py", line 717, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/worker/operations.py", line 718, in apache_beam.runners.worker.operations.DoOperation.process
  File "apache_beam/runners/common.py", line 1235, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 1315, in apache_beam.runners.common.DoFnRunner._reraise_augmented
  File "/opt/conda/lib/python3.7/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
  File "apache_beam/runners/common.py", line 1233, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 581, in apache_beam.runners.common.SimpleInvoker.invoke_process
  File "apache_beam/runners/common.py", line 1368, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 492, in _clean_content
    text = _parse_and_clean_wikicode(raw_content, parser=mwparserfromhell)
  File "/scratch/user_name/modules/datasets_modules/datasets/wikipedia/2fe8db1405aef67dff9fcc51e133e1f9c5b0106f9d9e9638188176d278fd5ff1/wikipedia.py", line 548, in _parse_and_clean_wikicode
    section_text.append(section.strip_code().strip())
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/wikicode.py", line 639, in strip_code
    stripped = node.__strip__(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 60, in __strip__
    return self.normalize()
  File "/opt/conda/lib/python3.7/site-packages/mwparserfromhell/nodes/html_entity.py", line 150, in normalize
    return chr(htmlentities.name2codepoint[self.value])
KeyError: "000nbsp [while running 'train/Clean content']"
lhoestq commented 3 years ago

Hi ! This looks related to this issue: https://github.com/huggingface/datasets/issues/1994. Basically, the parser that is used (mwparserfromhell) has some issues with some pages in es. We already reported some issues for es on their repo at https://github.com/earwig/mwparserfromhell/issues/247, but it looks like there are still a few left. It might be a good idea to open a new issue on the mwparserfromhell repo.
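
If you want to report it upstream, it can help to narrow down which part of a page actually breaks strip_code. Below is a rough sketch of a hypothetical helper that roughly mirrors the section-by-section cleaning done in wikipedia.py; you would pass it the raw wikitext of a page that fails:

import mwparserfromhell

def find_failing_sections(raw_wikitext):
    # Run strip_code() section by section and report the ones that raise.
    wikicode = mwparserfromhell.parse(raw_wikitext)
    sections = wikicode.get_sections(flat=True, include_lead=True, include_headings=True)
    for i, section in enumerate(sections):
        try:
            section.strip_code()
        except Exception as err:
            print(f'Section {i} failed with {err!r}:')
            print(str(section)[:200])  # print the beginning of the offending wikitext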

kaliaanup commented 2 years ago

Any updates on this so far?

albertvillanova commented 1 year ago

The issue:

KeyError: "000nbsp [while running 'train/Clean content']"

reported in the comments above has been fixed upstream in the mwparserfromhell library and should be available in their next release, version 0.7.

Tahlor commented 1 year ago

mwparserfromhell 0.7 has still not been released, but you might have luck with the dev version: pip install git+https://github.com/earwig/mwparserfromhell.git@0f89f44
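
After installing from git, it is worth confirming that the dev build is actually the one being imported (a tiny sketch; the exact version string depends on the commit):

import mwparserfromhell

# The dev build should report a version ahead of the latest release published on PyPI.
print(mwparserfromhell.__version__)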