shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.
MIT License

deal with wikipedia bug #29

Open shaigue opened 1 year ago

shaigue commented 1 year ago

When trying to use the Wikipedia dataset I get the following error:

Traceback (most recent call last):
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 188, in <module>
    main()
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 184, in main
    run_pipeline(**args.__dict__)
  File "C:\repos\pmi_masking\src\db_implementation\run_pipeline.py", line 66, in run_pipeline
    dataset = load_and_tokenize_dataset(dataset_name=dataset_name, tokenizer_name=tokenizer_name,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 81, in load_and_tokenize_dataset
    dataset = dataset_name_to_load_function[dataset_name]()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 29, in load_bookcorpus_and_wikipedia_dataset
    wiki = load_wikipedia_dataset()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 23, in load_wikipedia_dataset
    return load_dataset(dataset_path, configuration_name, split=split)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 902, in download_and_prepare
    self._save_info()
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 2039, in _save_info
    import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'

BUT, installing apache_beam could be problematic. Maybe I can use this dataset without it, use a different version, or not use it at all? I need to figure this out.
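For context, the load path in question is roughly the following sketch; only the file and function names come from the traceback, while the configuration name and split are assumptions:

    # src/load_dataset.py (sketch; configuration/split values are assumptions)
    from datasets import load_dataset

    def load_wikipedia_dataset():
        # The "wikipedia" builder in `datasets` is Beam-based, so the library
        # imports apache_beam when saving the dataset info, which is what
        # raises the ModuleNotFoundError in the traceback above.
        return load_dataset('wikipedia', '20220301.en', split='train')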


Before that, when I ran .map with multiple processes, I would get an error because apache_beam and multiprocess require conflicting versions of dill.
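For reference, the failing call looked roughly like this (the function name and num_proc value are hypothetical):

    # Multi-process tokenization sketch: `datasets` pickles the function via
    # the multiprocess package, whose dill requirement conflicts with the
    # dill version pinned by apache_beam.
    tokenized = dataset.map(tokenize_function, batched=True, num_proc=4)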

**Solution to try:** use Python 3.9 instead of Python 3.10 / 3.11.

UPDATE: OK, so downgrading to Python 3.9 seems to work, but I still got a memory allocation error when tokenizing the dataset, so I reduced the tokenizer batch size to 10_000 to see if it works better that way.
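Concretely, the change amounts to the batch_size argument of .map; the surrounding call is a sketch, and only the 10_000 value comes from the note above:

    tokenized = dataset.map(
        tokenize_function,   # hypothetical tokenization callback
        batched=True,
        batch_size=10_000,   # reduced from a larger value to avoid the
                             # memory allocation error during tokenization
    )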

I also changed the datasets line in requirements.txt to datasets[apache-beam].
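That is, the relevant requirements.txt line now reads (any version pin omitted):

    datasets[apache-beam]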


Reopened 02.07.2023

I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'multiprocess'

I think it is related to this issue: https://github.com/uqfoundation/multiprocess/issues/61

So I'm giving up on Wikipedia for now. I think this could be a Windows-specific bug, so I hope running this on Linux will work alright.

shaigue commented 1 year ago

I see two possible solutions:

  1. Maybe this bug only happens on Windows, so running on Linux should be OK.
  2. Simply use a dataset of a similar size for testing, without having to deal with apache-beam.

In addition, I want to add a test that uses multiprocessing (n_workers >= 2) to catch this.
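A minimal sketch of such a test, assuming a toy in-memory dataset (the file, test, and column names are hypothetical):

    # test_multiprocessing_map.py (sketch)
    from datasets import Dataset

    def test_map_with_multiple_workers():
        # Toy dataset; the real test would run the pipeline's tokenizer.
        dataset = Dataset.from_dict({'text': ['hello world'] * 100})
        mapped = dataset.map(
            lambda batch: {'n_chars': [len(t) for t in batch['text']]},
            batched=True,
            num_proc=2,  # >= 2 workers exercises the multiprocess/dill path
        )
        assert 'n_chars' in mapped.column_names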