When trying to use the wikipedia dataset I get the following error:
```
Traceback (most recent call last):
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 188, in <module>
    main()
  File "C:\repos\pmi_masking\create_pmi_masking_vocab.py", line 184, in main
    run_pipeline(**args.__dict__)
  File "C:\repos\pmi_masking\src\db_implementation\run_pipeline.py", line 66, in run_pipeline
    dataset = load_and_tokenize_dataset(dataset_name=dataset_name, tokenizer_name=tokenizer_name,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 81, in load_and_tokenize_dataset
    dataset = dataset_name_to_load_function[dataset_name]()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 29, in load_bookcorpus_and_wikipedia_dataset
    wiki = load_wikipedia_dataset()
    ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\src\load_dataset.py", line 23, in load_wikipedia_dataset
    return load_dataset(dataset_path, configuration_name, split=split)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\load.py", line 1791, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 902, in download_and_prepare
    self._save_info()
  File "C:\repos\pmi_masking\venv\Lib\site-packages\datasets\builder.py", line 2039, in _save_info
    import apache_beam as beam
ModuleNotFoundError: No module named 'apache_beam'
```
BUT, installing apache_beam could be problematic. Maybe I can use this dataset without it? Or use a different version of it? Or not use it at all? I need to figure this out.
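One thing worth trying (just a sketch, not verified): load a pre-processed configuration, or the parquet-converted mirror on the Hub, which shouldn't need a Beam runner. The dataset/config names below are assumptions, not necessarily what `src/load_dataset.py` currently uses.

```python
from datasets import load_dataset

# Option 1: a pre-processed "wikipedia" configuration. Depending on the
# datasets version, the builder may still try to import apache_beam
# (which is exactly the error above).
wiki = load_dataset("wikipedia", "20220301.en", split="train")

# Option 2 (assumption): the parquet-converted mirror on the Hub, which is
# a plain data repo and should not need apache_beam at all.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train")
```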
Before that, when I ran `.map` with multiple processes, I would get an error since `apache_beam` and `multiprocess` have conflicting versions of `dill`.
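For reference, the multi-process `.map` call that hit the `dill` conflict looked roughly like this (a sketch; the tokenizer name, column name, and function are assumptions, not the exact code in `src/load_dataset.py`):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed tokenizer
dataset = load_dataset("bookcorpus", split="train")

def tokenize(batch):
    # "text" is the assumed column name
    return tokenizer(batch["text"], truncation=False)

# num_proc > 1 makes datasets serialize `tokenize` with dill (via multiprocess);
# apache_beam pins a different dill version, hence the conflict.
tokenized = dataset.map(tokenize, batched=True, num_proc=4)
```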
***Solution to try: use Python 3.9 instead of Python 3.10 / 3.11.
UPDATE: OK, so downgrading to Python 3.9 seems to work, but I still got a memory allocation error when tokenizing the dataset, so I reduced the tokenizer batch size to 10_000 to see if it works better that way.
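Concretely, the change is just passing an explicit `batch_size` to `.map` (a sketch, reusing the hypothetical `tokenize` function and `dataset` from the sketch above):

```python
# Tokenize in smaller chunks (10_000 rows per batch) to keep memory usage down.
tokenized = dataset.map(
    tokenize,
    batched=True,
    batch_size=10_000,
    num_proc=4,
)
```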
I also changed the `datasets` line in `requirements.txt` to `datasets[apache-beam]`.
Reopened 02.07.2023
I get this error:
```
Traceback (most recent call last):
  File "", line 1, in <module>
ModuleNotFoundError: No module named 'multiprocess'
```
And I think it is related to this issue: https://github.com/uqfoundation/multiprocess/issues/61
So I'm kinda giving up on wikipedia for now. I think this could be a Windows-specific bug, so I hope running this on Linux will work fine.
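If it really is Windows-specific, one common culprit (my assumption, not confirmed from the linked issue) is that Windows spawns worker processes by re-importing the entry script, so any multi-process `.map` call has to run under a `__main__` guard, roughly like this:

```python
from datasets import load_dataset

def tokenize(batch):
    # placeholder; the real tokenization lives in src/load_dataset.py
    return batch

def main():
    dataset = load_dataset("bookcorpus", split="train")
    # On Windows the spawned workers re-import this file, so the num_proc > 1
    # call must not execute at import time.
    dataset.map(tokenize, batched=True, num_proc=4)

if __name__ == "__main__":
    main()
```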