huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

I'm trying to fine-tune the openai/whisper model from Hugging Face using Jupyter Notebook and I keep getting this error #6276

Open valaofficial opened 11 months ago

valaofficial commented 11 months ago

Describe the bug

I'm trying to fine-tune the openai/whisper model from Hugging Face using Jupyter Notebook and I keep getting this error. I'm following the steps in this blog post:

https://huggingface.co/blog/fine-tune-whisper

I tried Google Colab and it works there, but because I'm on the free tier the training doesn't complete.

The error occurs in Jupyter Notebook when I run this line:

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
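
For context, the prepare_dataset function in that blog post reads notebook-level globals such as feature_extractor and tokenizer; paraphrasing the blog (the exact cell may differ), it looks roughly like this:

def prepare_dataset(batch):
    # load the audio, already resampled to 16 kHz via cast_column
    audio = batch["audio"]
    # compute log-Mel input features; feature_extractor is a notebook global
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode the transcription to label ids; tokenizer is also a notebook global
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch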

Here is the error message:

Map (num_proc=4): 0% 0/2506 [00:52<?, ? examples/s]

The above exception was the direct cause of the following exception:

NameError                                 Traceback (most recent call last)
Cell In[19], line 1
----> 1 common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)

File ~\anaconda\Lib\site-packages\datasets\dataset_dict.py:853, in DatasetDict.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
    850 if cache_file_names is None:
    851     cache_file_names = {k: None for k in self}
    852 return DatasetDict(
--> 853     {
    854         k: dataset.map(
    855             function=function,
    856             with_indices=with_indices,
    857             with_rank=with_rank,
    858             input_columns=input_columns,
    859             batched=batched,
    860             batch_size=batch_size,
    861             drop_last_batch=drop_last_batch,
    862             remove_columns=remove_columns,
    863             keep_in_memory=keep_in_memory,
    864             load_from_cache_file=load_from_cache_file,
    865             cache_file_name=cache_file_names[k],
    866             writer_batch_size=writer_batch_size,
    867             features=features,
    868             disable_nullable=disable_nullable,
    869             fn_kwargs=fn_kwargs,
    870             num_proc=num_proc,
    871             desc=desc,
    872         )
    873         for k, dataset in self.items()
    874     }
    875 )

File ~\anaconda\Lib\site-packages\datasets\dataset_dict.py:854, in <dictcomp>(.0)
    850 if cache_file_names is None:
    851     cache_file_names = {k: None for k in self}
    852 return DatasetDict(
    853     {
--> 854         k: dataset.map(
    855             function=function,
    856             with_indices=with_indices,
    857             with_rank=with_rank,
    858             input_columns=input_columns,
    859             batched=batched,
    860             batch_size=batch_size,
    861             drop_last_batch=drop_last_batch,
    862             remove_columns=remove_columns,
    863             keep_in_memory=keep_in_memory,
    864             load_from_cache_file=load_from_cache_file,
    865             cache_file_name=cache_file_names[k],
    866             writer_batch_size=writer_batch_size,
    867             features=features,
    868             disable_nullable=disable_nullable,
    869             fn_kwargs=fn_kwargs,
    870             num_proc=num_proc,
    871             desc=desc,
    872         )
    873         for k, dataset in self.items()
    874     }
    875 )

File ~\anaconda\Lib\site-packages\datasets\arrow_dataset.py:592, in transmit_tasks.<locals>.wrapper(*args, **kwargs)
    590 self: "Dataset" = kwargs.pop("self")
    591 # apply actual function
--> 592 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    593 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    594 for dataset in datasets:
    595     # Remove task templates if a column mapping of the template is no longer valid

File ~\anaconda\Lib\site-packages\datasets\arrow_dataset.py:557, in transmit_format.<locals>.wrapper(*args, **kwargs)
    550 self_format = {
    551     "type": self._format_type,
    552     "format_kwargs": self._format_kwargs,
    553     "columns": self._format_columns,
    554     "output_all_columns": self._output_all_columns,
    555 }
    556 # apply actual function
--> 557 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    558 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    559 # re-apply format to the output

File ~\anaconda\Lib\site-packages\datasets\arrow_dataset.py:3189, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3182 logger.info(f"Spawning {num_proc} processes")
   3183 with logging.tqdm(
   3184     disable=not logging.is_progress_bar_enabled(),
   3185     unit=" examples",
   3186     total=pbar_total,
   3187     desc=(desc or "Map") + f" (num_proc={num_proc})",
   3188 ) as pbar:
-> 3189     for rank, done, content in iflatmap_unordered(
   3190         pool, Dataset._map_single, kwargs_iterable=kwargs_per_job
   3191     ):
   3192         if done:
   3193             shards_done += 1

File ~\anaconda\Lib\site-packages\datasets\utils\py_utils.py:1394, in iflatmap_unordered(pool, func, kwargs_iterable)
   1391 finally:
   1392     if not pool_changed:
   1393         # we get the result in case there's an error to raise
-> 1394         [async_result.get(timeout=0.05) for async_result in async_results]

File ~\anaconda\Lib\site-packages\datasets\utils\py_utils.py:1394, in <listcomp>(.0)
   1391 finally:
   1392     if not pool_changed:
   1393         # we get the result in case there's an error to raise
-> 1394         [async_result.get(timeout=0.05) for async_result in async_results]

File ~\anaconda\Lib\site-packages\multiprocess\pool.py:774, in ApplyResult.get(self, timeout)
    772     return self._value
    773 else:
--> 774     raise self._value

NameError: name 'feature_extractor' is not defined

Steps to reproduce the bug

  1. Follow the steps in this blog post: https://huggingface.co/blog/fine-tune-whisper

  2. Run this line of code: common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)

  3. I'm using Jupyter Notebook from Anaconda

Expected behavior

No error message

Environment info

datasets version: 2.8.0
Python version: 3.11
Platform: Windows 10

mariosasko commented 11 months ago

Since you are using Windows, moving the map call inside an if __name__ == "__main__" block may fix the issue:

if __name__ == "__main__":
    common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)

Otherwise, the only solution is to set num_proc=1.
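
For anyone landing here: on Windows (and in notebooks generally), multiprocessing starts workers with the spawn method, so each worker re-imports the main module instead of inheriting the notebook's globals; anything defined only in notebook cells, like feature_extractor, is therefore missing in the workers. A minimal sketch of a Windows-safe standalone script, assuming the whisper-small checkpoint and the Hindi Common Voice setup from the blog post:

from datasets import load_dataset, DatasetDict, Audio
from transformers import WhisperFeatureExtractor, WhisperTokenizer

# module-level definitions are re-created when each spawned worker imports this file
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")

def prepare_dataset(batch):
    audio = batch["audio"]
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

if __name__ == "__main__":  # guard so spawned workers don't re-run the pipeline
    common_voice = DatasetDict()
    common_voice["train"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train+validation")
    common_voice["test"] = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="test")
    common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
    common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)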

valaofficial commented 11 months ago

> Since you are using Windows, moving the map call inside an if __name__ == "__main__" block may fix the issue:
>
>     if __name__ == "__main__":
>         common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=4)
>
> Otherwise, the only solution is to set num_proc=1.

Thank you very much for the response. I eventually tried setting num_proc=1, and now the Jupyter Notebook kernel keeps dying after running the command. What do you think the issue could be? Could it be that my system is not capable of running the command? (I'm using a Lenovo ThinkPad T440 with no GPU.)
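
A kernel that dies at this step on a machine with limited RAM usually points to memory exhaustion while map computes and buffers the log-Mel features; this step runs on the CPU either way, so a GPU would not change this particular failure. One thing worth trying, a sketch rather than a verified fix, is lowering map's writer_batch_size so processed examples are flushed to the on-disk Arrow cache more often:

common_voice = common_voice.map(
    prepare_dataset,
    remove_columns=common_voice.column_names["train"],
    num_proc=1,
    writer_batch_size=100,  # flush to disk more often; the datasets default is 1000
)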

Ochirsuh commented 10 months ago

Firstly, you didn't define the feature_extractor variable. Secondly, this is a large model, so you should use a proper GPU; otherwise your machine's CPU will be overloaded and you won't be able to get anything done.
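
On the second point, a quick sanity check before launching training, assuming the PyTorch backend used in the blog post, is:

import torch

# True only if a CUDA-capable GPU and working drivers are visible to PyTorch
print(torch.cuda.is_available())

On a CPU-only machine like the one described above this prints False, and Whisper fine-tuning will be impractically slow even if it doesn't crash.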