RonaldVanAalst commented 1 month ago

Hi, I'm trying to run SetFit with a local model:

model = SetFitModel.from_pretrained(
    "./paraphrase-multilingual-MiniLM-L12-v2",  # local model
    labels=["ok", "nok"]

I get the error mentioned in the title, and apparently that happens in:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:312, in SetFitModelCardData.set_train_set_metrics.<locals>.add_naive_word_count(sample)
    311 def add_naive_word_count(sample: Dict[str, Any]) -> Dict[str, Any]:
--> 312     sample["word_count"] = len(sample["text"].split(" "))
    313     return sample

Since I do not care about the model card (yet), is there any way to disable this functionality wholesale? I lack the capability to get to the bottom of this on my own. I suspect some training sample is empty (or contains no words), but I could not find it.

This runs on Azure, Python 3.10.

%pip uninstall -y huggingface_hub
%pip uninstall -y transformers
%pip uninstall -y setfit
%pip install huggingface_hub==0.23.5

An older version of transformers, because eval_strategy is not known: https://github.com/huggingface/setfit/issues/528

%pip install setfit==1.0.3 transformers==4.39.0

%pip install setfit==1.0.2 transformers==4.39.0 # no luck either

The strange thing is that the example code (https://github.com/huggingface/setfit) does pass the point where my code crashes (the Map: line completes for the example code):

Map: 3%|▎ | 2/64 [00:00<00:00, 733.33 examples/s]

All in all, I think an extra check for empty/None sample['text'] values in SetFitModelCardData.set_train_set_metrics.<locals>.add_naive_word_count (and a decent error message for that case) would be of value.
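The kind of check suggested here could look roughly like the sketch below. It is not SetFit's actual code: `add_naive_word_count_checked` is a hypothetical name and the wording of the error message is an assumption. It only mirrors the `len(sample["text"].split(" "))` line from the traceback and puts a type check in front of it.

```python
from typing import Any, Dict


def add_naive_word_count_checked(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical guarded variant of the word-count helper seen in the traceback."""
    text = sample.get("text")
    if not isinstance(text, str):
        # Fail early with a message that shows the offending record,
        # instead of a bare AttributeError raised by None.split.
        raise ValueError(
            f"Expected sample['text'] to be a str, got {type(text).__name__}: {sample!r}"
        )
    sample["word_count"] = len(text.split(" "))
    return sample
```

Note that an empty string would not trigger the original error: `"".split(" ")` returns `[""]`, so it is counted as one word. Only a non-string value such as `None` fails on `.split`.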

Unfortunately, I am not at liberty to share the code and input data.

RonaldVanAalst commented 1 month ago

I can share the logging output though:

/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/data.py:154: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass include_groups=False to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  df = df.apply(lambda x: x.sample(min(num_samples, len(x)), random_state=seed))
model_head.pkl not found in /mnt/batch/tasks/shared/LS_root/mounts/clusters/xyz-gpu/code/Users/xyz/notebooks/paraphrase-multilingual-MiniLM-L12-v2, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Map: 3%|▎ | 2/64 [00:00<00:00, 733.33 examples/s]

AttributeError                            Traceback (most recent call last)
Cell In[10], line 29
      9 model = SetFitModel.from_pretrained(
     10     "./paraphrase-multilingual-MiniLM-L12-v2",  # local model
     11     labels=["ok", "nok"],
   (...)
     18     ),
     19 )
     21 args = TrainingArguments(
     22     batch_size=16,
     23     num_epochs=4,
   (...)
     26     load_best_model_at_end=True,
     27 )
---> 29 trainer = Trainer(
     30     model=model,
     31     args=args,
     32     train_dataset=train_dataset,
     33     eval_dataset=eval_dataset,
     34     metric="accuracy",
     35     #column_mapping={"text": "text", "label": "label"} # Map dataset columns to text/label expected by trainer, already ood
     36 )
     40 # ivm rest error invalid parameter mlflow
     41 #from transformers.integrations import MLflowCallback
     42 #trainer.remove_callback(MLflowCallback)
     43
     44 # Train and evaluate
     45 trainer.train()

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/trainer.py:244, in Trainer.__init__(self, model, args, train_dataset, eval_dataset, model_init, metric, metric_kwargs, callbacks, column_mapping)
    240 # Add the callback for filling the model card data with hyperparameters
    241 # and evaluation results
    242 self.add_callback(ModelCardCallback(self))
--> 244 self.callback_handler.on_init_end(args, self.state, self.control)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/transformers/trainer_callback.py:366, in CallbackHandler.on_init_end(self, args, state, control)
    365 def on_init_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
--> 366     return self.call_event("on_init_end", args, state, control)

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/transformers/trainer_callback.py:414, in CallbackHandler.call_event(self, event, args, state, control, **kwargs)
    412 def call_event(self, event, args, state, control, **kwargs):
    413     for callback in self.callbacks:
--> 414         result = getattr(callback, event)(
    415             args,
    416             state,
    417             control,
    418             model=self.model,
    419             tokenizer=self.tokenizer,
    420             optimizer=self.optimizer,
    421             lr_scheduler=self.lr_scheduler,
    422             train_dataloader=self.train_dataloader,
    423             eval_dataloader=self.eval_dataloader,
    424             **kwargs,
    425         )
    426         # A Callback can skip the return of control if it doesn't change it.
    427         if result is not None:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:65, in ModelCardCallback.on_init_end(self, args, state, control, model, **kwargs)
     62     model.model_card_data.set_widget_examples(dataset)
     64 if self.trainer.train_dataset:
---> 65     model.model_card_data.set_train_set_metrics(self.trainer.train_dataset)
     66     # Does not work for multilabel
     67     try:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:315, in SetFitModelCardData.set_train_set_metrics(self, dataset)
    312     sample["word_count"] = len(sample["text"].split(" "))
    313     return sample
--> 315 dataset = dataset.map(add_naive_word_count)
    316 self.train_set_metrics_list = [
    317     {
    318         "Training set": "Word count",
   (...)
    322     },
    323 ]
    324 # E.g. if unlabeled via DistillationTrainer

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:560, in transmit_format.<locals>.wrapper(*args, **kwargs)
    553 self_format = {
    554     "type": self._format_type,
    555     "format_kwargs": self._format_kwargs,
    556     "columns": self._format_columns,
    557     "output_all_columns": self._output_all_columns,
    558 }
    559 # apply actual function
--> 560 out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    561 datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    562 # re-apply format to the output

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3035, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3029 if transformed_dataset is None:
   3030     with hf_tqdm(
   3031         unit=" examples",
   3032         total=pbar_total,
   3033         desc=desc or "Map",
   3034     ) as pbar:
-> 3035         for rank, done, content in Dataset._map_single(**dataset_kwargs):
   3036             if done:
   3037                 shards_done += 1

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3408, in Dataset._map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
   3406 _time = time.time()
   3407 for i, example in shard_iterable:
-> 3408     example = apply_function_on_filtered_inputs(example, i, offset=offset)
   3409     if update_data:
   3410         if i == 0:

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3300, in Dataset._map_single.<locals>.apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_examples, offset)
   3298 if with_rank:
   3299     additional_args += (rank,)
-> 3300 processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
   3301 if isinstance(processed_inputs, LazyDict):
   3302     processed_inputs = {
   3303         k: v for k, v in processed_inputs.data.items() if k not in processed_inputs.keys_to_format
   3304     }

File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:312, in SetFitModelCardData.set_train_set_metrics.<locals>.add_naive_word_count(sample)
    311 def add_naive_word_count(sample: Dict[str, Any]) -> Dict[str, Any]:
--> 312     sample["word_count"] = len(sample["text"].split(" "))
    313     return sample

AttributeError: 'NoneType' object has no attribute 'split'

RonaldVanAalst commented 3 weeks ago

I found that my training dataset contained a record {label:'nok',text:None}.
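For anyone hitting the same error, a record like this can be located before the `Trainer` is constructed. The sketch below is plain Python over dict-like rows. `has_usable_text` and `find_bad_text_rows` are hypothetical helper names, and the column name `text` is taken from the traceback. Iterating a `datasets.Dataset` yields dicts, so the same helper should also work on `train_dataset`, though that is an assumption and was not tested here.

```python
from typing import Any, Iterable, List, Mapping, Tuple


def has_usable_text(sample: Mapping[str, Any]) -> bool:
    """True if the row holds a non-blank string in its 'text' column."""
    text = sample.get("text")
    return isinstance(text, str) and text.strip() != ""


def find_bad_text_rows(
    rows: Iterable[Mapping[str, Any]],
) -> List[Tuple[int, Mapping[str, Any]]]:
    """Return (index, row) pairs whose 'text' is missing, None, or blank."""
    return [(i, row) for i, row in enumerate(rows) if not has_usable_text(row)]


# Illustrative rows only; not the real training data.
rows = [
    {"label": "ok", "text": "all fine"},
    {"label": "nok", "text": None},  # the kind of record reported above
    {"label": "ok", "text": "   "},
]
bad = find_bad_text_rows(rows)
print(bad)
```

With a `datasets.Dataset`, passing the same predicate to `train_dataset.filter(has_usable_text)` should drop such rows before training.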

I still think a better error message for this situation would be a boon.

huggingface / setfit

'NoneType' object has no attribute 'split' #566
