RonaldVanAalst closed 3 weeks ago
I can share the logging output though:
/anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/data.py:154: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass include_groups=False to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning.
  df = df.apply(lambda x: x.sample(min(num_samples, len(x)), random_state=seed))
model_head.pkl not found in /mnt/batch/tasks/shared/LS_root/mounts/clusters/xyz-gpu/code/Users/xyz/notebooks/paraphrase-multilingual-MiniLM-L12-v2, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Map: 3%|▎ | 2/64 [00:00<00:00, 733.33 examples/s]
AttributeError Traceback (most recent call last)
Cell In[10], line 29
9 model = SetFitModel.from_pretrained(
10 "./paraphrase-multilingual-MiniLM-L12-v2", # local model
11 labels=["ok", "nok"],
(...)
18 ),
19 )
21 args = TrainingArguments(
22 batch_size=16,
23 num_epochs=4,
(...)
26 load_best_model_at_end=True,
27 )
---> 29 trainer = Trainer(
30 model=model,
31 args=args,
32 train_dataset=train_dataset,
33 eval_dataset=eval_dataset,
34 metric="accuracy",
35 #column_mapping={"text": "text", "label": "label"} # Map dataset columns to text/label expected by trainer, already ood
36 )
40 # ivm rest error invalid parameter mlflow
41 #from transformers.integrations import MLflowCallback
42 #trainer.remove_callback(MLflowCallback)
43
44 # Train and evaluate
45 trainer.train()
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/trainer.py:244, in Trainer.__init__(self, model, args, train_dataset, eval_dataset, model_init, metric, metric_kwargs, callbacks, column_mapping)
240 # Add the callback for filling the model card data with hyperparameters
241 # and evaluation results
242 self.add_callback(ModelCardCallback(self))
--> 244 self.callback_handler.on_init_end(args, self.state, self.control)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/transformers/trainer_callback.py:366, in CallbackHandler.on_init_end(self, args, state, control)
365 def on_init_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
--> 366 return self.call_event("on_init_end", args, state, control)
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/transformers/trainer_callback.py:414, in CallbackHandler.call_event(self, event, args, state, control, **kwargs)
412 def call_event(self, event, args, state, control, **kwargs):
413 for callback in self.callbacks:
--> 414 result = getattr(callback, event)(
415 args,
416 state,
417 control,
418 model=self.model,
419 tokenizer=self.tokenizer,
420 optimizer=self.optimizer,
421 lr_scheduler=self.lr_scheduler,
422 train_dataloader=self.train_dataloader,
423 eval_dataloader=self.eval_dataloader,
424 **kwargs,
425 )
426 # A Callback can skip the return of control if it doesn't change it.
427 if result is not None:
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:65, in ModelCardCallback.on_init_end(self, args, state, control, model, **kwargs)
62 model.model_card_data.set_widget_examples(dataset)
64 if self.trainer.train_dataset:
---> 65 model.model_card_data.set_train_set_metrics(self.trainer.train_dataset)
66 # Does not work for multilabel
67 try:
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:315, in SetFitModelCardData.set_train_set_metrics(self, dataset)
312 sample["word_count"] = len(sample["text"].split(" "))
313 return sample
--> 315 dataset = dataset.map(add_naive_word_count)
316 self.train_set_metrics_list = [
317 {
318 "Training set": "Word count",
(...)
322 },
323 ]
324 # E.g. if unlabeled via DistillationTrainer
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:560, in transmit_format.
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3035, in Dataset.map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
3029 if transformed_dataset is None:
3030 with hf_tqdm(
3031 unit=" examples",
3032 total=pbar_total,
3033 desc=desc or "Map",
3034 ) as pbar:
-> 3035 for rank, done, content in Dataset._map_single(**dataset_kwargs):
3036 if done:
3037 shards_done += 1
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3408, in Dataset._map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
3406 _time = time.time()
3407 for i, example in shard_iterable:
-> 3408 example = apply_function_on_filtered_inputs(example, i, offset=offset)
3409 if update_data:
3410 if i == 0:
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/datasets/arrow_dataset.py:3300, in Dataset._map_single.
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:312, in SetFitModelCardData.set_train_set_metrics.
AttributeError: 'NoneType' object has no attribute 'split'
I found that my training dataset contained a record {label:'nok',text:None}.
I still think a better error message for this situation would be a boon.
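For anyone hitting the same traceback, a quick scan before building the Trainer can locate such a record. This is only a minimal sketch: it assumes the training rows can be iterated as dictionaries with `text` and `label` keys (iterating a `datasets.Dataset` yields rows in that shape), and `find_bad_text_rows` is a hypothetical helper name, not part of setfit.

```python
from typing import Any, Dict, Iterable, List, Tuple


def find_bad_text_rows(
    rows: Iterable[Dict[str, Any]], text_column: str = "text"
) -> List[Tuple[int, Dict[str, Any]]]:
    """Return (index, row) pairs whose text is missing, not a string, or blank."""
    bad = []
    for index, row in enumerate(rows):
        value = row.get(text_column)
        if not isinstance(value, str) or not value.strip():
            bad.append((index, row))
    return bad


# Illustrative data only; the second row mirrors the record described above.
rows = [
    {"text": "all good", "label": "ok"},
    {"text": None, "label": "nok"},
    {"text": "   ", "label": "nok"},
]

bad_rows = find_bad_text_rows(rows)
bad_indices = {index for index, _ in bad_rows}

# Keep only the rows that passed the check.
clean_rows = [row for index, row in enumerate(rows) if index not in bad_indices]
```

Printing `bad_rows` shows the index and content of each offending record, so it can be fixed at the source instead of being silently dropped.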
Hi, I'm trying to run setfit with a local model:
model = SetFitModel.from_pretrained(
    "./paraphrase-multilingual-MiniLM-L12-v2", # local model
    labels=["ok", "nok"]
I get the error mentioned in the title, and apparently that happens in
File /anaconda/envs/azureml_py310_sdkv2/lib/python3.10/site-packages/setfit/model_card.py:312, in SetFitModelCardData.set_train_set_metrics.<locals>.add_naive_word_count(sample)
311 def add_naive_word_count(sample: Dict[str, Any]) -> Dict[str, Any]:
--> 312 sample["word_count"] = len(sample["text"].split(" "))
313 return sample
Since I do not care about the model card (yet), is there any way to disable this functionality wholesale? I lack the capability to get to the bottom of this on my own. I suspect some training sample is empty (or contains no words), but I could not find it.
This runs on Azure, Python 3.10.
%pip uninstall -y huggingface_hub
%pip uninstall -y transformers
%pip uninstall -y setfit
%pip install huggingface_hub==0.23.5
older version transformers b/c eval_strategy not known : https://github.com/huggingface/setfit/issues/528
%pip install setfit==1.0.3 transformers==4.39.0
%pip install setfit==1.0.2 transformers==4.39.0 # no luck either
The strange thing is that the example code (https://github.com/huggingface/setfit) does pass the point where my code crashes (the Map: line completes for the example code):
Map: 3%|▎ | 2/64 [00:00<00:00, 733.33 examples/s]
All in all, I think an extra check for empty or None sample['text'] values in SetFitModelCardData.set_train_set_metrics.<locals>.add_naive_word_count (and a decent error message for that case) would be of value.
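As a rough illustration of what such a check could look like, here is a minimal sketch. It is not setfit's actual code: `add_naive_word_count_checked` is a hypothetical name, and the wording of the error message is an assumption. Only the word-count line is taken from the traceback.

```python
from typing import Any, Dict


def add_naive_word_count_checked(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Count words as in the traceback, but fail loudly on non-string text."""
    text = sample.get("text")
    if not isinstance(text, str):
        # Hypothetical message: name the offending value so the record can be found.
        raise ValueError(
            f"Expected sample['text'] to be a string, got {type(text).__name__} "
            f"(label={sample.get('label')!r}). Remove or fix records with "
            "missing text before training."
        )
    sample["word_count"] = len(text.split(" "))
    return sample


# A well-formed record passes through and gains a word count.
counted = add_naive_word_count_checked({"text": "hello world", "label": "ok"})
```

With a record like {label:'nok',text:None}, this raises a ValueError that names the type and the label, rather than the bare AttributeError on 'split'.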
Unfortunately, I am not at liberty to share the code and input data.