Closed sarrabenyahia closed 4 months ago
How did you construct `train_dataset` and `test_dataset`? Do you have all the necessary pieces, i.e.
```python
GLiNERDataset(
    data_gliner,
    model_config,
    tokenizer,
    words_splitter,
    data_processor,
)
```
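For reference, a rough sketch of how those pieces can be assembled, loosely following the repo's train.py (the import path and the `words_splitter` setup are assumptions and may differ between gliner versions):

```python
from gliner import GLiNER
from gliner.data_processing import GLiNERDataset, WordsSplitter  # import path assumed from train.py

# Reuse the backbone's config, tokenizer and data processor to build the datasets.
model = GLiNER.from_pretrained("almanach/camembert-bio-gliner-v0.1")
tokenizer = model.data_processor.transformer_tokenizer
words_splitter = WordsSplitter(model.config.words_splitter_type)

# train_data / test_data are the raw lists of records loaded from data.json.
train_dataset = GLiNERDataset(train_data, model.config, tokenizer, words_splitter, model.data_processor)
test_dataset = GLiNERDataset(test_data, model.config, tokenizer, words_splitter, model.data_processor)
```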
Also I didn't need to use model.to("mps")
for training on GPU to work. I haven't finetuned that specific model, but finetuning "knowledgator/gliner-multitask-large-v0.5"
works for me.
Here is what I did for the train and test data:
train_path = "data.json"
with open(train_path, "r") as f: data = json.load(f)
print('Dataset size:', len(data))
random.shuffle(data) print('Dataset is shuffled...')
train_dataset = data[:int(len(data)0.9)] test_dataset = data[int(len(data)0.9):]
print('Dataset is splitted...')
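Each record in data.json follows the usual GLiNER training format, i.e. pre-tokenized text plus token-level spans. A made-up example for illustration:

```python
# One illustrative record from data.json (GLiNER training format; values invented):
record = {
    "tokenized_text": ["Le", "patient", "présente", "une", "biopsie", "transbronchique", "."],
    "ner": [[4, 5, "procédure"]],  # [start_token, end_token, label] spans over tokenized_text
}
```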
and
```python
data_collator = DataCollator(model.config, data_processor=model.data_processor, prepare_labels=True)
```
@sarrabenyahia I followed the code for train.py here, which says to cast the data to `GLiNERDataset`.
Thank you for your help. I tried with the code you followed. I changed only this line: `device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')` to `device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')`.
I still get the same error:
```
(venv) sarrabenyahia@MacBook-Air-de-Sarra demo % python train_demo.py --config ./config.yaml
Dataset size: 19635
Dataset is shuffled...
Dataset is splitted...
Collecting all entities...
100%|████████████████████████████████████████████████████████████████████████████████████| 17671/17671 [00:00<00:00, 419869.97it/s]
Total number of entity classes: 5606
Collecting all entities...
100%|██████████████████████████████████████████████████████████████████████████████████████| 1964/1964 [00:00<00:00, 617234.61it/s]
Total number of entity classes: 1357
Some weights of CamembertModel were not initialized from the model checkpoint at almanach/camembert-bio-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/training_args.py:1494: FutureWarning: `evaluation_strategy` is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use `eval_strategy` instead
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
  0%|          | 0/30000 [00:00<?, ?it/s]Skipping iteration due to error: Placeholder storage has not been allocated on MPS device!
Traceback (most recent call last):
  File "/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/src/finetuning/demo/train_demo.py", line 97, in
```
@sarrabenyahia can you remove this line:
`device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')`
I found that setting the device was unnecessary.
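(If you do want to keep an explicit device, a slightly safer pattern checks that MPS is actually available instead of assuming it; this is a generic sketch, not part of the original script:)

```python
import torch

# Prefer CUDA, then Apple-Silicon MPS, then fall back to CPU.
if torch.cuda.is_available():
    device = torch.device("cuda:0")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```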
I tried it just now; it still gives me the same error: `ValueError: Calculated loss must be on the original device: cpu but device in use is mps:0`
@sarrabenyahia I just ran into this issue on transformers==4.42. I had to downgrade to 4.41 to get this working on MPS.
Thank you! It works when downgrading the package.
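For anyone else hitting this, the downgrade is just a pinned install, for example:

```sh
pip install "transformers<4.42"
```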
I am trying to fine-tune the almanach/camembert-bio-gliner-v0.1 model (though at this point I have also tried with the model from the example).
I want to change the device to mps so I can use my M2 chip for fine-tuning. I am using torch 2.3.1 and Python 3.11.8. I changed these steps of the original code:
```python
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')

model.to(device)
print(f"done on {device}")

num_steps = 500
batch_size = 8
data_size = len(train_dataset)
num_batches = data_size // batch_size
num_epochs = max(1, num_steps // num_batches)

print(f"Model device before Trainer: {next(model.parameters()).device}")

print(f"Model device after TrainerArguments initialization: {next(model.parameters()).device}")

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=test_dataset,
    tokenizer=model.data_processor.transformer_tokenizer, data_collator=data_collator,
)

print(f"Model device after Trainer initialization: {next(model.parameters()).device}")

trainer.train()
```
And here is the final print:
```
Model device before Trainer: mps:0
Model device after TrainerArguments initialization: mps:0
  0%|          | 0/2209 [02:40<?, ?it/s]
Model device after Trainer initialization: cpu
  0%|          | 0/2209 [00:00<?, ?it/s]Skipping iteration due to error: Placeholder storage has not been allocated on MPS device!
```
And the error:
```
ValueError                                Traceback (most recent call last)
Cell In[21], line 44
     33 trainer = Trainer(
     34     model=model,
     35     args=training_args,
   (...)
     39     data_collator=data_collator,
     40 )
     42 print(f"Model device after Trainer initialization: {next(model.parameters()).device}")
---> 44 trainer.train()

File ~/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1930     hf_hub_utils.enable_progress_bars()
   1931 else:
-> 1932     return inner_training_loop(
   1933         args=args,
   1934         resume_from_checkpoint=resume_from_checkpoint,
   1935         trial=trial,
   1936         ignore_keys_for_eval=ignore_keys_for_eval,
   1937     )

File ~/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py:2279, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2277 else:
   2278     if tr_loss.device != tr_loss_step.device:
   ...
   2281     )
   2282 tr_loss += tr_loss_step
   2284 self.current_flos += float(self.floating_point_ops(inputs))

ValueError: Calculated loss must be on the original device: cpu but device in use is mps:0
```
I can't pinpoint the origin of the problem, even after trying different TrainingArguments parameters (no_cuda, use_mps_device=True, etc.).
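The TrainingArguments themselves are not reproduced above; for context, here is a minimal sketch of the kind of configuration involved, using plain transformers.TrainingArguments with placeholder values (the GLiNER example uses its own subclass with extra fields):

```python
from transformers import TrainingArguments

num_steps = 500
batch_size = 8

# Placeholder values only; on transformers < 4.42 the Trainer picks up MPS
# on its own, so no explicit device flag is needed here.
training_args = TrainingArguments(
    output_dir="models",
    learning_rate=5e-6,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    max_steps=num_steps,
    evaluation_strategy="steps",  # source of the FutureWarning in the log above
    eval_steps=100,
    save_steps=100,
    report_to="none",
)
```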