urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Fine tuning on mps #158

Closed: sarrabenyahia closed this issue 1 month ago

sarrabenyahia commented 1 month ago

I am trying to fine-tune the almanach/camembert-bio-gliner-v0.1 model (though at this point I have also tried with the model from the example).

I want to switch the device to mps so fine-tuning runs on my M2 chip. I am using torch 2.3.1 and Python 3.11.8. I changed these steps of the original code:

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')
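
For reference, torch exposes an explicit MPS availability check (torch.backends.mps.is_available(), present since torch 1.12), so a slightly more defensive sketch of that line would be:

import torch

# Prefer CUDA, then MPS, then fall back to CPU instead of assuming MPS exists.
if torch.cuda.is_available():
    device = torch.device('cuda:0')
elif torch.backends.mps.is_available():
    device = torch.device('mps')
else:
    device = torch.device('cpu')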

model.to(device)
print(f"done on {device}")

num_steps = 500
batch_size = 8
data_size = len(train_dataset)
num_batches = data_size // batch_size
num_epochs = max(1, num_steps // num_batches)
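
For scale, plugging in the training-set size that appears later in this thread (17671 examples, the 90% split of a 19635-example dataset):

num_batches = 17671 // 8           # 2208
num_epochs = max(1, 500 // 2208)   # 500 // 2208 == 0, so num_epochs == 1
# one epoch is ceil(17671 / 8) == 2209 optimizer steps, which matches
# the "0/2209" progress bar in the output below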

print(f"Model device before Trainer: {next(model.parameters()).device}")

training_args = TrainingArguments(
    output_dir="models",
    learning_rate=5e-6,
    weight_decay=0.01,
    others_lr=1e-5,
    others_weight_decay=0.01,
    lr_scheduler_type="linear",  # cosine
    warmup_ratio=0.1,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=num_epochs,
    eval_strategy="steps",
    save_steps=100,
    save_total_limit=10,
    dataloader_num_workers=0,
    # use_cpu=False,
    report_to="none",
)
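
Note that others_lr and others_weight_decay are not standard transformers arguments; in the repo's fine-tuning example they come from GLiNER's own subclasses, so (assuming you follow that example) the matching imports would be:

# GLiNER subclasses the HF Trainer/TrainingArguments to add
# others_lr / others_weight_decay for the non-encoder parameters.
from gliner.training import Trainer, TrainingArguments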

print(f"Model device after TrainerArguments initialization: {next(model.parameters()).device}")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    tokenizer=model.data_processor.transformer_tokenizer,
    data_collator=data_collator,
)

print(f"Model device after Trainer initialization: {next(model.parameters()).device}")

trainer.train()

And here is the final printed output:

Model device before Trainer: mps:0
Model device after TrainingArguments initialization: mps:0
  0%|          | 0/2209 [02:40<?, ?it/s]
Model device after Trainer initialization: cpu

  0%|          | 0/2209 [00:00<?, ?it/s]
Skipping iteration due to error: Placeholder storage has not been allocated on MPS device!

And the error:

ValueError                                Traceback (most recent call last)
Cell In[21], line 44
     33 trainer = Trainer(
     34     model=model,
     35     args=training_args,
   (...)
     39     data_collator=data_collator,
     40 )
     42 print(f"Model device after Trainer initialization: {next(model.parameters()).device}")
---> 44 trainer.train()

File ~/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1930     hf_hub_utils.enable_progress_bars()
   1931 else:
-> 1932     return inner_training_loop(
   1933         args=args,
   1934         resume_from_checkpoint=resume_from_checkpoint,
   1935         trial=trial,
   1936         ignore_keys_for_eval=ignore_keys_for_eval,
   1937     )

File ~/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py:2279, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2277 else:
   2278     if tr_loss.device != tr_loss_step.device:
   ...
   2281     )
   2282 tr_loss += tr_loss_step
   2284 self.current_flos += float(self.floating_point_ops(inputs))

ValueError: Calculated loss must be on the original device: cpu but device in use is mps:0

I can't pinpoint the origin of the problem, even after trying the different TrainingArguments parameters (no_cuda, use_mps_device=True, etc.).

michael-wang-enigma commented 1 month ago

How did you construct train_dataset and test_dataset? Do you have all the necessary pieces, i.e.:

GLiNERDataset(
    data_gliner,
    model_config,
    tokenizer,
    words_splitter,
    data_processor,
)
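
For example, a rough sketch of filling those in, assuming the import paths and the words_splitter_type config attribute used by the repo's train.py (they may differ across GLiNER versions):

from gliner import GLiNER
from gliner.data_processing import GLiNERDataset, WordsSplitter

model = GLiNER.from_pretrained("almanach/camembert-bio-gliner-v0.1")
words_splitter = WordsSplitter(model.config.words_splitter_type)

train_dataset = GLiNERDataset(
    train_data,                                 # list of dicts loaded from data.json
    model.config,
    model.data_processor.transformer_tokenizer,
    words_splitter,
    model.data_processor,
)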

Also, I didn't need to use model.to("mps") for training on the GPU to work. I haven't fine-tuned that specific model, but fine-tuning "knowledgator/gliner-multitask-large-v0.5" works for me.

sarrabenyahia commented 1 month ago

Here is what I did for the train and test data:

train_path = "data.json"

with open(train_path, "r") as f:
    data = json.load(f)

print('Dataset size:', len(data))

random.shuffle(data)
print('Dataset is shuffled...')

train_dataset = data[:int(len(data) * 0.9)]
test_dataset = data[int(len(data) * 0.9):]

print('Dataset is splitted...')

and

data_collator = DataCollator(model.config, data_processor=model.data_processor, prepare_labels=True)

michael-wang-enigma commented 1 month ago

@sarrabenyahia I followed the code for train.py here, which says to cast the data to GLiNERDataset.

sarrabenyahia commented 1 month ago

Thank you for your help. I tried the code you followed, changing only this line:

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')

to

device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')

I still get the same error :

(venv) sarrabenyahia@MacBook-Air-de-Sarra demo % python train_demo.py --config ./config.yaml
Dataset size: 19635
Dataset is shuffled...
Dataset is splitted...
Collecting all entities...
100%|████████████████████████████████████████████████████████████████████████████████████| 17671/17671 [00:00<00:00, 419869.97it/s]
Total number of entity classes: 5606
Collecting all entities...
100%|██████████████████████████████████████████████████████████████████████████████████████| 1964/1964 [00:00<00:00, 617234.61it/s]
Total number of entity classes: 1357
Some weights of CamembertModel were not initialized from the model checkpoint at almanach/camembert-bio-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/training_args.py:1494: FutureWarning: evaluation_strategy is deprecated and will be removed in version 4.46 of 🤗 Transformers. Use eval_strategy instead
  warnings.warn(
max_steps is given, it will override any value given in num_train_epochs
  0%|          | 0/30000 [00:00<?, ?it/s]Skipping iteration due to error: Placeholder storage has not been allocated on MPS device!
Traceback (most recent call last):
  File "/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/src/finetuning/demo/train_demo.py", line 97, in <module>
    trainer.train()
  File "/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py", line 1932, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/sarrabenyahia/Documents/GitHub/GLiner-TransbronchialBiopsy/venv/lib/python3.11/site-packages/transformers/trainer.py", line 2279, in _inner_training_loop
    raise ValueError(
ValueError: Calculated loss must be on the original device: cpu but device in use is mps:0
  0%|          | 0/30000 [00:54<?, ?it/s]

michael-wang-enigma commented 1 month ago

@sarrabenyahia can you remove this line: device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('mps')

I found that setting the device was unnecessary.
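
Recent transformers versions resolve the device through TrainingArguments automatically, so you can check what the Trainer will run on without any model.to(...) call; a quick sketch:

from transformers import TrainingArguments

# TrainingArguments auto-detects the backend: CUDA first, then MPS
# on Apple Silicon, then CPU; Trainer inherits this device.
args = TrainingArguments(output_dir="models", report_to="none")
print(args.device)  # expect "mps" on an M-series Mac with an MPS-enabled torch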

sarrabenyahia commented 1 month ago

Tried it just now; it still gives me the same error: ValueError: Calculated loss must be on the original device: cpu but device in use is mps:0

michael-wang-enigma commented 1 month ago

@sarrabenyahia I just ran into this issue on transformers==4.42. I had to downgrade to 4.41 to get this working on MPS.
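
For anyone hitting this later, that means pinning the version, e.g.:

pip install "transformers<4.42"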

sarrabenyahia commented 1 month ago

Thank you! It works when downgrading the package.