intel / intel-extension-for-transformers


Neural Chat Finetune Mistral Fails #1181

Closed dillonalaird closed 8 months ago

dillonalaird commented 8 months ago

I'm trying to run the neural chat Mistral finetuning example from here. Weirdly enough, I can run the pretraining script fine, but when I run the finetuning script, which I think follows basically the same code path, I get the following device error:

Traceback (most recent call last):
  File "/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/multi_modal/train.py", line 338, in <module>
    train()
  File "/intel-extension-for-transformers/intel_extension_for_transformers/neural_chat/examples/finetuning/multi_modal/train.py", line 315, in train
    trainer.train()
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 486, in train
    return inner_training_loop(
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 842, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/intel-extension-for-transformers/optimum-habana/optimum/habana/transformers/trainer.py", line 1368, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1989, in backward
    loss.backward(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 502, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function SliceBackward0 returned an invalid gradient at index 0 - expected device hpu:0 but got cpu

This only happens after a few calls to backward. The SliceBackward0 seems to be introduced on this line, but that code is run by the pretraining script as well. All input tensors, the model, the final loss, etc. are on the hpu device.
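For anyone else debugging this, a sanity check along these lines can confirm device placement right before backward (just a sketch, assuming a Gaudi environment with habana_frameworks.torch available to register the hpu device; the helper name is mine, not part of the example scripts):

import torch
import habana_frameworks.torch.core as htcore  # noqa: F401  (registers the "hpu" device)

def report_non_hpu_tensors(model, batch):
    # Print any parameter or input tensor that is not on the hpu device.
    for name, param in model.named_parameters():
        if param.device.type != "hpu":
            print(f"parameter {name} is on {param.device}")
    for key, value in batch.items():
        if torch.is_tensor(value) and value.device.type != "hpu":
            print(f"input {key} is on {value.device}")

# Call this right before loss.backward() to see which tensors, if any, live on cpu.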

I'm using the Dockerfile from here, built for the Habana Gaudi environment.

dillonalaird commented 8 months ago

@lkk12014402 it looks like you put together the multi-modal fine-tuning example; I'm wondering if you ever came across this issue?

dillonalaird commented 8 months ago

Found some more information: the error seems to come from items in llava_v1_5_mix665k.json that have no image. There are two cases where this happens: a few hundred examples from the OCR-VQA dataset whose images didn't download, and about 40k samples that have no image tag at all, e.g.:

{'id': 'i6IyJda_0',
 'model': '',
 'conversations': [{'from': 'human',
   'value': 'How to tell if a customer segment is well segmented? In 3 bullet points.'},
  {'from': 'gpt',
   'value': '1. Homogeneity: The segment should consist of customers who share similar characteristics and behaviors.\n2. Distinctiveness: The segment should be different from other segments in terms of their characteristics and behaviors.\n3. Stability: The segment should remain relatively stable over time and not change drastically. The characteristics and behaviors of customers within the segment should not change significantly.'}]}

If you strip these examples, training seems to run fine, so it appears to be an issue with text-only examples.
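In case it helps others, here is roughly how those entries could be filtered out (a sketch only; the image root and output filename are placeholders for whatever your local setup uses, and it assumes multi-modal entries carry an "image" field as in the standard LLaVA data format):

import json
import os

with open("llava_v1_5_mix665k.json") as f:
    data = json.load(f)

image_root = "playground/data"  # placeholder: wherever the images were downloaded

# Keep only entries that reference an image and whose image file actually exists,
# dropping both the text-only samples and the OCR-VQA items that failed to download.
filtered = [
    ex for ex in data
    if "image" in ex and os.path.exists(os.path.join(image_root, ex["image"]))
]

print(f"kept {len(filtered)} of {len(data)} examples")

with open("llava_v1_5_mix665k_filtered.json", "w") as f:
    json.dump(filtered, f)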

dillonalaird commented 8 months ago

Found the cause and fixed it in this PR: https://github.com/intel/intel-extension-for-transformers/pull/1199

lkk12014402 commented 8 months ago

The issue is fixed by this PR.