**Closed** · dillonalaird closed this issue 8 months ago
@lkk12014402 It looks like you put together the multi-modal fine-tuning example; wondering if you ever came across this issue?
Found some more information: it seems to be coming from items in llava_v1_5_mix665k.json without images. There are two places where this happens. Usually it's a few hundred examples from the OCR VQA dataset whose images didn't download, but there are also about 40k samples without any image tag at all, for example:
```
{'id': 'i6IyJda_0',
 'model': '',
 'conversations': [{'from': 'human',
   'value': 'How to tell if a customer segment is well segmented? In 3 bullet points.'},
  {'from': 'gpt',
   'value': '1. Homogeneity: The segment should consist of customers who share similar characteristics and behaviors.\n2. Distinctiveness: The segment should be different from other segments in terms of their characteristics and behaviors.\n3. Stability: The segment should remain relatively stable over time and not change drastically. The characteristics and behaviors of customers within the segment should not change significantly.'}]}
```
If you strip these examples, training seems to run fine, so the failure appears to be triggered by the text-only examples.
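A minimal sketch of that stripping step. The function name and the image-root layout are my own assumptions (I'm assuming each sample's `image` field is a path relative to the dataset's image directory, as in the standard LLaVA layout):

```python
import json
from pathlib import Path


def filter_multimodal(samples, image_root):
    """Drop samples that have no 'image' field, or whose image file
    never downloaded (e.g. the missing OCR VQA images)."""
    root = Path(image_root)
    return [
        s for s in samples
        if s.get("image") and (root / s["image"]).exists()
    ]


# Hypothetical usage (paths are placeholders):
# with open("llava_v1_5_mix665k.json") as f:
#     data = json.load(f)
# data = filter_multimodal(data, "playground/data")  # your image root here
```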
Found this issue and fixed it in this PR https://github.com/intel/intel-extension-for-transformers/pull/1199
The issue is fixed by this PR.
I'm trying to run the neural chat Mistral fine-tuning example from here. Oddly enough, I can run the pretraining script fine, but when I run the fine-tuning script, which I believe runs basically the same code path, I get the following device error:
This only happens after a few calls to backward. The SliceBackward0 seems to be getting introduced on this line, but that code gets run by the pretraining script as well. All input tensors, the model, the final loss, etc. are on the hpu device.
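For narrowing this kind of thing down, here is a small debugging sketch (a hypothetical helper, not part of the repo) that walks a mapping of named tensor-like objects and reports anything not on the expected device before calling backward:

```python
def find_misplaced_tensors(named_tensors, expected_device="hpu"):
    """Return the names of entries whose device string does not start with
    `expected_device` (e.g. a stray CPU tensor in an otherwise-HPU batch).

    `named_tensors` is any mapping of name -> object exposing a `.device`
    attribute, as torch tensors do; entries without one are skipped.
    """
    misplaced = []
    for name, tensor in named_tensors.items():
        device = getattr(tensor, "device", None)
        if device is not None and not str(device).startswith(expected_device):
            misplaced.append(name)
    return misplaced


# Hypothetical usage inside the training loop, before loss.backward():
# bad = find_misplaced_tensors({"input_ids": batch["input_ids"],
#                               "labels": batch["labels"],
#                               "loss": loss})
# assert not bad, f"tensors on the wrong device: {bad}"
```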
I'm using the Dockerfile from here, built for the Habana Gaudi environment.