gotzmann opened this issue 7 months ago
@gotzmann I think you're referring to 2 things:
I'm using Unsloth without initializing any collators explicitly in my code, so basically the trainer just processes raw texts without any prompts or masks:
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    packing = False,
)
What I'd like to understand:
1) How to use Unsloth with system prompting, adding whatever extra pieces are needed to implement the train_on_inputs = False feature
2) How to pack the aforementioned examples with prompting, to maximize performance
I'll try to grok through the mentioned links, but they look too complicated for me :)
@gotzmann Would it be possible for you to show a rough sample of a few rows of your dataset, and what the required output would be?
I'm evaluating different open datasets, converting all of them to ShareGPT format for ease of use.
Like this one, where the conversations column contains the system / human / gpt parts of one input piece:
https://huggingface.co/datasets/teknium/OpenHermes-2.5
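For reference, one row of such a ShareGPT-style dataset looks roughly like this (the values below are made up for illustration):

# One row of a ShareGPT-style dataset (values invented for illustration):
row = {
    "conversations": [
        {"from": "system", "value": "You are a helpful assistant."},
        {"from": "human",  "value": "What is the capital of France?"},
        {"from": "gpt",    "value": "The capital of France is Paris."},
    ]
}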
Basically, what I'd want to implement directly with Unsloth and HF Transformers should be similar to this black-magic code from Axolotl:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/main/src/axolotl/prompt_tokenizers.py
tokenized_prompt = self._tokenize(user_prompt, add_eos_token=False)
if not self.train_on_inputs:
    user_prompt_len = len(tokenized_prompt["input_ids"])
    # TODO this could be sped up using numpy array slicing
    tokenized_prompt["labels"] = [IGNORE_INDEX] * user_prompt_len

tokenized_res_prompt = self._tokenize(
    response, strip_bos_token=True, add_eos_token=True
)
tokenized_prompt["input_ids"] += tokenized_res_prompt["input_ids"]
tokenized_prompt["attention_mask"] += tokenized_res_prompt["attention_mask"]
tokenized_prompt["labels"] += tokenized_res_prompt["input_ids"]
There is a special token ID for masking inputs:
IGNORE_TOKEN_ID = -100
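The same idea can be sketched without Axolotl, using a plain HF tokenizer. This is a minimal sketch only: build_example is a hypothetical helper (not an Unsloth or TRL API), and it assumes the prompt/response boundary is already known and a tokenizer is loaded:

IGNORE_INDEX = -100  # HF loss functions skip positions labeled -100

def build_example(tokenizer, prompt, response, train_on_inputs=False):
    # Tokenize prompt and response separately so the boundary is known.
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids
    labels = list(input_ids)
    if not train_on_inputs:
        # Mask the prompt so only response tokens contribute to the loss.
        labels[:len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)

    return {
        "input_ids": input_ids,
        "attention_mask": [1] * len(input_ids),
        "labels": labels,
    }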
@gotzmann Cool, thanks for the info! Redditors also asked about an example for ShareGPT-style conversations - I'll see what I can do to make a Colab notebook :)
@gotzmann https://www.reddit.com/r/LocalLLaMA/comments/1ail8jr/qlora_with_sharegpt_and_chatml_template_ready_to/ :) Looks like someone made an Unsloth example for ShareGPT-style datasets just today!!
@danielhanchen sorry, nothing particularly interesting there.
Packing = False, and there's no juggling with the attention / input matrices.
Been there, done that :) The output will be a trashy model.
@gotzmann Yes, but it partially solves your first issue of using a ShareGPT-style format - I think train_on_inputs = False isn't done in that notebook; it will require some more custom code paths.
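For what it's worth, TRL already ships a collator that masks everything before a response marker, which is effectively train_on_inputs = False. A minimal sketch, assuming your text field contains a literal "### Response:" separator (the template string must match your actual prompt format) and that model, tokenizer, and dataset are already defined:

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Sets labels to -100 for every token before the response template,
# so the loss is computed on the completion only.
collator = DataCollatorForCompletionOnlyLM(
    response_template = "### Response:",  # must match the dataset's formatting
    tokenizer = tokenizer,
)

trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    data_collator = collator,
    packing = False,  # this collator does not support packed sequences
)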
For packing - again this is a feature request, so depending on how much bandwidth we have, I'll see what we can do.
Hey, I'm using Unsloth with 48 GB cards, where it's able to pre-train models up to 70B with 4K context.
Is it possible to use Unsloth to do SFT with instructions on which tokens should be ignored / masked, and with an attention matrix for properly packing samples?
Please help with some examples if possible. I have to use Axolotl for SFT tasks for now.
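To make the packing ask concrete, here is a minimal greedy-packing sketch over examples tokenized as above. pack_examples is a hypothetical helper, not an existing Unsloth or TRL API; a real implementation would also need a block-diagonal attention mask (or varlen flash-attention kernels) so packed samples cannot attend to each other, which this sketch only hints at via per-sample position_ids:

def pack_examples(examples, max_seq_len):
    # Greedily concatenate tokenized examples into rows of up to max_seq_len.
    packed = []
    current = {"input_ids": [], "labels": [], "position_ids": []}
    for ex in examples:
        n = len(ex["input_ids"])
        if current["input_ids"] and len(current["input_ids"]) + n > max_seq_len:
            packed.append(current)
            current = {"input_ids": [], "labels": [], "position_ids": []}
        current["input_ids"] += ex["input_ids"]
        current["labels"] += ex["labels"]
        # Restart positions at 0 for each sample; attention kernels that honor
        # this (or a block-diagonal mask) keep samples from attending across
        # packing boundaries.
        current["position_ids"] += list(range(n))
    if current["input_ids"]:
        packed.append(current)
    return packed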