instructlab / training

InstructLab Training Library

Refactor Chat Template and Special Tokens Configuration #59

Open aldopareja opened 1 month ago

aldopareja commented 1 month ago

Refactor Chat Template and Special Tokens Configuration

The current system relies on Python modules for configuration, which is inflexible and harder to manage. By switching to Pydantic models backed by YAML configuration files, adding or modifying chat templates and special tokens becomes a matter of editing a single YAML file that Pydantic validates at load time.

Affected Files:

- src/instructlab/training/config.py
- src/instructlab/training/main_ds.py
- src/instructlab/training/chat_templates/*
- src/instructlab/training/utils.py
- src/instructlab/training/tokenizer_utils.py
- ...

Proposed Changes:

1. Define Pydantic models for the configuration in tokenizer_utils.py:

```python
from pydantic import BaseModel

# New Pydantic models describing the YAML configuration
class SpecialTokensConfig(BaseModel):
    system: str
    user: str
    assistant: str
    eos: str
    pad: str

class ChatTemplateConfig(BaseModel):
    template: str
```


2. Add a YAML file with this sort of structure:

```yaml
special_tokens:
  system: "<|system|>"
  user: "<|user|>"
  assistant: "<|assistant|>"
  eos: "<|eos|>"
  pad: "<|pad|>"

chat_template:
  template: |
    {% for message in messages %}
    {% if message['role'] == 'pretraining' %}
    "{{'<|eos|>' + message['content'] + '<|eos|>'}}"
    {% elif message['role'] == 'system' %}
    "{{'<|system|>'+ '\n' + message['content'] + '\n'}}"
    {% elif message['role'] == 'user' %}
    "{{'<|user|>' + '\n' + message['content'] + '\n'}}"
    {% elif message['role'] == 'assistant' %}
    "{{'<|assistant|>' + '\n' + message['content'] + '<|eos|>' + ('' if loop.last else '\n')}}"
    {% endif %}
    {% endfor %}
```

3. Load the configuration from the YAML file in tokenizer_utils.py:

```python
from typing import Tuple

import yaml


def retrieve_chat_template(chat_templates_yaml_file: str) -> Tuple[str, SpecialTokensConfig]:
    with open(chat_templates_yaml_file, "r") as file:
        config_data = yaml.safe_load(file)

    special_tokens_config = SpecialTokensConfig(**config_data["special_tokens"])
    chat_template_config = ChatTemplateConfig(**config_data["chat_template"])

    return chat_template_config.template, special_tokens_config
```

4. Use the function in setup_tokenizer in tokenizer_utils.py (a fuller sketch of this function follows the list below):

```python
def setup_tokenizer(model_name_or_path, chat_template_yaml_path):
    chat_template, special_tokens = retrieve_chat_template(chat_template_yaml_path)
    ...
```

5. Change data_process to use the new method.
6. Clean up: remove the previous flow from the codebase.
   - main_ds.py
   - utils.py
   - ...
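
For reference, a minimal sketch of how the rest of setup_tokenizer could apply the loaded configuration, assuming a Hugging Face tokenizer; the token-registration calls and the exact fields applied here are assumptions for illustration, not the final implementation:

```python
# Hypothetical sketch only: assumes a Hugging Face tokenizer and the
# retrieve_chat_template helper proposed above.
from transformers import AutoTokenizer


def setup_tokenizer(model_name_or_path, chat_template_yaml_path):
    chat_template, special_tokens = retrieve_chat_template(chat_template_yaml_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

    # Register the configured special tokens so each maps to a single token ID.
    tokenizer.add_special_tokens(
        {"eos_token": special_tokens.eos, "pad_token": special_tokens.pad}
    )
    tokenizer.add_special_tokens(
        {
            "additional_special_tokens": [
                special_tokens.system,
                special_tokens.user,
                special_tokens.assistant,
            ]
        }
    )

    # Attach the Jinja template so tokenizer.apply_chat_template() uses it.
    tokenizer.chat_template = chat_template
    return tokenizer
```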
aldopareja commented 1 month ago

@RobotSail, not sure if this is the best way to use pydantic; something tells me that directly loading the YAML is a bit much, but I'm not sure. @Maxusmusti, does this sound sensible? I think it would make the logic for the chat template and special tokens much easier to understand -- the current importlib approach works but is a bit hacky IMO.

Maxusmusti commented 1 month ago

@aldopareja it makes sense. The only things to keep in mind are: hints for people adding their own templates (so they know what the defaults will be), and that the data_process script uses SPECIAL_TOKENS in more places than just setup_tokenizer.
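
For illustration, a hypothetical sketch of the kind of usage outside setup_tokenizer this refers to; role_marker_ids and its body are made up here, not the actual data_process code:

```python
# Hypothetical example of why data_process also needs the special tokens:
# label masking has to know the token IDs of the role markers.
def role_marker_ids(tokenizer, special_tokens: SpecialTokensConfig):
    return {
        "system": tokenizer.convert_tokens_to_ids(special_tokens.system),
        "user": tokenizer.convert_tokens_to_ids(special_tokens.user),
        "assistant": tokenizer.convert_tokens_to_ids(special_tokens.assistant),
    }
```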

RobotSail commented 1 month ago

@aldopareja Right now we're just using pydantic for runtime type-checking. You can use it to load in YAMLs as well but it's up to you. Overall I agree with the structure and +1 this change.
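
A minimal sketch of that runtime type-checking, assuming the SpecialTokensConfig model from the proposal above; the try/except is only there to show the failure mode:

```python
from pydantic import BaseModel, ValidationError


class SpecialTokensConfig(BaseModel):
    system: str
    user: str
    assistant: str
    eos: str
    pad: str


try:
    # Missing 'pad' -> pydantic raises a ValidationError at load time instead
    # of failing later with a cryptic error deep in the training code.
    SpecialTokensConfig(
        system="<|system|>", user="<|user|>", assistant="<|assistant|>", eos="<|eos|>"
    )
except ValidationError as err:
    print(err)
```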