Open aldopareja opened 1 month ago
@RobotSail , not sure if this is the best way to use pydantic, something tells me that directly loading the yaml is a bit much?, not sure though. @Maxusmusti , does this sound sensible?. I think that would make the logic of chat template and special tokens much easier to understand -- the current importlib way works but is a bit hacky IMO.
@aldopareja it makes sense, the only things to keep in mind are hints when people are adding their templates (so they know what defaults will be), and also the data_process script uses SPECIAL_TOKENS in more places than within setup_tokenizer
@aldopareja Right now we're just using pydantic for runtime type-checking. You can use it to load in YAMLs as well but it's up to you. Overall I agree with the structure and +1 this change.
Refactor Chat Template and Special Tokens Configuration
The current system relies on Python modules for configuration, which can be less flexible and harder to manage. By using Pydantic models and YAML configuration files, we can streamline the process of adding or modifying chat templates and special tokens by simply specifying yaml configuration files using pydantic.
Affected Files:
src/instructlab/training/config.py src/instructlab/training/main_ds.py src/instructlab/training/chat_templates/* src/instructlab/training/utils.py src/instructlab/training/tokenizer_utils.py ...
Proposed Changes:
class ChatTemplateConfig(BaseModel): template: str
def retrieve_chat_template(chat_templates_yaml_file: str) -> Tuple[str, SpecialTokensConfig]: with open(args.chat_templates_yaml, 'r') as file: config_data = yaml.safe_load(file)