ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Config type safety and auto-complete #3686

Open Tradunsky opened 11 months ago

Tradunsky commented 11 months ago

Love the platform and low code ideas it brings to ML infra. ❤️

Is your feature request related to a problem? Please describe. LudwigModel accepts a YAML string or a raw dict as the model config, and many examples hard-code a decent amount of YAML. But the structure of that YAML is neither guided nor known upfront, which makes it more work to adopt the declarative approach through the programmatic interface.
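To make the problem concrete, here is a minimal sketch (the keys are illustrative, not Ludwig's full schema): a raw dict config gives no feedback when a key is misspelled, and an IDE cannot auto-complete dict keys.

```python
# A raw dict config: nothing validates the keys up front.
config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    # Typo: "preprocesing" instead of "preprocessing" -- a plain dict
    # accepts it silently; the mistake only surfaces later, if at all.
    "preprocesing": {"sample_ratio": 0.1},
}

# No error is raised at construction time; the typo goes unnoticed.
print("preprocessing" in config)  # False
```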

Describe the use case Looking at LLM fine-tuning: https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG?usp=sharing#scrollTo=JfZq1-qbulcg It looks like I can specify a preprocessing sample rate; I am curious what else I can do in preprocessing, and whether I can set a fixed parallelism for expected multi-core utilization 🤔

Describe the solution you'd like Ideally, have pydantic classes defining each YAML structure, so you get type safety, input and output validation, and IDE auto-complete for quicker lookup of possible fields and values, with documentation living next to the code. However, the dataclasses already present in the code base for configs would also be good enough, as they provide auto-complete and allow documenting arguments.

Something similar to:

```python
config = LLMConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    quantization=BitsAndBytes(bits=4),
    adapter=Lora(alpha=16),
    prompt=Prompt(template="Say hello to {username}"),
    input_features=[TextFeature(name="prompt", preprocessing=Preprocessing(max_sequence_length=256))],
    output_features=[TextFeature(name="output", preprocessing=Preprocessing(max_sequence_length=256))],
    preprocessing=Preprocessing(sample_ratio=0.1, parallelism=4),
)
model = LudwigModel(config=config, logging_level=logging.INFO)
```
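A minimal sketch of what such typed configs buy you, using only stdlib dataclasses (the class and field names here are hypothetical, not Ludwig's actual schema):

```python
from dataclasses import dataclass, field


@dataclass
class Preprocessing:
    sample_ratio: float = 1.0
    max_sequence_length: int = 256

    def __post_init__(self):
        # Validation runs at construction time, not deep inside training.
        if not 0.0 < self.sample_ratio <= 1.0:
            raise ValueError(f"sample_ratio must be in (0, 1], got {self.sample_ratio}")


@dataclass
class TextFeature:
    name: str
    preprocessing: Preprocessing = field(default_factory=Preprocessing)


# Valid config: fields are discoverable via IDE auto-complete.
feature = TextFeature(name="prompt", preprocessing=Preprocessing(sample_ratio=0.1))

# An invalid value fails fast with a clear error:
try:
    Preprocessing(sample_ratio=5.0)
except ValueError as e:
    print(e)  # sample_ratio must be in (0, 1], got 5.0
```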

Describe alternatives you've considered I see there are such domain models already present: https://github.com/ludwig-ai/ludwig/blob/e46a9890b9f6345a0ba2face03d0e6fcedb909d9/ludwig/features/text_feature.py#L211 They contain more implementation details than are needed for easy auto-complete search of possible properties, but it would potentially be faster to reuse them if their API is not meant to change, and I assume the config YAML is de-serialized into such classes anyway.
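Either way, a typed config could still feed the existing dict/YAML interface unchanged, since dataclasses convert to plain dicts via `dataclasses.asdict` (the class names below are hypothetical):

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Preprocessing:
    sample_ratio: float = 1.0


@dataclass
class LLMConfig:
    base_model: str
    preprocessing: Preprocessing = field(default_factory=Preprocessing)


typed = LLMConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    preprocessing=Preprocessing(sample_ratio=0.1),
)

# Convert to the raw-dict form that LudwigModel already accepts,
# so existing notebooks using dict configs keep working.
raw = asdict(typed)
print(raw["preprocessing"]["sample_ratio"])  # 0.1
```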

Additional context This should not only help adoption of the declarative approach, but could also skip a de-serialization step.

Infernaught commented 11 months ago

Hi @Tradunsky. Thank you for the suggestion! I personally think that this might be a little less user-friendly for first-time users, and it wouldn't be compatible with our current notebooks. However, I'd like to get input from @tgaddair and @justinxzhao on this, because I agree that we should maybe have more checks in place.