ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Config type safety and auto-complete #3686

Open Tradunsky opened 11 months ago

Tradunsky commented 11 months ago

Love the platform and low code ideas it brings to ML infra. ❤️

Is your feature request related to a problem? Please describe. LudwigModel accepts a YAML string or a raw dict as the model config, and many examples hard-code a decent amount of YAML. But the structure of that YAML is neither guided nor known upfront, which makes it more work to adopt the declarative approach through the programmatic interface.
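To make the problem concrete, here is a minimal sketch (the keys are illustrative, not Ludwig's full schema): a raw dict config gives no feedback when a key is misspelled, and an IDE cannot auto-complete dict keys.

```python
# A raw dict config: nothing validates the keys up front.
config = {
    "model_type": "llm",
    "base_model": "meta-llama/Llama-2-7b-hf",
    # Typo: "preprocesing" instead of "preprocessing" -- a plain dict
    # accepts it silently; the mistake only surfaces later, if at all.
    "preprocesing": {"sample_ratio": 0.1},
}

# No error is raised at construction time; the typo goes unnoticed.
print("preprocessing" in config)  # False
```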

Describe the use case Looking at LLM fine-tuning: https://colab.research.google.com/drive/1c3AO8l_H6V_x37RwQ8V7M6A-RmcBf2tG?usp=sharing#scrollTo=JfZq1-qbulcg It looks like I can specify a preprocessing sample rate; I am curious what else I can do in preprocessing, and whether I can set a fixed parallelism for expected multi-core utilization 🤔

Describe the solution you'd like Ideally, have pydantic classes defining each YAML structure, so you get type safety, input and output validation, and IDE auto-complete for quicker lookup of possible fields and values, with documentation living next to the code. However, the dataclasses already present in the code base for configs would also be good enough, as they provide auto-complete and allow documenting arguments.

Something similar to:

```python
config = LLMConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    quantization=BitsAndBytes(bits=4),
    adapter=Lora(alpha=16),
    prompt=Prompt(template="Say hello to {username}"),
    input_features=[TextFeature(name="prompt", preprocessing=Preprocessing(max_sequence_length=256))],
    output_features=[TextFeature(name="output", preprocessing=Preprocessing(max_sequence_length=256))],
    preprocessing=Preprocessing(sample_ratio=0.1, parallelism=4),
)
model = LudwigModel(config=config, logging_level=logging.INFO)
```
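A minimal sketch of what such typed configs buy you, using only stdlib dataclasses (the class and field names here are hypothetical, not Ludwig's actual schema):

```python
from dataclasses import dataclass, field


@dataclass
class Preprocessing:
    sample_ratio: float = 1.0
    max_sequence_length: int = 256

    def __post_init__(self):
        # Validation runs at construction time, not deep inside training.
        if not 0.0 < self.sample_ratio <= 1.0:
            raise ValueError(f"sample_ratio must be in (0, 1], got {self.sample_ratio}")


@dataclass
class TextFeature:
    name: str
    preprocessing: Preprocessing = field(default_factory=Preprocessing)


# Valid config: fields are discoverable via IDE auto-complete.
feature = TextFeature(name="prompt", preprocessing=Preprocessing(sample_ratio=0.1))

# An invalid value fails fast with a clear error:
try:
    Preprocessing(sample_ratio=5.0)
except ValueError as e:
    print(e)  # sample_ratio must be in (0, 1], got 5.0
```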

Describe alternatives you've considered I see there are such domain models already present: https://github.com/ludwig-ai/ludwig/blob/e46a9890b9f6345a0ba2face03d0e6fcedb909d9/ludwig/features/text_feature.py#L211 They contain more implementation details than are needed for easy auto-complete search of possible properties, but it would potentially be faster to reuse them if their API is not meant to change, and I assume the config YAML is de-serialized into such classes anyway.
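Either way, a typed config could still feed the existing dict/YAML interface unchanged, since dataclasses convert to plain dicts via `dataclasses.asdict` (the class names below are hypothetical):

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Preprocessing:
    sample_ratio: float = 1.0


@dataclass
class LLMConfig:
    base_model: str
    preprocessing: Preprocessing = field(default_factory=Preprocessing)


typed = LLMConfig(
    base_model="meta-llama/Llama-2-7b-hf",
    preprocessing=Preprocessing(sample_ratio=0.1),
)

# Convert to the raw-dict form that LudwigModel already accepts,
# so existing notebooks using dict configs keep working.
raw = asdict(typed)
print(raw["preprocessing"]["sample_ratio"])  # 0.1
```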

Additional context This should not only help adoption of the declarative approach, but could also skip a de-serialization step.

Infernaught commented 11 months ago

Hi @Tradunsky. Thank you for the suggestion! I personally think that this might be a little less user-friendly for first-time users, and it wouldn't be compatible with our current notebooks. However, I'd like to get input from @tgaddair and @justinxzhao on this, because I agree that we should maybe have more checks in place.