stanfordnlp / dspy

DSPy: The framework for programming—not prompting—foundation models
https://dspy-docs.vercel.app/
MIT License

Reproducible results with HFModel or MLC #140

Open Sohaila-se opened 10 months ago

Sohaila-se commented 10 months ago

I noticed that the results are not reproducible. I am using Llama with BootstrapFewShot, and every time I compile the same program, I get totally different results (not even close). I noticed in the Predict forward function that the temperature is changed. Is there a way to set the temperature to 0 so that I can get reproducible results?

okhat commented 10 months ago

Are you using llama through TGI client or through HFModel?

okhat commented 10 months ago

If you use the TGI client, all results are reproducible. The HFModel is supported now only on a "best effort" basis. No promises that it's great yet.

okhat commented 10 months ago

We have documentation for TGI here: https://github.com/stanfordnlp/dspy/blob/main/docs/language_models_client.md#tgi
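For reference, a minimal sketch of pointing DSPy at a running TGI server (the model name, port, and URL below are placeholders; see the linked docs for the exact setup):

import dspy

# Assumes a TGI server is already running locally and serving this model.
tgi_llama = dspy.HFClientTGI(model="meta-llama/Llama-2-7b-hf", port=8080, url="http://localhost")
dspy.settings.configure(lm=tgi_llama)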

Sohaila-se commented 10 months ago

Well, thank you Omar. I am actually using the MLC client, following this tutorial: docs/using_local_models.md

okhat commented 10 months ago

Makes sense. HFModel and especially MLC are currently great for demos / exploration, but they're not going to be reliable for systematic tests. MLC in particular is quantized so a lot of noise could exist across runs.

TGI client (for open models) and OpenAI are resilient. (You can try to do compiled_program.save('path') and later load it. That's one option for reproducibility.)

Otherwise I've assigned this issue to myself to make these reproducible through caching. This is expected to take 1-2 weeks.
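A minimal sketch of that save/load option (the path and module class below are hypothetical, and this assumes the compiled program exposes matching save()/load() methods):

# Persist the compiled program's state (e.g. bootstrapped demos) to disk,
# then reload it into a fresh instance of the same module class later.
compiled_program.save('compiled_program.json')

reloaded = MyProgram()                  # hypothetical: the module class you compiled
reloaded.load('compiled_program.json')  # restores the saved demos/state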

Sohaila-se commented 10 months ago

I will give the TGI client a try. Thank you @okhat.

Sohaila-se commented 10 months ago

I was able to get reproducible results by setting the temperature of Llama to zero.

In hf_client.py, I passed **kwargs through to ChatConfig:

class ChatModuleClient(HFModel):
    def __init__(self, model, model_path, **kwargs):
        super().__init__(model=model, is_client=True)

        from mlc_chat import ChatModule
        from mlc_chat import ChatConfig

        self.cm = ChatModule(model=model, lib_path=model_path, chat_config=ChatConfig(**kwargs))

Then, when creating the client:

kwarg = {
    "temperature": 0.0,
    "conv_template": "LM",
}

llama = dspy.ChatModuleClient(model='mlc-chat-Llama-2-7b-chat-hf-q4f16_1', 
                              model_path='dist/prebuilt/lib/Llama-2-7b-chat-hf-q4f16_1-cuda.so',
                              **kwarg)

dspy.settings.configure(lm=llama)

This can also be done by editing mlc-chat-config.json.
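As a quick sanity check (assuming the client is callable like other DSPy LM clients and returns a list of completion strings), two runs of the same prompt should now match:

# With temperature 0.0, repeated calls on the same prompt should be identical.
prompt = "Summarize what DSPy does in one sentence."
first = llama(prompt)
second = llama(prompt)
assert first == second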

okhat commented 10 months ago

Thanks @Sohaila-se! Just FYI, MLC may be missing some features and is heavily quantized, so use it with some caution.

darinkishore commented 8 months ago

> Makes sense. HFModel and especially MLC are currently great for demos / exploration, but they're not going to be reliable for systematic tests. MLC in particular is quantized so a lot of noise could exist across runs.
>
> TGI client (for open models) and OpenAI are resilient. (You can try to do compiled_program.save('path') and later load it. That's one option for reproducibility.)
>
> Otherwise I've assigned this issue to myself to make these reproducible through caching. This is expected to take 1-2 weeks.

@okhat, quick question. I've started running DSPy on my larger tasks. When I save the TGI pipeline (or even an OpenAI one), the lm field is set to null. Is this due to improper config on my part?

    lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
    dspy.settings.configure(lm=lm)
    if RECOMPILE_INTO_LLAMA_FROM_SCRATCH:
        tp = BootstrapFewShot(metric=metric_EM, max_bootstrapped_demos=2, max_rounds=1, max_labeled_demos=2)
        compiled_boostrap = tp.compile(DetectConcern(), trainset=train_examples[:10], valset=train_examples[-10:])
        print("woof")
        try:
            compiled_boostrap.save(os.path.join(ROOT_DIR, "dat", "college", f"{model_name}_concerndetect.json"))
        except Exception as e:
            print(e)

This outputs a JSON file after the pipeline runs, with:

{
  "most_likely_concerns": {
    "lm": null,
    "traces": [],
    "train": [],
    "demos": "omitted"
  },
  "concern_present": {
    "lm": null,
    "traces": [],
    "train": [],
    "demos": "omitted"
  },

I get identical output (in terms of the lm field, traces, and train) when saving a Llama-13b pipeline. Am I doing something obviously wrong, or is saving still inconsistent? My pipeline (I don't think it matters) is below for reference:

class DetectConcern(dspy.Module):
    def __init__(self):
        super().__init__()
        self.most_likely_concerns = dspy.Predict(MostLikelyConcernsSignature, max_tokens=100)
        self.concern_present = dspy.ChainOfThought(ConcernPresentSignature)

    def forward(self, title, post, possible_concerns):
        # Get the most likely concerns
        most_likely_concerns = self.most_likely_concerns(
            title=title, post=post, possible_concerns=possible_concerns
        ).most_likely_concerns

        # Process the concerns
        cleaned_concerns = clean_concerns_to_list(most_likely_concerns)
        detected_concerns = []

        # for first six concerns, check if they are present in the post
        for clean_concern in cleaned_concerns[:6]:
            concern_present = self.concern_present(
                title=title, post=post, concern=clean_concern
            )
            is_concern_present = concern_present.concern_present
            reasoning = concern_present.reasoning
            if true_or_false(is_concern_present):
                detected_concerns.append(clean_concern)

        detected_concerns = ', '.join(detected_concerns)
        return detected_concerns

At the moment, would a good workaround be pickling?

okhat commented 8 months ago

@darinkishore lm=null is not a bug here. None just means "use the currently configured LM".
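In other words, a reload along these lines should behave as expected (the path below is hypothetical, and the variable names are reused from your snippet): the saved JSON carries the demos and traces, while the LM comes from whatever is configured at run time.

import os
import dspy

# The "lm": null entries in the saved file are expected; the LM actually used
# is whatever dspy.settings is configured with when the program runs.
lm = dspy.OpenAI(model=model_name, api_key=os.getenv('OPENAI_API_KEY'))
dspy.settings.configure(lm=lm)

program = DetectConcern()
program.load("path/to/saved_program.json")  # hypothetical path to the saved JSON
result = program(title=title, post=post, possible_concerns=possible_concerns)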