wandb / weave

Weave is a toolkit for developing AI-powered applications, built by Weights & Biases.
https://wandb.me/weave
Apache License 2.0

Cache model output on a particular dataset #2925

Open chandlj opened 3 weeks ago

chandlj commented 3 weeks ago

It would be nice if we could pre-compute a model's output on a particular dataset and essentially "cache" it for use in an evaluation. For example, we have a large dataset of long-context documents, and running our model on this dataset is particularly expensive. If we want to change our evaluation pipeline at any point, say by adding, removing, or modifying a computed metric/score on the dataset, it seems we would have to re-run our model on the entire dataset to get a new evaluation run.

It does seem like you could already do something like this:

import weave
from typing import TypeVar
from weave import Dataset, Evaluation

T = TypeVar("T")

dataset = Dataset(
    name="papers",
    rows=[
        # "output" is the pre-computed model output, stored at the database level
        {"id": "0", "docs": ..., "output": ...},
    ],
)

class IdentityModel(weave.Model):
    @weave.op()
    async def predict(self, docs: ..., output: T) -> T:
        # Just echo the cached output instead of re-running the real model
        return output

model = IdentityModel()
evaluation = Evaluation(dataset=dataset, scorers=[...])  # Add our metrics here
await evaluation.evaluate(model)

However, this is obviously not ideal and would probably be confusing in the UI.

andrewtruong commented 3 weeks ago

Hey @chandlj, we're working on "adding calls to a dataset", which I think is what you're asking for.

Basically:

# 1. Create Call objects containing the inputs, outputs, etc.
calls = []
for x in range(3):
    res, call = await model.predict.call(...)
    calls.append(call)

# 2. Generate a dataset from those calls (your pre-computed model outputs)
dataset = Dataset.from_calls(calls)

# 3. Pass to Evaluation as you would normally.
evaluation = Evaluation(dataset=dataset, ...)

Then you can reuse the dataset later via the "Use" tab in the UI.
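
You could also pull the published dataset back down programmatically. Roughly, assuming you publish it under a name of your choosing (the project and dataset names below are just placeholders):

import weave

weave.init("my-team/my-project")  # placeholder project

# Publish the dataset built from calls so it can be fetched again later
weave.publish(dataset, name="cached-model-outputs")  # placeholder name

# ...later, in a separate script or run: fetch it back by reference
cached = weave.ref("cached-model-outputs").get()

# Re-score the cached outputs with whatever scorers you need now,
# without re-running the model
evaluation = weave.Evaluation(dataset=cached, scorers=[...])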

chandlj commented 3 weeks ago

Hey @andrewtruong thanks for the swift reply! When can we expect this feature to be completed?

andrewtruong commented 3 weeks ago

No firm timeline atm, but my current guess would be in the next few weeks!

Would you want to primarily add calls via the API (like above), or via the UI?

chandlj commented 2 weeks ago

We would probably like to do it via the API. I kind of envision it working like this: we would start with a dataset of inputs, so something like:

dataset = Dataset(name="papers", rows=[{"id": 0, "context": ...}, ...])

calls = []
for entry in dataset.rows:
    # op.call returns both the result and the Call object
    res, call = await model.predict.call(entry)
    calls.append(call)

dataset_with_responses = Dataset.from_calls(calls)

weave.publish(dataset_with_responses, name="papers_with_calls")

...
# Later, using the dataset
dataset = weave.ref("papers_with_calls").get()

evaluation = Evaluation(dataset=dataset, scorers=[... dynamically changing list of scorers ...])

await evaluation.evaluate()  # In theory, you would not need to pass the model here because we have already computed outputs
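
Until something like that exists, I imagine we'd bridge the gap with a pass-through model like the IdentityModel above, assuming the cached outputs end up in a column such as "output" (the model name and column here are hypothetical):

class CachedOutputModel(weave.Model):
    @weave.op()
    async def predict(self, output):
        # Return the cached output so only the scorers are (re)computed
        return output

evaluation = Evaluation(dataset=dataset, scorers=[...])
await evaluation.evaluate(CachedOutputModel())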