iterative / datachain

AI-data warehouse to enrich, transform and analyze unstructured data
https://docs.datachain.ai
Apache License 2.0

Make LLM map/gen first-class citizen in `dc` #580

Open shcheklein opened 1 week ago

shcheklein commented 1 week ago

Come up with higher level LLM UDF.

When analyzing data via LLMs (text, images) step by step, we end up with quite a lot of repetitive code like:

from openai import OpenAI

def extract_performance(chunk: Chunk) -> CompanyPerformance:
    client = OpenAI()

    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",
        messages=[
            {
                "role": "system",
                "content": "You are a financial analyst and expert. Your task is analyzing publicly traded company reports.",
            },
            {
                "role": "user",
                "content": f"Given a text extracted from a 10-K SEC form, return an importance number from 0 to 10 of this text for company performance analysis. Also return a sentiment as a single word - Positive, Negative, Neutral - as a predictor of how the market may react to this information. And an explanation of both - importance and sentiment.\n\n{chunk.text}",
            },
        ],
        response_format=CompanyPerformance,
    )

    message = completion.choices[0].message
    if message.parsed:
        return message.parsed
    else:
        return CompanyPerformance(importance=0, sentiment="Neutral")

dc = (
    DataChain.from_dataset("sec-chunks")
      .limit(20)
      .settings(parallel=8)
      .map(performance=extract_performance)
      .save("sec-sentiments")
)

Which is decent enough already, and not very complicated, but I wonder if we can make LLM maps/gens a first class citizen in the language:

PROMPT = "Given a text extracted from a 10-K SEC form. Return an importance ..."

dc = (
    DataChain.from_dataset("sec-chunks")
      .limit(20)
      .settings(parallel=8)
      .llm(client, CompanyPerformance, prompt=PROMPT)
      .save("sec-sentiments")
)
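
A minimal sketch of how such an `.llm()` verb could stay thin sugar over the existing `map` machinery. All names here are hypothetical (a toy chain and a stubbed client, not the actual DataChain internals):

```python
class ToyChain:
    """Stand-in for DataChain, just enough to show the desugaring."""

    def __init__(self, rows):
        self.rows = rows

    def map(self, fn):
        return ToyChain([fn(r) for r in self.rows])

    def llm(self, client, response_format, prompt):
        # .llm() is sugar: build a structured-output UDF, delegate to .map().
        def udf(text):
            return client.parse(
                prompt=f"{prompt}\n\n{text}",
                response_format=response_format,
            )

        return self.map(udf)


class CompanyPerformance:
    def __init__(self, importance, sentiment):
        self.importance = importance
        self.sentiment = sentiment


class StubClient:
    """Fake LLM client; a real one would call the provider's structured API."""

    def parse(self, prompt, response_format):
        return response_format(importance=5, sentiment="Neutral")


chain = ToyChain(["10-K excerpt one", "10-K excerpt two"])
result = chain.llm(StubClient(), CompanyPerformance, prompt="Rate this text")
```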
dmpetrov commented 5 days ago

That's a great idea!

It would help to identify an actual API - how to pass the prompt, how to swap ChatGPT for something else, etc.

Also, ideally this functionality should be implemented outside of the DC class while still being usable natively. Any ideas on how to implement this?

I'm asking because we might have multiple connectors like llm, and we cannot put everything into the DC class, which is already a fat class.

dmpetrov commented 5 days ago

Another thought - it could be implemented using outlines, which seems to have decent support for multiple LLM models.

Besides Pydantic, it has structured output for simple types, which is nice. It would be great to run queries like "how many people are in the image" using visual models and get the results directly into a table.
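
If outlines were the backend, the per-row UDF might look roughly like this. This is an untested sketch based on outlines' documented `generate.json` interface; the model name is just a placeholder:

```
import outlines
from pydantic import BaseModel

class CompanyPerformance(BaseModel):
    importance: int
    sentiment: str

# Load an open model once, outside the UDF, so workers can reuse it.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
generator = outlines.generate.json(model, CompanyPerformance)

def extract_performance(chunk):
    # outlines constrains decoding to the schema, so output always parses.
    return generator(
        f"Rate the importance (0-10) and sentiment of this 10-K text:\n{chunk.text}"
    )
```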

shcheklein commented 5 days ago

Yes, I saw outlines - but I was considering it more as a wrapper for open models (vs APIs like OpenAI, etc.)... but yes, if it has full support, including for different types of data (images, text, etc.) - then we can and should use something like that.

> Also, ideally this functionality should be implemented outside of the DC class while still being usable natively. Any ideas on how to implement this?

Agreed on DC being too overloaded. In this case, it can be a function that we pass to gen/map as a start, I guess.
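
As a first step at the library level, that could be a plain factory which turns a prompt plus a response model into a map-compatible UDF. A sketch - `llm_udf` is a hypothetical name, and the client is stubbed so the wiring is visible without hitting a real API:

```python
def llm_udf(client, response_format, prompt, system="", model="gpt-4o-2024-08-06"):
    """Return a map-compatible UDF: chunk text in, parsed model instance out."""

    def udf(text):
        message = client.complete(
            model=model,
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": f"{prompt}\n\n{text}"},
            ],
            response_format=response_format,
        )
        # Fall back to a neutral default when the provider fails to parse.
        return message if message is not None else response_format(0, "Neutral")

    return udf


class CompanyPerformance:
    def __init__(self, importance, sentiment):
        self.importance = importance
        self.sentiment = sentiment


class StubClient:
    """Fake client; a real adapter would wrap OpenAI's structured-output call."""

    def complete(self, model, messages, response_format):
        return response_format(8, "Positive")


extract = llm_udf(StubClient(), CompanyPerformance,
                  prompt="Return an importance number and a sentiment.")
# Would then be used as: dc.map(performance=extract)
row = extract("Net revenue increased 20% year over year.")
```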

When I was suggesting the llm() approach, it was primarily as a mental exercise. Can we completely, or to a certain degree, reimagine a classical dataframe-like API given that we have LLMs? Just thinking in those terms is useful, I think. But overall, I agree: if there are no clear benefits / strong ideas, then we should do it at the lib level.