Open volkfox opened 4 months ago
That's a great idea!
It seems you are also proposing to return a dict and use its keys as the output signal names. I recommend creating a separate issue for that: the two ideas are unrelated, and a dict as an output might be challenging since we already have a built-in dict.
Without this, the API should look like the one below. @volkfox please correct me if I'm missing anything.
from collections.abc import Iterator

from datachain import DataChain


def text_block(id: int, sender: Iterator[str], text: Iterator[str]) -> Iterator[tuple[int, str]]:
    # Build one conversation string per partition (one value of `id`).
    conversation = ""
    for message, author in zip(text, sender):
        conversation = "\n ".join([conversation, f"{author}: {message}"])
    yield id, conversation


chain = (
    DataChain.from_csv('gs://datachain-demo/chatbot-csv/')
    .agg(res=text_block, partition_by='id', output={"id": int, "conversation": str})
    .save()
)
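To make the intended calling convention concrete, here is a minimal pure-Python sketch of how a partitioned aggregation could drive such a function: the key arrives as a scalar and the other columns as iterators. `run_partitioned` and the sample rows are hypothetical stand-ins for illustration, not DataChain code, and the join here omits the leading separator for simplicity.

```python
from collections.abc import Iterator
from itertools import groupby
from operator import itemgetter


def text_block(id: int, sender: Iterator[str], text: Iterator[str]) -> Iterator[tuple[int, str]]:
    # Join every "sender: text" pair of one partition into a single string.
    lines = [f"{s}: {t}" for s, t in zip(sender, text)]
    yield id, "\n ".join(lines)


def run_partitioned(rows, key="id"):
    # Hypothetical stand-in for agg(partition_by="id"); not the library's code.
    ordered = sorted(rows, key=itemgetter(key))  # groupby needs sorted input
    for key_value, group in groupby(ordered, key=itemgetter(key)):
        group = list(group)  # materialized here only to feed two columns
        senders = (r["sender"] for r in group)
        texts = (r["text"] for r in group)
        # The key is passed once as a scalar, the columns as iterators.
        yield from text_block(key_value, senders, texts)


rows = [
    {"id": 1, "sender": "user", "text": "hi"},
    {"id": 2, "sender": "user", "text": "help"},
    {"id": 1, "sender": "bot", "text": "hello"},
]
result = list(run_partitioned(rows))
```

With these sample rows, `result` holds one `(id, conversation)` tuple per distinct `id`.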
Description
Here is a sample generator from the LLM tutorial:
This syntax has a number of issues:
Input column names are implicitly reused as list names. This is awkward because the argument "sender" is a list and would be better named "senders".
Passing lists from SQL materializes each partition in memory, which limits out-of-memory operations.
The aggregation key, when passed as a parameter, does not have to be a list because it is identical in every record of the partition.
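As a rough illustration of the last two points (and not the signature proposed in this issue), the sketch below shows a function that takes the key as a scalar and a column as an iterator, so values can be consumed one at a time without building a list. `longest_message` and `stream_texts` are hypothetical names invented for this example.

```python
from collections.abc import Iterator


def longest_message(id: int, texts: Iterator[str]) -> tuple[int, int]:
    # `id` is a single value; `texts` is consumed lazily, one item at a time.
    longest = 0
    for t in texts:
        longest = max(longest, len(t))
    return id, longest


def stream_texts(n: int) -> Iterator[str]:
    # Hypothetical row source: yields values without holding them all in memory.
    for i in range(n):
        yield "x" * (i % 7)


result = longest_message(42, stream_texts(100_000))
```

Because the function only iterates, the caller is free to back `texts` with a cursor or any other lazy source.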
Here is a proposed updated signature: