Semantic file diff: excel

dmpetrov commented 2 months ago

There is an opportunity to implement semantic diff for specific file formats. We can implement a couple of diffs for one specific file format (I propose Excel). Later it can be extended to more formats.

There might be multiple strategies. It would be great to implement a couple of different ones, such as:

using LLM only: ChatGPT (Claude failed working with my Excel files for some reason)
data ingestion + LLM: unstructured-io + Mistral

Users should be able to modify or extend the prompt, I guess.
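One way the prompt customization could look is sketched below. This is only an illustration: `DEFAULT_PROMPT`, `build_prompt`, and the `extra` parameter are hypothetical names, the prompt wording is an assumption, and nothing here is an existing DataChain or LLM-provider API. It only shows a default prompt that a user can replace (`base`) or extend (`extra`).

```python
from typing import Optional

# Hypothetical default prompt; the wording is an assumption, not from the issue.
DEFAULT_PROMPT = (
    "You are given two versions of a spreadsheet. "
    "Say whether they differ and describe the differences."
)


def build_prompt(
    old_text: str,
    new_text: str,
    base: str = DEFAULT_PROMPT,
    extra: Optional[str] = None,
) -> str:
    """Assemble the prompt: base instructions, optional user extension, then both files."""
    parts = [base.strip()]
    if extra:
        # User-supplied instructions are appended after the base prompt.
        parts.append(extra.strip())
    parts.append("--- OLD ---\n" + old_text.strip())
    parts.append("--- NEW ---\n" + new_text.strip())
    return "\n\n".join(parts)
```

Passing `base=...` swaps the whole instruction block, while `extra="Ignore formatting-only changes."` keeps the default and adds to it.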

Example:

ChatGPT example on my Excel files with very naive prompts:

Q: Are these files different?
A: The two files are not identical. If you need further details on the differences, I can help compare specific sections or elements of the files. Let me know how you'd like to proceed.

Q: What is the difference?
A: The differences between the two files are found in the "Youtube video" column. Specifically: In rows 2, 3, and 5, the first file has "yes" in the "Youtube video" column, while the second file has "no" in the same column.

The response looks pretty good. I wish I could apply it to a dir/bucket and see what changed since last week.

Ideally, we need a mapper function that produces an object with bool columns and a description of the difference.
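A minimal sketch of what such a mapper's output could look like, assuming each sheet has already been parsed into a list of row dicts. Note that this uses a plain cell-by-cell comparison as a stand-in for the LLM call, so it is a deterministic baseline rather than the semantic diff proposed above. `FileDiff`, `diff_rows`, and the field names are hypothetical, not part of DataChain.

```python
from dataclasses import dataclass
from typing import Dict, List

Row = Dict[str, str]  # column name -> cell value (assumed already parsed from the sheet)


@dataclass
class FileDiff:
    """Hypothetical result object: one bool column plus a text description."""
    is_different: bool
    description: str


def diff_rows(old: List[Row], new: List[Row]) -> FileDiff:
    """Compare two parsed sheets cell by cell (stand-in for an LLM-based diff)."""
    changes = []
    if len(old) != len(new):
        changes.append(f"row count changed from {len(old)} to {len(new)}")
    # Row 1 is assumed to be the header, so data rows are numbered from 2.
    for row_number, (a, b) in enumerate(zip(old, new), start=2):
        for column in sorted(set(a) | set(b)):
            if a.get(column) != b.get(column):
                changes.append(
                    f"row {row_number}, column {column!r}: "
                    f"{a.get(column)!r} -> {b.get(column)!r}"
                )
    return FileDiff(bool(changes), "; ".join(changes) or "no differences")
```

Applied per file pair across a dir/bucket, the `is_different` column would allow filtering to changed files, and `description` would carry the human-readable summary.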

dmpetrov commented 2 months ago

@dberenbaum @volkfox WDYT? How realistically can we implement this? What is the complexity?

dmpetrov commented 4 weeks ago

Closing for now. We should reopen when there is more clarity on use cases.

iterative / datachain

Semantic file diff: excel #372