There is an opportunity to implement semantic diff for specific file formats. We can implement a couple of diffs for one file format (I propose Excel). Later it can be extended to more formats.
There might be multiple strategies. It would be great to implement a couple of different ones, such as:
using an LLM only: ChatGPT (Claude failed to work with my Excel files for some reason)
data ingestion + LLM: unstructured-io + Mistral
The user should be able to modify or extend the prompt, I guess.
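To make the "modify or extend the prompt" idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `DEFAULT_PROMPT`, `DiffPrompt`, `extend` and `render` are hypothetical names, and the prompt wording is not taken from any library.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical default prompt; the wording is illustrative only.
DEFAULT_PROMPT = (
    "You are given two versions of the same {file_format} file.\n"
    "State whether they differ, then describe the differences.\n"
)


@dataclass(frozen=True)
class DiffPrompt:
    """A base prompt plus optional user-supplied extra instructions."""

    template: str = DEFAULT_PROMPT
    extra: Tuple[str, ...] = ()

    def extend(self, instruction: str) -> "DiffPrompt":
        # Return a new prompt with one more instruction appended.
        return DiffPrompt(self.template, self.extra + (instruction,))

    def render(self, file_format: str) -> str:
        # Fill in the file format, then append the user's extra instructions.
        text = self.template.format(file_format=file_format)
        return text + "".join(f"{line}\n" for line in self.extra)


prompt = DiffPrompt().extend("Ignore formatting-only changes.")
print(prompt.render("Excel"))
```

Modifying the prompt would mean passing a different `template`. Extending it keeps the default and adds instructions on top. The same rendered text could be sent to either strategy (LLM only, or ingestion + LLM).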
Example:
ChatGPT example on my Excel files with very naive prompts:
Q: Are these files different?
A: The two files are not identical. If you need further details on the differences, I can help compare specific sections or elements of the files. Let me know how you'd like to proceed.
Q: what is the difference?
A: The differences between the two files are found in the "Youtube video" column. Specifically:
In rows 2, 3, and 5, the first file has "yes" in the "Youtube video" column, while the second file has "no" in the same column.
The response looks pretty good. I wish I could apply it to a dir/bucket and see what has changed since last week.
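One way to approach the dir case is sketched below, assuming last week's files and the current files sit in two local directories with the same layout. `file_digest` and `changed_pairs` are hypothetical helper names, and a bucket would need a listing API in place of `pathlib`.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Tuple


def file_digest(path: Path) -> str:
    # Hash the file contents so byte-identical files can be skipped cheaply.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def changed_pairs(
    old_dir: Path, new_dir: Path, pattern: str = "*.xlsx"
) -> Iterator[Tuple[Path, Path]]:
    """Yield (old, new) pairs whose relative paths match but whose bytes differ."""
    for new_path in sorted(new_dir.rglob(pattern)):
        old_path = old_dir / new_path.relative_to(new_dir)
        if old_path.is_file() and file_digest(old_path) != file_digest(new_path):
            yield old_path, new_path
```

The hash check is only a cheap pre-filter: a re-saved workbook may differ in bytes without differing in content, so each yielded pair would still go through the semantic diff. Files that exist on only one side are skipped here and would need separate reporting.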
Ideally, we need a mapper function that produces an object with bool columns and a description of the difference.
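A rough sketch of what such a mapper could return is below. It makes several assumptions: both sheets are already loaded as rows of cell values with the header in the first row, `DiffResult`, `cell_changes` and `semantic_diff` are hypothetical names, and a single bool field stands in for the bool columns. The `describe` callable is where an LLM call would plug in; the default just joins the raw cell changes so the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Table = Sequence[Sequence[object]]  # rows of cell values; the first row is the header


@dataclass
class DiffResult:
    is_different: bool  # the bool column
    description: str    # human-readable summary of the difference


def cell_changes(old: Table, new: Table) -> List[str]:
    """List cell-level changes between two tables that share a header."""
    header = old[0]
    changes = []
    # Data rows start at spreadsheet row 2, because row 1 is the header.
    for row_no, (old_row, new_row) in enumerate(zip(old[1:], new[1:]), start=2):
        for name, before, after in zip(header, old_row, new_row):
            if before != after:
                changes.append(f'row {row_no}, "{name}": {before!r} -> {after!r}')
    if len(old) != len(new):
        changes.append(f"row count: {len(old)} -> {len(new)}")
    return changes


def semantic_diff(
    old: Table, new: Table, describe: Callable[[List[str]], str] = "; ".join
) -> DiffResult:
    """Map a pair of tables to a (bool, description) record."""
    changes = cell_changes(old, new)
    if not changes:
        return DiffResult(False, "")
    return DiffResult(True, describe(changes))
```

Applied to every changed file pair in a dir/bucket, this gives one record per file that can be filtered on `is_different`. Passing a `describe` that sends the change list (or the raw sheets) to a model would turn the description into the kind of summary shown in the ChatGPT example.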