iterative / datachain

AI-dataframe to enrich, transform and analyze data from cloud storages for ML training and LLM apps
https://docs.datachain.ai
Apache License 2.0

Consolidation of datachain examples with unstructured #353

Open tibor-mach opened 2 weeks ago

tibor-mach commented 2 weeks ago

That one is a bit different as it summarises the text. But otherwise it is rather similar in what it does, so I guess we could simply add one more column to the example with embeddings where we have the text summary for the entire article.

We have this example with unstructured which shows text summarisation and then this example which chunkifies text and creates embeddings. Otherwise they are very similar.

I would merge the two, deleting the example from the datachain repo and adding article text summary to the output in this example in datachain-examples.

@mattseddon @dberenbaum (you seem to have worked on the summarisation example) do you agree?

cc @shcheklein

shcheklein commented 2 weeks ago

> I would merge the two, deleting the example from the datachain repo

I think it's better to keep it in the datachain repo if possible. It's not a Jupyter notebook, and we already have tests for it. Atm datachain-examples doesn't look stable tbh: it doesn't have a good structure, linters, etc.

This example has become one of the basic ones we show to users.

tibor-mach commented 2 weeks ago

Well, in that case I would just keep both of them as they are... or keep all examples in the datachain repo as before. I don't see a clear rule by which to keep examples in one repo or the other at the moment.

shcheklein commented 2 weeks ago

> I don't see a clear rule by which to keep examples in one repo or the other at the moment.

I think we wanted to migrate notebooks initially? that was pretty much the rule

if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

> Well, in that case I would just keep both of them as they are.

could you clarify this a bit? what would be the reason to have / maintain two of them?

tibor-mach commented 2 weeks ago

> if it's a single script that can run (on a subset of data) sufficiently fast and represents a high-level use case / example - I think we can keep it in the main repo

Ok that makes sense. Then I take that back - it does make sense to only keep one script. But I would then take the one from datachain-examples and add its content to the one in datachain which does summarisation (so it will also include embeddings and all the stuff from the blogpost).

We will still have the issue of maintaining the notebooks there which often have similar code (with just more text around it). But this will at least reduce the amount of duplication, even if it does not eliminate it completely.

tibor-mach commented 2 weeks ago

I played around with the examples a bit and I am not very happy with any version which combines both the summarisation and chunking/embeddings in a single script. I thought they were more or less demonstrating the same thing, but I no longer think so.

The examples work at different levels of granularity (w.r.t. the document) and they use different datachain methods as well. The example from @dberenbaum works at the level of an individual file and uses .map to create a table where each row represents a file and a summary of its content.

The example with embeddings uses .gen to create many rows, where each row represents one chunk of a partitioned document. It doesn't make much sense to summarise chunks, and while the whole-document summary could be copy-pasted to each row generated from that document, I think that is unnecessary duplication and kind of goes against the idea that we do not copy anything extra in DataChain.
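To make the granularity difference concrete, here is a minimal plain-Python sketch. It does not call DataChain or unstructured: the `DOCS` data and the `summarize` and `chunk` helpers are hypothetical placeholders, and the list comprehensions only imitate the one-row-per-file shape of a .map step and the many-rows-per-file shape of a .gen step.

```python
from typing import Iterator

# Hypothetical stand-in data (file name -> text); not taken from either example.
DOCS = {
    "a.txt": "First sentence. Second sentence. Third sentence.",
    "b.txt": "Only one sentence.",
}


def summarize(text: str) -> str:
    """Placeholder summary: just the first sentence of the document."""
    return text.split(". ")[0].rstrip(".") + "."


def chunk(text: str) -> Iterator[str]:
    """Placeholder chunker: yields one chunk per sentence."""
    for part in text.split(". "):
        part = part.strip().rstrip(".")
        if part:
            yield part + "."


# map-style step: exactly one output row per input file -> (file, summary)
map_rows = [(name, summarize(text)) for name, text in DOCS.items()]

# gen-style step: many output rows per input file -> (file, chunk)
gen_rows = [(name, c) for name, text in DOCS.items() for c in chunk(text)]

# Attaching the whole-document summary to every chunk row repeats it
# once per chunk, which is the duplication described above.
merged_rows = [(name, c, summarize(DOCS[name])) for name, c in gen_rows]

print(len(map_rows), len(gen_rows), len(merged_rows))
```

With this toy data the map-style table has one row per file, while the gen-style table has one row per chunk, so a merged table would carry the same summary on every chunk row of a document.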

Alternatively, it could be kept as a single longer script with multiple steps but then it becomes harder to read than each of the two separate examples and I don't think it would reduce maintenance much anyway.

Both examples use unstructured and work with text, but otherwise they show different things. So I would just move the script from datachain-examples to datachain and then make the tests better so that it is more stable.

shcheklein commented 2 weeks ago

sounds good @tibor-mach !