snexus / llm-search

Querying local documents, powered by LLM
MIT License

add support for using unstructured SaaS API for handling unstructured data pre-processing #108

Open Hisma opened 2 months ago

Hisma commented 2 months ago

As we know, data used in RAG applications comes from all kinds of sources. While it's easy to work with unstructured data that is already in a text format like markdown, HTML, txt, or JSON, it is not so easy to work with PDFs and images, since those require OCR tools that vary widely in quality. On top of that, you sometimes have structured data embedded in unstructured data, for instance a table or graph inside a PDF, which requires complex solutions if you build your own code for it. But I recently stumbled upon a free learning course on deeplearning.ai that stood out to me: it teaches how to easily pre-process unstructured data from all kinds of data types, including PDFs and images, and can even extract graphs and tables.
When I started the course, I realized it uses a cloud service called "unstructured" to handle the heavy lifting, with simple API calls doing the pre-processing. Here's a link to the site, where you can sign up for an API key: https://unstructured.io/api-key-hosted

Here's a link to the course that shows how to use the API service in practice for data pre-processing: https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications/lesson/1/introduction

In essence, I think it would be a nice addition to this application if it added support for this service. In doing so, it could let a third-party service handle the pre-processing, and this application could focus on what it does best - implementing advanced RAG pipelines - without having to worry so much about keeping up with the latest open-source PDF or image parser. Of course, the end user should still have the option to use those local open-source tools if they want, similar to how you can use the OpenAI API or a local LLM. How hard do you think it would be to integrate the Unstructured API as an option?
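For anyone skimming this later, here is a minimal sketch of what a call to the hosted API looks like, assuming the multipart endpoint and header name documented in the unstructured-api README; the file path and key below are placeholders:

```python
# Minimal sketch: send one document to the hosted Unstructured API.
# Assumptions: endpoint and header name as documented in the
# unstructured-api README; "report.pdf" and the key are placeholders.
import requests

API_URL = "https://api.unstructured.io/general/v0/general"
API_KEY = "YOUR_UNSTRUCTURED_API_KEY"  # placeholder

with open("report.pdf", "rb") as f:  # placeholder document
    resp = requests.post(
        API_URL,
        headers={"unstructured-api-key": API_KEY},
        files={"files": ("report.pdf", f, "application/pdf")},
        data={"strategy": "hi_res"},  # model-based parsing, also extracts tables
    )
resp.raise_for_status()

# The response is a JSON list of document elements (Title, NarrativeText, Table, ...)
for element in resp.json():
    print(element["type"], element.get("text", "")[:80])
```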

Hisma commented 2 months ago

Looks like you don't have to use their hosted services. The project is open source, and you can host a local API: https://github.com/Unstructured-IO/unstructured-api

The pre-processing libraries are all open source as well: https://github.com/Unstructured-IO/unstructured?tab=readme-ov-file

IMO this makes the solution even more appealing. Simply leave it to the end user to have the libraries and API set up, and consider using this for doc pre-processing.
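For context, a minimal sketch of doing the pre-processing entirely locally with the open-source library (no API server at all), assuming `unstructured` is installed with the document extras; the file name is a placeholder:

```python
# Local pre-processing sketch, assuming: pip install "unstructured[all-docs]"
from unstructured.partition.auto import partition

# partition() auto-detects the file type and returns a list of typed elements
elements = partition(filename="quarterly-report.pdf")  # placeholder file

for el in elements:
    # Each element has a category (Title, NarrativeText, Table, ...) and text
    print(el.category, "-", el.text[:80])
```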

snexus commented 2 months ago

Hi @Hisma , thanks for the suggestion.

Unstructured I/O is already installed as a requirement and supported in the package as an alternative back-end in case native parsing isn't available (i.e. for anything that isn't .pdf, .md, or .docx). The project is using the core library, though, not the API version.

The last time I checked (approximately six months ago), the PDF parsing wasn't better than what could be achieved with other, much faster parsers. However, things are moving fast, and perhaps it is a good alternative at the moment. I haven't explored Unstructured I/O's OCR capabilities for parsing PDFs and images.

I will watch the course and try it out. It shouldn't be a problem to offer the user a choice between Unstructured I/O or the native parser.
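A hypothetical sketch of what exposing that choice could look like; the function names and suffix list below are illustrative only, not llm-search's actual API:

```python
# Hypothetical routing sketch: let the user prefer Unstructured, otherwise
# fall back to it only when no native parser covers the file type.
from pathlib import Path

# Suffixes parsed natively, per the comment above
NATIVE_SUFFIXES = {".pdf", ".md", ".docx"}

def parse_document(path: str, prefer_unstructured: bool = False) -> str:
    """Route a file to the native parser or to the Unstructured library."""
    suffix = Path(path).suffix.lower()
    if prefer_unstructured or suffix not in NATIVE_SUFFIXES:
        from unstructured.partition.auto import partition
        return "\n\n".join(el.text for el in partition(filename=path))
    return parse_natively(path)  # hypothetical native-parser entry point

def parse_natively(path: str) -> str:
    # Stand-in for the project's existing .pdf/.md/.docx parsing code
    raise NotImplementedError
```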

Hisma commented 2 months ago

Thank you! I wasn't aware it was already an option. I just recently stumbled on the course (which was created very recently) and was impressed with the unstructured library's capabilities, particularly around handling images and PDFs that contain nested structured data like tables. I don't recall whether llmsearch handles that kind of data well in its current form, but it's a feature that would be useful for various industries.

I appreciate that you're willing to watch the course and see whether it's something that can enhance your application. Let us know what you think!

snexus commented 1 month ago

Sorry for the delay. I watched the course and tried some of the approaches mentioned there. Advanced, model-based methods for PDF parsing are definitely an improvement, especially for documents with tables.

However, on a consumer GPU the speed is a few orders of magnitude slower (it took me 4 minutes to parse a 10-page PDF), which makes it impractical for large document bases.

These methods might still be useful, however, if there is ever a feature for in-memory (online) processing of one or two documents fetched directly from the internet.
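For reference, a sketch of the kind of model-based call discussed above, assuming the open-source library's `hi_res` strategy; the file name is a placeholder:

```python
# Model-based ("hi_res") PDF parsing sketch using the open-source library.
# This is the slow-but-better-for-tables path; a GPU helps but it is still
# far slower than the lightweight parsers.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="paper-with-tables.pdf",  # placeholder path
    strategy="hi_res",                 # layout-model-based parsing
    infer_table_structure=True,        # keep table structure in element metadata
)

# Table elements carry an HTML rendering of the recovered table structure
for el in elements:
    if el.category == "Table":
        print(el.metadata.text_as_html)
```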

Hisma commented 1 month ago

No problem! This was obviously a "nice to have" type of enhancement, especially since I wanted to see what could be done to address the specific point you mentioned: working with documents that contain tables, which are commonly encountered in financial and scientific documentation.
What I would like to do is look at what other model-based PDF parsing options are out there, to find a good balance of performance vs. quality in this area. Thanks for looking into this!

Hisma commented 1 month ago

I'll leave this open as I research different approaches.

snexus commented 1 month ago

Great! Happy to look into other approaches; agreed, let's leave it open. Quality parsing of complex PDFs remains a holy grail of RAG.