gordonwatts / snowmass-chat

Experiments exploring the US Snowmass Process documents using LLM
Apache License 2.0

Use unstructured library to parse all PDFs? #16

Open gordonwatts opened 1 year ago

gordonwatts commented 1 year ago

The unstructured library, which we use to parse PDFs downloaded over HTTPS, is not used by the arXiv downloader. From reading through it, though, it looks much more capable and might even be able to extract tables. It would be very cool to work that into the workflow, but is it worth it?

gordonwatts commented 1 year ago

One good thing would be to split each PDF by page, so there is one document per page. That way the page number and the ref number could go into the metadata, and you could cite them when the answer comes back.
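A rough sketch of what the per-page split could look like. Everything named here (`PageDocument`, `pages_to_documents`, `format_citation`, and the `ref` / `page` metadata keys) is made up for illustration and is not what the repo does today. The PyMuPDF part is imported lazily and only uses `fitz.open` and `page.get_text()`; the rest is plain Python.

```python
from dataclasses import dataclass, field


@dataclass
class PageDocument:
    """One page of a PDF, plus the metadata needed to cite it."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts, ref):
    """Turn a list of per-page strings into one document per page.

    `ref` is whatever identifier the workflow uses for the source paper
    (hypothetical here). Page numbers are 1-based. Pages with no text are
    skipped, but numbering still follows the position in the original PDF.
    """
    docs = []
    for page_number, text in enumerate(page_texts, start=1):
        if not text.strip():
            continue
        docs.append(
            PageDocument(text=text, metadata={"ref": ref, "page": page_number})
        )
    return docs


def extract_page_texts(pdf_path):
    """Read the plain text of each page with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the helpers above work without it

    with fitz.open(pdf_path) as pdf:
        return [page.get_text() for page in pdf]


def format_citation(doc):
    """Build a short citation string from a page document's metadata."""
    return f"[{doc.metadata['ref']}, p. {doc.metadata['page']}]"
```

With something like this, `pages_to_documents(extract_page_texts(path), ref)` would give the list to embed, and `format_citation` could be applied to whichever page documents the retriever returns alongside the answer.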

gordonwatts commented 9 months ago

From Copilot:

Both PyMuPDF and unstructured[local-inference] have their strengths and are used for different purposes.

PyMuPDF, also known as fitz, is a Python binding for the PDF processing library MuPDF. It is excellent for extracting text, images, and metadata from PDF files, which seems to be your immediate need. It also supports various other features like PDF modification, encryption, etc.

On the other hand, unstructured[local-inference] is an installation extra of the Unstructured library, which is a toolkit for working with unstructured data. The extra pulls in the dependencies needed to run document-layout models locally, which is what lets the library pick out elements such as titles and tables in a PDF before the text is fed to a vector database and a Language Model.

For scientific documents, PyMuPDF should be sufficient to extract the text. However, if the documents have complex structures, tables, or require understanding of the context, you might need more advanced Natural Language Processing (NLP) tools.

In conclusion, you might end up using both: PyMuPDF for the initial extraction of text from PDFs, and unstructured[local-inference] for documents whose layout or tables need further processing before the data is fed to your model.