second-opinion-ai / second-opinion

Implement PDF to TEI XML transformation feature for enhanced data structuring #25

Closed by branhoff 4 months ago

branhoff commented 4 months ago

The paper "RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture" outlines a rough pipeline for transforming data into a form usable for RAG and for fine-tuning an LLM.

The goal of this feature is to build and integrate a system that transforms PDF documents into structured JSON. PDFs are often complex and unstructured, so a structured form makes the extracted data easier to analyze, store, and manipulate. The feature needs to extract not only the text but also the document's inherent structure (e.g., sections, subsections, tables, and figures), even though PDFs vary significantly in layout and formatting.

Objectives:

  1. Implement algorithms or leverage existing machine learning libraries to extract both the text and its logical structure (sections, titles, tables, etc.) from PDF documents.
  2. Convert the extracted content into a structured, easily queryable JSON format that retains the document's original structure and formatting cues.
  3. Ensure the extraction process is accurate and efficient, capable of processing a large volume of PDF files without significant loss of fidelity in the conversion.

branhoff commented 4 months ago

In the description for this feature, I set the objective as converting PDFs to JSON. However, to use GROBID as the paper suggests, you first need to convert the PDFs to TEI .xml files and then convert those into .json files. Converting the .xml files into .json files is trickier than I was expecting because we need to extract the useful information, not simply do a 1:1 translation.
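To illustrate why this is not a 1:1 translation, here is a minimal sketch that pulls only the title and the section headings/paragraphs out of a TEI document and ignores everything else. The element layout (`titleStmt/title`, `body/div/head`, `body/div/p`) and the `SAMPLE_TEI` string are assumptions based on what GROBID output typically looks like, and `tei_to_dict` is a hypothetical helper name, not something that exists in this repo.

```python
import json
import xml.etree.ElementTree as ET

# TEI documents declare this default namespace on the root element.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Tiny hand-written stand-in for a GROBID result (assumed layout).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample Paper</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First <ref type="bibr">[1]</ref> paragraph.</p></div>
  </body></text>
</TEI>"""


def _flat_text(element):
    """Join all nested text of an element and normalize whitespace."""
    return " ".join("".join(element.itertext()).split())


def tei_to_dict(tei_xml):
    """Keep only the title and the body sections; drop all other markup."""
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    sections = []
    for div in root.iterfind(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        sections.append({
            "heading": _flat_text(head) if head is not None else None,
            "paragraphs": [_flat_text(p) for p in div.iterfind("tei:p", TEI_NS)],
        })
    return {
        "title": _flat_text(title) if title is not None else None,
        "sections": sections,
    }


if __name__ == "__main__":
    print(json.dumps(tei_to_dict(SAMPLE_TEI), indent=2))
```

Inline elements such as `<ref>` are flattened into the paragraph text here. Whether citations, tables, and figures should instead be kept as separate fields is exactly the kind of decision that makes this step harder than a mechanical conversion.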

I think the TEI-to-JSON conversion is best handled as a new feature. I will create a feature that covers that functionality and restrict this feature to just converting PDFs into TEI files.
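For the narrowed scope (PDF to TEI), a rough sketch of the call could look like the following. It assumes a GROBID server is already running at `http://localhost:8070` and uses its `/api/processFulltextDocument` endpoint, which takes the PDF as a multipart form field named `input`. The helper names `build_multipart` and `pdf_to_tei` and the timeout value are hypothetical choices for illustration, and only the standard library is used to avoid adding a dependency.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field, filename, data):
    """Encode one file as a multipart/form-data body; return (body, content_type)."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    return body, f"multipart/form-data; boundary={boundary}"


def pdf_to_tei(pdf_path, tei_path, grobid_url="http://localhost:8070"):
    """Send one PDF to a running GROBID server and save the TEI XML response."""
    pdf = Path(pdf_path)
    body, content_type = build_multipart("input", pdf.name, pdf.read_bytes())
    request = urllib.request.Request(
        f"{grobid_url}/api/processFulltextDocument",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumed timeout: full-text processing of long PDFs can be slow.
    with urllib.request.urlopen(request, timeout=300) as response:
        Path(tei_path).write_bytes(response.read())
```

Processing a large volume of PDFs would mean calling `pdf_to_tei` once per file, e.g. over `Path("pdfs").glob("*.pdf")`, where the `pdfs` directory is a placeholder.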