Closed by branhoff 4 months ago
In the description for this feature, I set the objective of converting PDFs to JSON. However, to use GROBID as the paper suggests, you first need to convert the PDFs to TEI .xml files and then convert those into .json files. Converting the .xml files into .json files is trickier than I expected because we need to extract the useful information, not simply do a 1:1 translation.
I think this enhancement is best handled as a new feature. I will create a feature that covers the TEI-to-JSON conversion and restrict this feature to converting PDFs into TEI files.
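To illustrate why the TEI-to-JSON step is an extraction and not a 1:1 translation, here is a minimal sketch using only the Python standard library. It assumes the TEI follows the layout GROBID typically produces (a `titleStmt` title, an `abstract` under `profileDesc`, and body `div` elements with a `head` and `p` children). The function name `tei_to_record`, the helper `_text`, the sample document, and the output field names are all hypothetical choices for illustration, not something defined in this issue or in the paper.

```python
import json
import xml.etree.ElementTree as ET

# TEI elements live in this XML namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _text(element):
    """Return all text inside an element with whitespace collapsed."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def tei_to_record(tei_xml):
    """Pull title, abstract, and body sections out of a TEI document.

    Hypothetical helper: the output keys are an assumption, chosen to show
    that only selected parts of the XML are kept.
    """
    root = ET.fromstring(tei_xml)
    title = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)

    sections = []
    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        sections.append({
            "heading": _text(div.find("tei:head", TEI_NS)),
            "paragraphs": [_text(p) for p in div.findall("tei:p", TEI_NS)],
        })

    return {
        "title": _text(title),
        "abstract": _text(abstract),
        "sections": sections,
    }


# Tiny hand-written sample, assumed to resemble GROBID output.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc><titleStmt><title>Sample Paper</title></titleStmt></fileDesc>
    <profileDesc><abstract><p>Short abstract.</p></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First <ref>[1]</ref> paragraph.</p></div>
  </body></text>
</TEI>
"""

if __name__ == "__main__":
    print(json.dumps(tei_to_record(SAMPLE_TEI), indent=2))
```

Layout details such as coordinates, page breaks, and reference markup are dropped on purpose, which is the part that makes this harder than a generic XML-to-JSON conversion. Real GROBID output will likely need extra handling for nested sections, tables, and figures.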
The paper "RAG vs Fine-tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture" outlines a rough structure for transforming data into a usable state for RAG and fine-tuning enhancements of an LLM.
The goal of this feature is to develop and integrate a robust system that efficiently transforms PDF documents into structured JSON format. This transformation process is crucial for facilitating the analysis, storage, and manipulation of data extracted from PDF files, which are often complex and unstructured. The feature aims to address the challenges associated with extracting not only the textual content but also the inherent structure (e.g., sections, subsections, tables, and figures) from PDF documents, which vary significantly in layout and formatting.
Objectives: