This feature converts the TEI/XML files produced by GROBID into structured JSONL. The conversion must preserve both the detailed content and the structural organization of each document: metadata (titles, authors, abstracts), section headings, bibliographic references, and the overall narrative flow.
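To make the target format concrete, here is a minimal sketch of what JSONL output could look like: one JSON object per line, written in document order so the narrative flow survives. The field names (`type`, `path`, `title`, `authors`, `text`) and the helper names are assumptions for illustration, not an agreed schema.

```python
import io
import json

def write_jsonl(records, stream):
    """Write one JSON object per line, in the order given."""
    for record in records:
        stream.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(stream):
    """Parse a JSONL stream back into a list of dicts."""
    return [json.loads(line) for line in stream if line.strip()]

# Hypothetical records: field names are illustrative, not a fixed schema.
records = [
    {"type": "metadata", "title": "Example paper", "authors": ["A. Author"]},
    {
        "type": "section",
        "path": ["Methods", "Data collection"],  # parent heading, then own heading
        "text": "First paragraph.\nSecond paragraph.",
    },
]

buffer = io.StringIO()
write_jsonl(records, buffer)
buffer.seek(0)
assert read_jsonl(buffer) == records  # round trip keeps order and content
```

Newlines inside a paragraph are escaped by `json.dumps`, so each record still occupies exactly one line.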
Acceptance Criteria:
The conversion must preserve the hierarchy of sections, subsections, and their associated content in the JSONL output.
Essential document metadata, including document titles, author details, abstracts, and bibliographic information, must be accurately extracted and represented in the JSONL output.
The JSONL files must be usable as input to a Guidance framework-based question generation process, which relies on their structure and content to produce precise, coherent, and contextually relevant questions for educational or assessment purposes.
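The criteria above could be approached along the following lines. This is a minimal sketch using only the Python standard library. It assumes the input uses the TEI namespace and the usual TEI element names (`teiHeader`, `titleStmt`, `persName`, `abstract`, `body`, `div`, `head`, `listBibl`, `biblStruct`). The function names and the record fields (`type`, `path`, `text`, and so on) are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # TEI namespace
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # the xml:id attribute

def _text(el):
    """Collapse all text inside an element into one normalized string."""
    return " ".join("".join(el.itertext()).split()) if el is not None else ""

def _walk(div, path):
    """Yield one record per <div>, carrying the chain of parent headings."""
    head = _text(div.find("tei:head", NS))
    here = path + [head] if head else path
    paragraphs = [_text(p) for p in div.findall("tei:p", NS)]
    yield {"type": "section", "path": here, "text": "\n".join(paragraphs)}
    for child in div.findall("tei:div", NS):  # nested subsections, if any
        yield from _walk(child, here)

def tei_to_records(xml_string):
    """Return metadata, section, and reference records in document order."""
    root = ET.fromstring(xml_string)
    header = root.find("tei:teiHeader", NS)
    if header is None:
        raise ValueError("no teiHeader element found")
    authors = [
        " ".join(_text(part) for part in pers)  # forename(s) then surname
        for pers in header.findall(".//tei:author/tei:persName", NS)
    ]
    records = [{
        "type": "metadata",
        "title": _text(header.find(".//tei:titleStmt/tei:title", NS)),
        "authors": authors,
        "abstract": _text(header.find(".//tei:abstract", NS)),
    }]
    body = root.find(".//tei:body", NS)
    for div in (body.findall("tei:div", NS) if body is not None else []):
        records.extend(_walk(div, []))
    for bibl in root.findall(".//tei:listBibl/tei:biblStruct", NS):
        records.append({
            "type": "reference",
            "id": bibl.get(XML_ID),
            "title": _text(bibl.find(".//tei:title", NS)),
        })
    return records

def records_to_jsonl(records):
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)
```

Calling `records_to_jsonl(tei_to_records(xml_string))` yields the JSONL text for one document. If the input contains only flat `<div>` elements, every `path` holds a single heading; recovering subsection nesting from heading numbers would need extra logic that this sketch does not attempt.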
As a follow-up to https://github.com/second-opinion-ai/second-opinion/issues/25, we need to convert our TEI/XML files into JSONL files that can be passed to an LLM to generate Q&A pairs.