instructlab / sdg

Python library for Synthetic Data Generation
Apache License 2.0

context-aware chunking #271

Open ktam3 opened 1 week ago

ktam3 commented 1 week ago

Feature Overview (aka Goal Summary)

An elevator pitch (value statement) that describes the feature clearly and concisely. Complete during New status.

Converting a document with mixed elements (e.g., a PDF with tables, images, text, image captions, or multiple text columns per page, as in research papers) to Markdown requires identifying each element and processing it so that the Markdown representation is correct (e.g., a multi-page table rendered as a single Markdown table, or a two-column page rendered as paragraphs in the correct reading order).

To achieve this, we must start working on context-aware chunking of the ingested documents and augment each element with the additional metadata needed to identify its source, its location within the source, and so on.
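One way to picture the metadata described above is as a small record attached to each chunk. This is a minimal sketch only: the `Chunk` class and its field names (`source`, `page`, `element_type`, `heading_path`) are assumptions made for illustration, not part of the SDG codebase or of any conversion tool.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Chunk:
    """Hypothetical chunk record; all field names are illustrative assumptions."""
    text: str
    source: str          # file the chunk was extracted from
    page: int            # 1-based page number within the source
    element_type: str    # e.g. "paragraph", "table", "caption"
    heading_path: list[str] = field(default_factory=list)  # enclosing section titles


chunk = Chunk(
    text="| Year | Revenue |\n|---|---|\n| 2023 | 10 |",
    source="report.pdf",
    page=4,
    element_type="table",
    heading_path=["Results", "Financials"],
)

# A plain dict is convenient for serializing alongside generated data.
record = asdict(chunk)
```

Keeping the location fields separate from the text means a downstream step could trace a generated sample back to the page and section it came from.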

Goals (aka expected user outcomes)

The observable functionality that the user now has due to receiving this feature. Include the anticipated primary user type/persona and which existing features will be expanded. Complete during New status.

This feature adopts a context-aware chunking tool to preprocess documents in the ingestion pipeline. Its output will be used by SDG and by other downstream steps.

Requirements (aka Acceptance Criteria)

A list of specific needs or objectives that the feature must deliver to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, and usability. Initial completion during Refinement status.

Done - Checklist (mandatory)

bbrowning commented 1 week ago

A clarifying question here: Docling is a tool for converting PDFs to Markdown (or other similar formats). Integrating Docling would mean we could now ingest PDF files instead of just Markdown files. Or, a user could pre-process PDFs with Docling and store the generated Markdown files for use with InstructLab SDG. So, Docling helps us turn PDFs into Markdown, which can then be fed as input into the existing SDG process.

So, is this issue just about expanding the types of input documents SDG can handle by using Docling to intelligently convert files like PDFs into Markdown? Or is it about changing our chunking strategy (the way we split documents into multiple smaller pieces to feed our models) so that it is more aware of the context of the documents, instead of just splitting on Markdown boundaries with overlap? Those are, I believe, two orthogonal things that may be conflated in this one issue.
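To make the two chunking styles in the question concrete, here is a rough sketch of each: fixed-size windows with overlap, and splitting at Markdown headings. Both functions are hypothetical helpers written for this discussion; they are not the splitter SDG actually uses.

```python
import re


def split_with_overlap(words, size, overlap):
    """Fixed-size windows over a word list; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the input
    return chunks


def split_on_headings(markdown):
    """Split Markdown so that every chunk starts at an ATX ('#') heading line."""
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


windows = split_with_overlap("a b c d e f g".split(), size=4, overlap=2)
sections = split_on_headings("# A\nalpha\n\n## B\nbeta\n")
```

The window splitter ignores document structure entirely, while the heading splitter keeps each section together with its title. Neither looks at what the text means, which is the gap a context-aware approach would try to close.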

aakankshaduggal commented 1 week ago

Yes, @bbrowning, the goal here is twofold: First, to enable users to easily convert hefty PDFs into the expected markdown format using Docling. This allows us to support a wider range of input formats, like PDFs, in addition to markdown files. Second, we aim to enhance our chunking strategy by implementing semantic chunking for the referenced documents in the qna.yaml, ensuring that chunks are more contextually meaningful rather than relying on simple markdown boundaries with overlap.
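As a rough illustration of the second goal, semantic chunking can be sketched as grouping consecutive sentences while they stay similar to each other. The sketch below is an assumption-laden toy: it uses word-count vectors as a stand-in for a real embedding model, and the function names and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever adjacent sentences fall below `threshold` similarity."""
    chunks = []
    for sentence in sentences:
        if chunks and cosine(bag_of_words(chunks[-1][-1]), bag_of_words(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Docling converts PDF files to Markdown.",
    "Markdown files are the input to SDG.",
    "Tables can span multiple pages.",
    "Tables that span pages should become one table.",
]
groups = semantic_chunks(sentences)
```

With this input, the first two sentences end up in one chunk and the last two in another, because the chunk boundary falls where adjacent sentences share no vocabulary. A real implementation would swap `bag_of_words` for sentence embeddings and would likely cap the chunk size as well.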

bbrowning commented 1 week ago

Thanks @aakankshaduggal for that clarification!

noelo commented 6 days ago

@aakankshaduggal, for what it's worth, I did a quick POC using the semantic-chunkers component. Code is here (WIP).