Knowledge doc ingestion

aakankshaduggal commented 1 month ago

nathan-weinberg commented 1 month ago

cc @juliadenham

nathan-weinberg commented 1 month ago

Is this related? https://github.com/instructlab/dev-docs/pull/120 cc @makelinux

nathan-weinberg commented 1 month ago

This one from @jjasghar also seems related? https://github.com/instructlab/dev-docs/pull/106

nathan-weinberg commented 1 month ago

One more https://github.com/instructlab/dev-docs/pull/64

relyt0925 commented 1 month ago

@aakankshaduggal (trying to envision the overall flow in my head): I almost view this as proposing two independent yet related enhancements. One is the ability to define references to "documents" in a variety of ways rather than just through git references. The other is about new document formats and how they would be ingested.

So do you envision that a user will still declaratively define "pointers" in their taxonomy to the backing doc storage, similar to what is done today for knowledge, as in the following example:

document:
  repo: https://github.com/relyt0925/rbc-knowledge
  commit: 99dae176de4927940aee4faaeb0f645b3ee4582b
  patterns:
    - pdf_chunk*.md

However, this "declarative definition" is now more flexible in the sense that it no longer has to be just repo, commit, and patterns. It could be something like a file path within the base of a taxonomy, which could look something like this:

document:
  local_directory: documents/docchunks/ 
  patterns:
    - pdf_chunk0.md

Then, in ilab data generate, when I am processing the leaf node, would the SDG process look in a local path relative to the "taxonomy base" path for the documents to use in SDG?
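To make the question concrete, here is a minimal sketch of how that lookup might work. It is not taken from the SDG code base: the function name `resolve_local_documents` and the dict shape are hypothetical, and `local_directory` is the key proposed in this comment, not an existing schema field.

```python
from pathlib import Path


def resolve_local_documents(taxonomy_base, document):
    """Return the files matched by document["patterns"] under
    taxonomy_base / document["local_directory"] (hypothetical schema)."""
    base = Path(taxonomy_base).resolve()
    doc_dir = (base / document["local_directory"]).resolve()

    # Refuse directories that escape the taxonomy base (e.g. "../elsewhere").
    try:
        doc_dir.relative_to(base)
    except ValueError:
        raise ValueError(f"{doc_dir} is outside the taxonomy base {base}")

    matches = set()
    for pattern in document.get("patterns", []):
        # Path.glob handles shell-style wildcards such as "pdf_chunk*.md".
        matches.update(p for p in doc_dir.glob(pattern) if p.is_file())
    return sorted(matches)
```

Rejecting paths outside the base is one possible design choice; it keeps a leaf node from pulling in arbitrary files on the machine running generation.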

(Scoping this comment to the first enhancement, which is really a document-type-independent topic.) Are there more specifics on the general number of formats that we want to introduce? Do we have specifics on what that document section enhancement would look like? I ask about the other formats to see if we are bringing in formats that introduce implicit dependencies (for example, an S3 bucket, where the schema would then need a flexible way for the user to define how they want to interact with the COS bucket, which could differ between environments).

relyt0925 commented 1 month ago

Then 2: the document type enhancement

First question: would it also be accurate to say that as we add new document types (independent of the ways we reference them), we are still going to keep the declarative nature of the taxonomy, where a user explicitly references the document in the taxonomy section? SDG would then handle determining the document's type and whether it needs to be processed by docling and chunked. It would then produce the chunks (in the example of a 3 MB PDF file, about 250 md chunks are produced) and ensure those are processed as the "set" of documents for SDG? The same would apply if multiple PDF files were defined?

I am curious whether you envision things remaining in that flow, versus what I would call a "pre-processing" flow where users have to explicitly use the tooling to convert the PDF docs as a prerequisite step to setting up a taxonomy, then create a knowledge repo (that would only ever contain markdown documents), and then create a leaf node that points to the markdown documents only. Does the difference I am describing make sense at a high level?

So basically in option 1, which I think is what we are after, SDG would see, as it's parsing the leaf node, something like:

document:
  local_directory: documents/pdfs/ 
  patterns:
    - pdf1.pdf

Then it would know in processing: OK, this doc is type PDF, so first I need to convert the PDF document to markdown chunks. Let me automatically do that. Then: now I know all these chunks are the full set of "documents" I am running for the leaf node. Let me take that and run it for the leaf node, and now we are off to the races with the same flow we have currently. Same idea for docx files or any other file type we add.
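A rough sketch of that option 1 dispatch, under stated assumptions: the names `build_document_set` and `read_markdown` are hypothetical, and the PDF converter is passed in as a plain callable because this sketch does not assume anything about docling's actual API.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# A converter takes a file path and returns markdown chunks.
Converter = Callable[[Path], List[str]]


def read_markdown(path: Path) -> List[str]:
    """Markdown needs no conversion: the file is a single chunk."""
    return [path.read_text(encoding="utf-8")]


def build_document_set(paths: Iterable[Path],
                       converters: Dict[str, Converter]) -> List[str]:
    """Pick a converter per file by extension and collect every chunk
    into the one "set" of documents used for the leaf node."""
    chunks: List[str] = []
    for path in paths:
        suffix = Path(path).suffix.lower()
        converter = converters.get(suffix)
        if converter is None:
            raise ValueError(f"unsupported document type: {suffix or path}")
        chunks.extend(converter(Path(path)))
    return chunks
```

With a registry like `{".md": read_markdown, ".pdf": some_pdf_converter}`, adding docx or another type would mean registering one more entry, while the rest of the flow stays as it is today.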

instructlab / dev-docs

Knowledge doc ingestion #148