nickthecook / archyve

How to process an image chunk #137

Open mattlindsey opened 6 days ago

mattlindsey commented 6 days ago

Now that a vision model can be specified in the settings and Archyve can ingest a jpg document into a single chunk, I think that I need some guidance on what to do with it next.

@oxaroky02 said I should "use Setting.vision_model during parsing to get a model and use that with the LLM client API helper to ask for a description via the #image method (See spec/lib/llm_clients/ollama/request_helper_spec.rb line 94)", which sounds good. But I am unclear on which field in the Chunks table the description should be stored in so that it gets embedded properly and the entity gets created properly (if the knowledge graph is enabled).
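
From that suggestion, I imagine the parsing side looking roughly like this. Only `Setting.vision_model` and the `#image` helper come from the advice above; the client constructor and arguments are my guesses:

```ruby
# Rough sketch only -- the constructor, argument order, and the way the
# image bytes are read are all assumptions, not the actual Archyve API.
model  = Setting.vision_model
client = LlmClients::Ollama::Client.new(model: model) # assumed constructor

# Ask the vision model for a textual description of the chunk's image.
description = client.image("Describe this image.", chunk.document.file.download)
```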

I have started taking a stab at this, but I could use a little help with the Chunks table. I am also wondering whether we need a field in the Chunks table to indicate the 'type' of a chunk, so the jobs know how to process it. In this case we have an image chunk, which will be processed differently than text, so how do we indicate that? A sketch of what I was picturing is below.
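
For the 'type' question, a conventional Rails enum on `chunks` is what I had in mind; this is just a sketch, and the column name and values are made up:

```ruby
# Hypothetical migration -- column name and values are illustrative only.
class AddKindToChunks < ActiveRecord::Migration[7.1]
  def change
    add_column :chunks, :kind, :integer, default: 0, null: false
  end
end

# Jobs could then branch on chunk.kind (chunk.text? / chunk.image?).
class Chunk < ApplicationRecord
  enum :kind, { text: 0, image: 1 }
end
```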

oxaroky02 commented 6 days ago

Hello @mattlindsey. Also, @nickthecook, let me know if I'm on the right track here.

The current ingest flow takes either a web link or an uploaded document and then runs it through "document chunking". This makes sense when the web page or document yields textual content in some format.

When the link/document is an image (or audio, video, ...), we need to introduce a flow that can track the media, transform it into textual content, and then run that through the chunking.

See #136, where we just separated the current (Fetch | Upload) -> Chunk flow out of the document controller into Mediator#ingest under app/services.

The next step we're planning is to convert that into a (Fetch | Upload) -> Convert? -> Chunk flow, where the optional conversion step will detect content that is not text and convert it to text for supported formats.
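
Roughly, something like this; everything here except `Mediator#ingest` is a placeholder name:

```ruby
# Sketch of the planned flow -- all helper names are placeholders.
class Mediator
  def ingest
    raw = fetch_or_upload                 # existing (Fetch | Upload) step
    converter = converter_for(@document)  # nil when the content is already text
    text = converter ? converter.convert(raw) : raw
    chunk(text)                           # existing chunking step
  end

  private

  def converter_for(document)
    ImageConverter.new if document.content_type&.start_with?("image/")
  end
end
```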

Once this extra step is in place, this is where the work you started for images would come in: instead of image -> chunking, the image handling would be a "converter" that uses the vision model to produce text, which would then go through the chunking separately.
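
The converter would then be the only piece that knows about the vision model. A minimal sketch, again with assumed names:

```ruby
# Hypothetical image converter; only Setting.vision_model and the #image
# helper come from the earlier suggestion -- the rest is assumed.
class ImageConverter
  def initialize(model: Setting.vision_model)
    @client = LlmClients::Ollama::Client.new(model: model) # assumed constructor
  end

  # Returns plain text that the normal chunking flow can consume.
  def convert(image_bytes)
    @client.image("Describe this image in detail.", image_bytes)
  end
end
```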

Let me know if I can clarify this; I may have missed some detail in my head that didn't get transcribed above. 😄

mattlindsey commented 6 days ago

Hi @oxaroky02. Sounds good to me. It seems like you should work on these next steps, but let me know if I can help! :)