soumik12345 / medrag-multi-modal

0 stars 0 forks source link

feat(loader): implement ensemble of image loaders #10

Closed mratanusarkar closed 2 days ago

mratanusarkar commented 4 days ago

Problem

Just like in #7 we need an ensemble of image loaders. The current image loader doesn't work for all images! specially the diagrams embedded in the PDFs.

Discussionns

Two notable feedback on this besides the commit suggestion:

  1. pymupdf4llm is not the only library that gives us images, right? That means there should be other corresponding image loaders, especially those corresponding to marker.

  2. ImageLoader classes should load only images corresponding to the pages and not the text as well in the weave dataset to avoid redundant storage.

class PyMuPDF4LLMImageLoader(PyMuPDF4LLMTextLoader):

_Originally posted by @soumik12345 in https://github.com/soumik12345/medrag-multi-modal/pull/9#discussion_r1805238112_

I deleted the existing TextLoader to break it into the BaseTextLoader and the Library specfic classes. On doing so, the other Image and TextImage loaders started erroring out. Since the original text loader was PyMuPDF based, I replaced the deleted TextLoader with PyMuPDF4LLMTextLoader.

Your older ImageLoaders were using TextLoader as base the classe, and since the current PR scope was limited to Text only, I left them as is.

Your suggestion hints towards writing an ensemble of ImageLoaders, just like the current PR is doing the same for text, I suggest we take that to the next PR, dedicated to ImageLoaders... keeping the current PR limited to it's scope and prevent it getting bigger.

What do you think?

_Originally posted by @mratanusarkar in https://github.com/soumik12345/medrag-multi-modal/pull/9#discussion_r1805296269_

Solution

convert the current load_image.py into three different text loaders:

mratanusarkar commented 3 days ago

another tool fine: MinerU

soumik12345 commented 3 days ago

another tool fine: MinerU

Let's park it right now for the PoC.