Just like in #7 we need an ensemble of image loaders. The current image loader doesn't work for all images! specially the diagrams embedded in the PDFs.
Discussionns
Two notable feedback on this besides the commit suggestion:
pymupdf4llm is not the only library that gives us images, right? That means there should be other corresponding image loaders, especially those corresponding to marker.
ImageLoader classes should load only images corresponding to the pages and not the text as well in the weave dataset to avoid redundant storage.
class PyMuPDF4LLMImageLoader(PyMuPDF4LLMTextLoader):
I deleted the existing TextLoader to break it into the BaseTextLoader and the Library specfic classes. On doing so, the other Image and TextImage loaders started erroring out. Since the original text loader was PyMuPDF based, I replaced the deleted TextLoader with PyMuPDF4LLMTextLoader.
Your older ImageLoaders were using TextLoader as base the classe, and since the current PR scope was limited to Text only, I left them as is.
Your suggestion hints towards writing an ensemble of ImageLoaders, just like the current PR is doing the same for text, I suggest we take that to the next PR, dedicated to ImageLoaders... keeping the current PR limited to it's scope and prevent it getting bigger.
Problem
Just like in #7 we need an ensemble of image loaders. The current image loader doesn't work for all images! specially the diagrams embedded in the PDFs.
Discussionns
_Originally posted by @soumik12345 in https://github.com/soumik12345/medrag-multi-modal/pull/9#discussion_r1805238112_
_Originally posted by @mratanusarkar in https://github.com/soumik12345/medrag-multi-modal/pull/9#discussion_r1805296269_
Solution
convert the current load_image.py into three different text loaders: