llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0

markdown file support #77

Closed hannahPhys closed 10 months ago

hannahPhys commented 12 months ago

Does library.add_files currently support .md files? With my folder of PDFs and Markdown files, only the PDFs show up in the library output.

turnham commented 12 months ago

Hi Hannah, We do not currently support Markdown files. The list of supported files can be found here: https://github.com/llmware-ai/llmware/blob/0bf704661c543e1d34138d72a4b41e13691f7da7/llmware/parsers.py#L180

But we're happy to add this to our feature list. I've turned this issue into an enhancement request.

MuhammadNizamani commented 12 months ago

@turnham Can you inform me if this issue will be straightforward to address, or if it's more complex like the VectorDB issue? I'm eager to contribute to this repository.

MuhammadNizamani commented 12 months ago

@turnham Can I use Google's MakerSuite LLM?

turnham commented 12 months ago

The work involved here would be adding local, Python-based Markdown parsing support (i.e., not requiring connectivity to any particular external service).

Different Markdown processors vary somewhat in the syntax they support, so the first step would be identifying and vetting the best Python Markdown processor that handles a broad range of syntax, including older and newer Markdown elements. I'm sure there are many Python Markdown parsers to investigate.

And then the work would be about updating the Parser APIs to support the creation of blocks, adding good error handling, and building up a good test suite of Markdown test documents that cover a broad range of syntax (ideally all possible Markdown tags/elements).
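To make the "creation of blocks" step concrete, here is a rough sketch of splitting a Markdown document into heading-delimited blocks using only the standard library. It is an illustration under assumptions, not llmware's Parser API: the function name `split_markdown_blocks` and the `heading`/`text` block fields are hypothetical, and it only recognizes ATX-style (`#`) headings and fenced code blocks, which is far short of the full syntax coverage described above.

```python
import re
from typing import Dict, List

# Simplified ATX heading pattern: 1-6 '#' characters, whitespace, then the title.
# Assumption: setext headings (underlined with === or ---) are not handled.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown_blocks(text: str) -> List[Dict[str, str]]:
    """Split Markdown text into blocks, one per heading (hypothetical helper)."""
    blocks: List[Dict[str, str]] = []
    heading = ""            # heading of the block being collected
    body: List[str] = []    # lines collected for the current block
    in_fence = False        # True while inside a ``` or ~~~ fenced code block

    def flush(current_heading: str, current_body: List[str]) -> None:
        joined = "\n".join(current_body).strip()
        if current_heading or joined:
            blocks.append({"heading": current_heading, "text": joined})

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```") or stripped.startswith("~~~"):
            # Toggle fence state so '#' lines inside code are not read as headings.
            in_fence = not in_fence
            body.append(line)
            continue
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(heading, body)
            heading = match.group(2)
            body = []
        else:
            body.append(line)

    flush(heading, body)
    return blocks
```

A test suite along the lines suggested above could feed documents with tables, nested lists, and inline HTML through a function like this and check that no content is dropped between blocks.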

dahifi commented 11 months ago

As someone who uses MD files for PKM, this would be a huge boon.

LlamaHub has this as their implementation. I'm going to see if I can find other examples.

And honestly, why not just treat it as a text file?

doberst commented 11 months ago

@hannahPhys & @dahifi - thanks for your feedback on Markdown documents. We have implemented it as you recommended, i.e., treating .md files as .txt files. Support is in the new version that we released today, so please feel free to pull from the repo or run a fresh pip install llmware==0.1.9. Try it out and keep the feedback coming. Thanks for your engagement with the llmware community!
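For readers wondering what "treating .md as .txt" amounts to in practice, the idea can be sketched as routing both extensions through the same plain-text reader. This is only an illustration of the approach: the extension set `TEXT_LIKE_EXTENSIONS` and the helper `load_text_like_files` are hypothetical names, not llmware's actual code.

```python
from pathlib import Path
from typing import Dict

# Hypothetical extension table: .md is simply added next to .txt.
TEXT_LIKE_EXTENSIONS = {".txt", ".md"}


def load_text_like_files(folder: str) -> Dict[str, str]:
    """Read every .txt/.md file in a folder as plain text, keyed by file name."""
    docs: Dict[str, str] = {}
    for path in sorted(Path(folder).iterdir()):
        # Markdown markup is kept verbatim; no Markdown-specific parsing happens.
        if path.is_file() and path.suffix.lower() in TEXT_LIKE_EXTENSIONS:
            docs[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return docs
```

The trade-off is that headings, links, and code fences stay in the text as literal markup, which is usually acceptable for retrieval but loses the document structure a dedicated parser would capture.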