pinecone-io / canopy

Retrieval Augmented Generation (RAG) framework and context engine powered by Pinecone
https://www.pinecone.io/
Apache License 2.0
962 stars 119 forks

[Feature] :grey_question: Support for html/pdf pages and YT videos ? #165

Open adriens opened 11 months ago

adriens commented 11 months ago

Is this your first time submitting a feature request?

Describe the feature

Hi, I see in the README that canopy currently supports local files like parquet. Is there any kind of DocumentLoader for online content like:

... and does it support local pdf import ?

Describe alternatives you've considered

Who will this benefit?

I guess almost any lazy person who wants to give canopy a try

Are you interested in contributing this feature?

No response

Anything else?

No response

strelkon commented 11 months ago

Would it be possible to use an existing Pinecone index for Canopy? I built mine from PDF files using llama-index, and since the free version of Pinecone does not support more than one index, it would be great to re-use the existing one...

UPD: I found in the docs that this is not possible. Before that, I had tried to bypass the limitation by first creating a Pinecone index through Canopy and then filling it externally: the server started and ran without issues, but an error was thrown when I tried to chat. So I converted all the .pdf files in the directory to .txt files and upserted them through Canopy.
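For anyone following the same workaround, here is a minimal sketch of the step after the PDF-to-text conversion: collecting the .txt files into a single JSONL file. The record fields (id, text, source, metadata) are an assumption based on the document schema described in the Canopy README, and the function name txt_dir_to_jsonl is made up for illustration. The PDF text extraction itself is not shown.

```python
import json
from pathlib import Path


def txt_dir_to_jsonl(txt_dir, out_path):
    """Write one JSON record per non-empty .txt file; return the number written.

    Assumption: each record uses the fields id, text, source and metadata,
    as described in the Canopy README. Adjust if your version differs.
    """
    records = []
    for path in sorted(Path(txt_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty extractions, e.g. scanned PDFs without a text layer
        records.append({
            "id": path.stem,       # file name without extension as the document id
            "text": text,
            "source": path.name,   # keep the original file name for traceability
            "metadata": {},
        })
    with open(out_path, "w", encoding="utf-8") as fh:
        for rec in records:
            fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return len(records)
```

The resulting file could then be handed to Canopy's upsert step, assuming your Canopy version accepts JSONL input as the README describes.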

pashpashpash commented 10 months ago

PDF is a must have

NB-123 commented 9 months ago

Yes, please include PDF in Canopy

kowshik24 commented 9 months ago

@adriens @NB-123 @pashpashpash @strelkon Check out this library: PineconePDFExtractor. It accepts a list of PDF files and converts them to your desired format, so that you can upsert the data through pinecone-canopy's FastAPI endpoint: v1/context/upsert
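As a rough illustration of calling that endpoint, the sketch below only builds the HTTP request, using the Python standard library. Several things here are assumptions rather than confirmed API details: the base URL http://localhost:8000, the request body shape ({"documents": [...]}), and the helper name build_upsert_request. Check the Canopy server's API docs for the exact schema before relying on it.

```python
import json
import urllib.request


def build_upsert_request(documents, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the v1/context/upsert endpoint.

    Assumptions: the server listens on base_url and expects a JSON body of
    the form {"documents": [...]}; both are illustrative, not confirmed.
    """
    payload = {"documents": documents}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/context/upsert",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [{"id": "doc1", "text": "Extracted PDF text", "source": "doc1.pdf", "metadata": {}}]
req = build_upsert_request(docs)
# To actually send it against a running server:
# urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it easy to inspect the payload before anything is written to the index.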

adriens commented 9 months ago

Thanks a lot for pointing out this library @kowshik24 :pray: