run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Website scrapers comprehensive ingestion #6517

Closed by jon-chuang 11 months ago

jon-chuang commented 1 year ago

Feature Description

Websites contain many kinds of mixed media: images, but also PDFs, slides, and YouTube links with transcripts available. For comprehensiveness, a website scraper should ingest all of the linked media under a web domain, not just the page text.
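To make the request concrete, here is a rough sketch of the first step: collecting the links on a page and grouping them by media type so each group can be handed to a suitable loader. It uses only the Python standard library. The names `LinkCollector`, `classify`, and `group_links` are hypothetical and are not part of LlamaIndex; the extension lists and the "same domain plus YouTube" rule are assumptions about what "under a web domain" should mean.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg"}
SLIDE_EXTS = {".ppt", ".pptx"}


class LinkCollector(HTMLParser):
    """Collects href/src targets from <a>, <img>, and <iframe> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        elif tag in ("img", "iframe") and attr_map.get("src"):
            self.links.append(attr_map["src"])


def classify(url):
    """Guess a media type from the host and the file extension (assumed rules)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host in ("youtu.be", "youtube.com") or host.endswith(".youtube.com"):
        return "youtube"
    ext = os.path.splitext(parsed.path)[1].lower()
    if ext == ".pdf":
        return "pdf"
    if ext in SLIDE_EXTS:
        return "slides"
    if ext in IMAGE_EXTS:
        return "image"
    return "page"


def group_links(html, base_url):
    """Group a page's links by media type, keeping same-domain links and YouTube."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    groups = {}
    for href in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        kind = classify(url)
        # Skip off-domain resources, except YouTube videos linked from the site.
        if kind != "youtube" and urlparse(url).netloc != base_host:
            continue
        groups.setdefault(kind, []).append(url)
    return groups
```

Each group would then need its own ingestion path (PDF parsing, transcript retrieval, image handling), and the `page` group would be crawled recursively; that part is left out here.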

Reason

No response

Value of Feature

No response

Disiok commented 1 year ago

Definitely agreed! Supporting multi-media documents is a big next step that we are working on.

dosubot[bot] commented 11 months ago

Hi, @jon-chuang! I'm Dosu, and I'm here to help the LlamaIndex team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you made a feature request for a website scraper that can handle ingestion of various types of media, with an emphasis on comprehensive ingestion for a web domain. Disiok has commented, agreeing with the request and mentioning that supporting multi-media documents is a big next step they are working on.

Before we close this issue, we wanted to check with you whether it is still relevant to the latest version of the LlamaIndex repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution and we look forward to hearing from you soon!