microsoft / PubSec-Info-Assistant

Information Assistant, built with Azure OpenAI Service, Industry Accelerator
MIT License
297 stars, 630 forks

URL/Website content extraction #373

Open nkidambi opened 9 months ago

nkidambi commented 9 months ago

Feature request: Many federal customers have public documents (PDFs) and websites (including FAQs) that they would like to search using Info-Assistant.

Additional details: Support crawling and extracting content from a URL or website, recursing up to a configurable depth. Also support filtering out certain URLs, such as forms and pages that call APIs (an office locator, for example), and/or certain domains.
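
To make the request concrete, here is a rough sketch of what depth-limited crawling with URL and domain filters could look like. It is not code from this repo and uses only the Python standard library; the names `crawl`, `is_allowed`, `fetch_html`, `max_depth`, `allowed_domains`, and `exclude_patterns` are all hypothetical, and a real implementation would also need robots.txt handling, rate limiting, and support for non-HTML content.

```python
import re
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkAndTextParser(HTMLParser):
    """Collects href targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def fetch_html(url):
    """Default fetcher (assumption): plain HTTP(S) GET decoded as text."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def is_allowed(url, allowed_domains, exclude_patterns):
    """Apply the scheme, domain, and regex filters to one candidate URL."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if allowed_domains and parsed.hostname not in allowed_domains:
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


def crawl(start_url, max_depth=2, allowed_domains=(), exclude_patterns=(),
          fetch=fetch_html):
    """Breadth-first crawl from start_url; returns {url: extracted_text}.

    The start URL itself is always fetched; filters apply to followed links.
    """
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkAndTextParser()
        parser.feed(html)
        pages[url] = " ".join(parser.text_parts)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            if link not in seen and is_allowed(link, allowed_domains, exclude_patterns):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

For example, `crawl("https://example.gov/", max_depth=1, allowed_domains={"example.gov"}, exclude_patterns=[r"/forms/"])` would fetch the start page plus the pages it links to directly, skipping other domains and any URL containing `/forms/`. The `fetch` parameter is injectable so the crawler can be exercised against canned pages without network access.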

dmitri012 commented 9 months ago

I can solve document loading myself by uploading files to the appropriate blob container. Website content extraction needs more specification of the supported formats (html, txt, json, etc.). Ideally, a ready-made scraper script or function would be provided. In my case I have access to the database and can extract content via SQL queries, but the supported output format is not clear yet.

dayland commented 8 months ago

@dmitri012, the supported document formats are available at https://github.com/microsoft/PubSec-Info-Assistant/blob/main/docs/features/features.md#supported-document-types

jdnuckolls commented 4 months ago

I'd really like to see this take priority, as it is a requirement of just about every customer my team works with. Usually, when we tell them this feature is not available in this repo, they use a different repo that already has it, and they miss out on all the great work that has gone into this one.